[vox-tech] Another Perl CSV question - long lines = segfault
Bill Kendrick
vox-tech@lists.lugod.org
Sat, 18 Oct 2003 12:01:42 -0700
Thanks to the folks who helped me 'fix' the CSV file the other day.
I'm now stumbling over another problem, where Perl segfaults on long
lines. If I skip CSV lines longer than 3500 characters, it goes on its
merry way. Lines of around 4000 characters or more cause Perl itself to die
(segfault) when I try to parse them using the following code (from the
"Perl Cookbook"):
sub parse_csv
{
    my $text   = shift;
    my @fields = ();
    my $field;

    while ($text =~ m{
        # Either some non-quote/non-comma text:
        ( [^"',]+ )

        # Or...
        |

        # ...a double-quoted field:
        "            # field's opening quote; don't save this
        (            # now a field is either:
          (?: [^"]   # non-quotes
            |        # or
            ""       # adjacent quote pairs
          )*         # any number
        )
        "            # field's closing quote; don't save this either
    }gx)
    {
        if (defined $1) { $field = $1; }
        else            { ($field = $2) =~ s/""/"/g; }
        push @fields, $field;
    }

    return @fields;
}
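For what it's worth, on short made-up lines it behaves exactly as expected,
quotes and all:

my $line   = 'plain,"quoted, with comma","embedded ""quotes"" here"';
my @fields = parse_csv($line);
# @fields is now:
#   ('plain', 'quoted, with comma', 'embedded "quotes" here')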
At first, I thought it was dying due to character combos (commas, quotes,
double-quotes, or what-have-you), but then I started examining line lengths
just before running the parse subroutine.
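The skip I mentioned above is nothing fancy, by the way; it's just a length
check on each line before calling the parser, roughly like this:

while (my $line = <F>)
{
    chomp $line;
    next if (length($line) > 3500);   # skipping these avoids the segfault
    my @fields = parse_csv($line);
    # ... process @fields here ...
}

That obviously isn't a real fix, though, since it throws away the long records.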
Looking at an strace, I can see that when I do this:
open(F, $somefile);
while (<F>)
{
    ...
}
...it's only reading 4096 bytes at a time.
Is there a way to increase that buffer, so that I can ensure it reads,
say, up to 20,000 bytes? (The longest line in the file seems to be around
14,000 bytes; at least, that's what "wc -L" tells me.)
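One experiment I may try: pulling a big chunk in with sysread() instead of
<F>, just to see whether asking for more than 4096 bytes per call makes any
difference. Totally untested sketch:

open(F, $somefile) or die "can't open $somefile: $!";
my $chunk = '';
sysread(F, $chunk, 20_000);   # ask the OS for up to 20,000 bytes in one read()
# Note: this ignores line boundaries completely, so I'd still have to split
# $chunk into lines myself before handing the pieces to parse_csv().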
In the meantime, I'll Google, since the books I'm looking at don't have
good examples of this. ;)
Thx!
-bill!
--
bill@newbreedsoftware.com Got kids? Get Tux Paint!
http://newbreedsoftware.com/bill/ http://newbreedsoftware.com/tuxpaint/