[vox-tech] Perl - reading fixed width formats

Ted Deppner ted at psyber.com
Thu Aug 13 09:20:19 PDT 2009


On Thu, Aug 13, 2009 at 7:50 AM, Peter Jay Salzman <p at dirac.org> wrote:
> I need to read in ~25000 files whose lines are in a fixed file format:

Hiya Pete.  I'll take a stab at this because I love perl.  I had a
background of using perl to analyze web logs.

> The naive method is looping over each file, incrementally reading 4 chars, 2
> chars, 5 chars, etc.  However, the slowest part of all this (I would think)
> would be the constant disk access for each field for each file.

By default, perl buffers all I/O for the typical calls.  You can use
sysread(), which bypasses both the unix and perl buffering, but that
should be considered a special case.  Using perl to do bog-standard
line-oriented input, and then using substr() to chop up each line,
gives you nice buffering and deterministic line handling, which you
don't get with sysread().
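
As a minimal sketch (the file name and field widths here are made up,
not Pete's actual format), buffered line-oriented reading with
substr() looks like this:

#!/usr/bin/perl
use strict;
use warnings;

# Plain line-oriented reads; perl buffers the underlying I/O for us.
open(my $fh, '<', 'input.dat') or die "can't open input.dat: $!";
while (my $line = <$fh>) {
    chomp $line;
    my $field1 = substr($line, 0, 4);   # chars 0-3
    my $field2 = substr($line, 4, 2);   # chars 4-5
    my $field3 = substr($line, 6, 5);   # chars 6-10
    # ... do something with the fields ...
}
close $fh;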

You could also slurp a whole file at a time into a perl array and
iterate over that input as above.  The only trade-off is memory
consumption.  In your case, I would suspect no significant difference
between the two.
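
A slurp version of the same loop might look like this (again just a
sketch with a made-up file name):

open(my $fh, '<', 'input.dat') or die "can't open input.dat: $!";
my @lines = <$fh>;    # whole file pulled into memory at once
close $fh;

for my $line (@lines) {
    chomp $line;
    # ... same substr() handling as above ...
}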

> And once everything is in an array, what's the most efficient way to take a
> string representing one line of a file, and break it down into fields?  Is
> there anything faster than substr()?

For the actual line handling I usually come down to three methods.
The first is substr(), which works if the input is 100% guaranteed to
be perfect with no errors, but keeping track of all those offsets and
lengths can be a chore, and I've never yet seen a perfect input file.
The second is split(), which works well if your fields are all single
words.  If any field can hold names like "Ted Deppner" or "Peter Jay
Salzman", it will mess things up, as the sketch below shows.
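
To make the split() pitfall concrete (the sample lines here are
invented):

my $good = '2009-01-01 12:00:00 Ted 1,000.00';
my ($date, $time, $name, $amount) = split ' ', $good;   # 4 fields, as expected

my $bad = '2009-01-01 12:00:00 Ted Deppner 1,000.00';
my @fields = split ' ', $bad;    # 5 fields -- "Ted" and "Deppner" end up separate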

The last method, which I prefer for complex inputs, is to use a
regular expression to do the splitting.  You define each field by
regex and capture the resulting text.  I don't want to insult your
intelligence by assuming, but for everyone else's benefit, here is an
example.

Given input like:
2009-01-01 12:00:00 Ted Deppner              1,000.00 1234567890 d 1943

We'll assume the first two fields are a date and a time, the third a
name, then an amount, a transaction id, a type flag (deposit or
withdrawal), and a final check id.

This would chop out the fields (assuming the input line is in $line
and has already been chomp()ed):

my ($date, $time, $name, $amount, $id, $type, $check) =
    $line =~ m/^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (.*) ([0-9,.]+) (\d+) ([wd]) (\d+)$/;

The .* should be alarming to most people, but because the regex is so
tightly anchored at the end, it won't be allowed to gobble up too
much.  With a simple if/elsif chain you can add a stateful check for
whether you're expecting the first or the second line format you
described, and also print any line that fails to match so you can
adjust your regex (I do this a lot).
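
Putting that together, a sketch of the loop (file name invented, and
the second line format stubbed out since I don't know it):

open(my $fh, '<', 'input.dat') or die "can't open input.dat: $!";
while (my $line = <$fh>) {
    chomp $line;
    if (my ($date, $time, $name, $amount, $id, $type, $check) =
        $line =~ m/^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (.*) ([0-9,.]+) (\d+) ([wd]) (\d+)$/) {
        # ... handle a transaction line ...
    }
    elsif ($line =~ m/^\s*$/) {
        # blank line, ignore
    }
    else {
        warn "unmatched line: $line\n";   # tune the regexes until nothing lands here
    }
}
close $fh;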

Hope this is helpful.  Sounds like an interesting project.

PS: when I did the web logs, the main problem there was disk read
time.  I found a tremendous improvement by gzipping the input; the
disk and CPU overhead of reading compressed files is far less than
that of straight disk reads.  Another trick was a bash-ism, i.e. zcat
input.gz | tee >(grep something > output1) | tee >(grep somethingelse
> output2) ....  My perl replaced this idiom, but I still find it
cool.  Bash is great.
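
If you want to read the gzipped files straight from perl rather than
through a shell pipeline, something like this works (file name
invented):

# Let zcat do the decompression and read its output through a pipe.
open(my $fh, '-|', 'zcat', 'input.gz') or die "can't run zcat: $!";
while (my $line = <$fh>) {
    chomp $line;
    # ... same line handling as before ...
}
close $fh;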

