[vox-tech] Matching Contents of Lists

Lango, Trevor M. TLango at tsocorp.com
Wed Jul 6 15:08:38 PDT 2005


I have two lists, not necessarily of the same length.  List #1 has two
columns.  List #2 has one column.  I would like to do the following:

Scan list #1 line by line.  If a match for column #1 in list #1 is found
in list #2, extract the matching lines and put them in a new list (#3).
Otherwise, leave the contents of lists #1 and #2 as they are.
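
For illustration, the procedure above could be sketched roughly as follows. This is written in Python rather than Perl purely as an example, and it works on in-memory lists instead of files. The names `split_matches` and `key` are hypothetical: `key` stands in for whatever comparison rule is used and defaults to exact comparison here. The sketch also assumes that "extract" means removing the matched lines from both lists, which the description does not state outright.

```python
from collections import defaultdict

def split_matches(list1, list2, key=lambda s: s):
    """Split two lists into unmatched remainders and a list of matches.

    list1 holds (column1, column2) pairs; list2 holds single strings.
    key maps an entry to a comparable value, or None to mean "never match".
    Returns (remaining1, remaining2, list3).
    """
    # Index list #2 by key once, so each later lookup is a dictionary hit.
    index = defaultdict(list)
    for item in list2:
        k = key(item)
        if k is not None:
            index[k].append(item)

    remaining1, list3, used = [], [], set()
    for row in list1:
        k = key(row[0])
        if k is not None and k in index:
            # Keep the list #1 row together with every list #2 entry it matched.
            list3.append((row, list(index[k])))
            used.add(k)
        else:
            remaining1.append(row)

    # Anything in list #2 that was never matched stays where it was.
    remaining2 = [item for item in list2 if key(item) not in used]
    return remaining1, remaining2, list3
```

Building the index first keeps the work close to one pass over each list, which matters once the lists reach tens of thousands of entries.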

If I expected the contents of the first column of each list to match
exactly (character for character), this would be a simple task with C++
or the like.  However, the contents will not necessarily be perfectly
identical.  I do believe they are close enough to identical, though, to
use pattern matching via Perl or the like.  Personally this is difficult
for me (as a Perl noob): I know how to scan through a file for a
pre-determined pattern, but I don't understand how to scan through a file
for a pattern that is essentially given by a line in another file.  I
have not found anything in my reading of Perl documentation that
explains how to read a file and use its contents as an argument for the
pattern to search for in another file (suggestions on excellent Perl doc
sources appreciated also!).
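
As a minimal sketch of the general idea of using lines of one file as patterns against another, here is one way it might look in Python (used only for illustration). The function name `lines_matching_any` is hypothetical. Each pattern line is escaped so that it is treated as literal text; in Perl, the comparable idea is interpolating a variable into a match, with `\Q...\E` for the escaping.

```python
import re

def lines_matching_any(pattern_lines, target_lines):
    """Return the target lines that contain any pattern line as literal text."""
    # Strip newlines and skip blank lines, which would otherwise match everything.
    patterns = [p.strip() for p in pattern_lines if p.strip()]
    if not patterns:
        return []
    # re.escape makes each line literal text rather than regex syntax.
    combined = re.compile("|".join(re.escape(p) for p in patterns))
    return [line for line in target_lines if combined.search(line)]
```

Both arguments can be open file objects, since iterating over a file yields its lines. Note that plain substring matching like this is looser than the match rule defined further down (for example, `T47` is found inside `T470`), so it only shows the mechanics of reading patterns from a file.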

This is what the contents of the lists may look like:

TALL0047A
TAL0047A
TAL047A
TAL47A
TA0047A
TA047A
TA47A
T0047A
T047A
T47A
T0047
T047
T47

Examples of matching:

TALL0047A    TALL047A    match
TALL0047A    TAL0047A    not a match
TALL0047A    TAL0470A    not a match


The contents will always be one to four alpha characters followed by one
to four numeric characters possibly followed by one or two alpha
characters.

A match is defined as both of the following criteria being met:

- The one to four digits being identical (ignoring leading zeroes)
- The first one to four letters being identical
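
The two criteria above can be sketched as a normalisation step: reduce each entry to a (letters, number) pair and compare the pairs. The example is in Python for illustration only, although the regular expression itself should carry over to Perl nearly unchanged. The names `CODE_RE`, `match_key` and `is_match` are hypothetical. Two assumptions are made: the optional trailing letters are ignored because the criteria do not mention them, and any entry that does not fit the described format is treated as matching nothing, to stay on the side of no false positives.

```python
import re

# Format described above: 1-4 letters, 1-4 digits, optionally 1-2 letters.
CODE_RE = re.compile(r"([A-Za-z]{1,4})([0-9]{1,4})([A-Za-z]{0,2})")

def match_key(code):
    """Return (letters, number) for a well-formed entry, else None."""
    m = CODE_RE.fullmatch(code.strip())
    if m is None:
        return None  # malformed entries never match anything
    prefix, digits, _suffix = m.groups()  # suffix ignored (assumption)
    return (prefix, int(digits))  # int() discards leading zeroes

def is_match(a, b):
    """True only if both entries are well formed and share the same key."""
    ka, kb = match_key(a), match_key(b)
    return ka is not None and ka == kb

print(is_match("TALL0047A", "TALL047A"))  # True
print(is_match("TALL0047A", "TAL0047A"))  # False
print(is_match("TALL0047A", "TAL0470A"))  # False
```

If the trailing letters should also have to agree, returning them as a third element of the key would make the rule stricter without changing anything else.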


It is absolutely imperative that any algorithm used does not produce
false positives: if a line is extracted as a match, it must without a
doubt actually be a match.  It is not so critical if a possible match is
passed up.  The lists will contain thousands or tens of thousands of
entries, so I am just looking for a clever way to automate as much of the
process as possible.  I expect to have to check a portion of the lists
by hand; I would simply appreciate reducing the number of lines that
have to be checked manually.  Perl seems ideal, but I'm just not savvy
enough with it (yet!) to make it work.  I know there are some Perl gurus
lurking about in LUGOD, so if any of you have a spare moment to lend this
some thought, thanks!



Thank you in advance for any suggestions!

- Trevor


