[vox-tech] Matching Contents of Lists

Micah J. Cowan micah at cowan.name
Wed Jul 6 16:07:11 PDT 2005


On Wed, Jul 06, 2005 at 03:08:38PM -0700, Lango, Trevor M.  wrote:
> I have two lists, not necessarily of the same length.  List #1 has two
> columns.  List #2 has one column.  I would like to do the following:
> 
> Scan list #1 line by line.  If a match for column #1 in list #1 is found
> in list #2, extract the matching lines and put them in a new list (#3).
> Otherwise, leave the contents of lists #1 and #2 as they are.
> 
> If I expected the contents of the first column of each list to match
> exactly (character for character) - this would be a simple task with C++
> or the like.  However, the contents will not necessarily be perfectly
> identical.  I do believe they are nearly identical enough though to use
> pattern matching via Perl or the like.  Personally this is difficult for
> me (as a Perl noob), I know how to scan through a file for a
> pre-determined pattern - I don't understand how to scan through a file
> for a pattern that is essentially given by a line in another file...?  I
> have not found anything in my reading of Perl documentation that
> explains how to read a file and use its contents as an argument for the
> pattern to search for in another file (suggestions on excellent Perl doc
> sources appreciated also!).

Perl allows you to build up a regex pattern in a scalar, and then use
that scalar to match against a string. For instance:

  @words = qw( foo bar );
  $pattern = join('|', @words);
  $line =~ $pattern; # Does $line match 'foo|bar'?

I'm not sure if this is actually the most useful mechanism for what you
want, though.

> This is what the contents of the lists may look like:
> 
> TALL0047A
> TAL0047A
> TAL047A
> TAL47A
> TA0047A
> TA047A
> TA47A
> T0047A
> T047A
> T47A
> T0047
> T047
> T47
> 
> Examples of matching:
> 
> TALL0047A    TALL047A    match
> TALL0047A    TAL0047A	    not a match
> TALL0047A    TAL0470A	    not a match
> 
> 
> The contents will always be one to four alpha characters followed by one
> to four numeric characters possibly followed by one or two alpha
> characters.
> 
> A match would be defined as the following criteria being met:
> 
> - The last one to four digits being identical (excluding leading zeroes)
> - The first one to four letters being identical

It is not entirely clear to me what role the final 'A' character has in
determining a match. However, I would recommend the following alogirthm.

First, you will need to create a comparison function, that will return
an integer less than, greater than, or equal to zero depending on
whether its first argument compares lexically less than, greater than,
or equal to its second argument. This function should match based on
your rules given above, and is meant as an analog for the cmp and <=>
operators.

Next, sort both lists. Start with an index into the first element of
list 1 ($i), and an index into the first element of list 2 ($j).

If $list_one[$i] compares equal to $list_two[$j], you have your match:
remove that element from both lists, and append it to the new one.

If $list_one[$i] compares less than $list_two[$j], then advance $i and
compare again.

If $list_one[$i] compares greater than $list_two[$j], then advance $j
and compare again.

If either $i or $j falls off the end, you are done. This algorithm
has linear complexity-- O(N), where N is the number of elements in the
longer of the two lists. Not counting the complexity involved for
whatever sort algorithm you use. If you use Perl's sort function, you'll
get linearithmic time (N log N).

Test the hell outta your code before relying upon it. :-)

HTH,
Micah


More information about the vox-tech mailing list