[vox-tech] Matching Contents of Lists

Jay Strauss me at heyjay.com
Fri Jul 8 08:28:00 PDT 2005


Micah J. Cowan wrote:
> On Wed, Jul 06, 2005 at 03:08:38PM -0700, Lango, Trevor M.  wrote:
> 
>>I have two lists, not necessarily of the same length.  List #1 has two
>>columns.  List #2 has one column.  I would like to do the following:
>>
>>Scan list #1 line by line.  If a match for column #1 in list #1 is found
>>in list #2, extract the matching lines and put them in a new list (#3).
>>Otherwise, leave the contents of lists #1 and #2 as they are.
>>
>>If I expected the contents of the first column of each list to match
>>exactly (character for character) - this would be a simple task with C++
>>or the like.  However, the contents will not necessarily be perfectly
>>identical.  I do believe they are nearly identical enough though to use
>>pattern matching via Perl or the like.  Personally this is difficult for
>>me (as a Perl noob), I know how to scan through a file for a
>>pre-determined pattern - I don't understand how to scan through a file
>>for a pattern that is essentially given by a line in another file...?  I
>>have not found anything in my reading of Perl documentation that
>>explains how to read a file and use its contents as an argument for the
>>pattern to search for in another file (suggestions on excellent Perl doc
>>sources appreciated also!).
> 
> 
> Perl allows you to build up a regex pattern in a scalar, and then use
> that scalar to match against a string. For instance:
> 
>   @words = qw( foo bar );
>   $pattern = join('|', @words);
>   $line =~ $pattern; # Does $line match 'foo|bar'?
> 
> I'm not sure if this is actually the most useful mechanism for what you
> want, though.
> 
> 
>>This is what the contents of the lists may look like:
>>
>>TALL0047A
>>TAL0047A
>>TAL047A
>>TAL47A
>>TA0047A
>>TA047A
>>TA47A
>>T0047A
>>T047A
>>T47A
>>T0047
>>T047
>>T47
>>
>>Examples of matching:
>>
>>TALL0047A    TALL047A    match
>>TALL0047A    TAL0047A	    not a match
>>TALL0047A    TAL0470A	    not a match
>>
>>
>>The contents will always be one to four alpha characters followed by one
>>to four numeric characters possibly followed by one or two alpha
>>characters.
>>
>>A match would be defined as the following criteria being met:
>>
>>- The last one to four digits being identical (excluding leading zeroes)
>>- The first one to four letters being identical
> 
> 
> It is not entirely clear to me what role the final 'A' character has in
> determining a match. However, I would recommend the following alogirthm.
> 
> First, you will need to create a comparison function, that will return
> an integer less than, greater than, or equal to zero depending on
> whether its first argument compares lexically less than, greater than,
> or equal to its second argument. This function should match based on
> your rules given above, and is meant as an analog for the cmp and <=>
> operators.
> 
> Next, sort both lists. Start with an index into the first element of
> list 1 ($i), and an index into the first element of list 2 ($j).
> 
> If $list_one[$i] compares equal to $list_two[$j], you have your match:
> remove that element from both lists, and append it to the new one.
> 
> If $list_one[$i] compares less than $list_two[$j], then advance $i and
> compare again.
> 
> If $list_one[$i] compares greater than $list_two[$j], then advance $j
> and compare again.
> 
> If either $i or $j falls off the end, you are done. This algorithm
> has linear complexity-- O(N), where N is the number of elements in the
> longer of the two lists. Not counting the complexity involved for
> whatever sort algorithm you use. If you use Perl's sort function, you'll
> get linearithmic time (N log N).
> 
> Test the hell outta your code before relying upon it. :-)
> 
> HTH,
> Micah

Won't this just match identical records, and not account for his 
matching rules?  Seems like this is just a programmatic SQL join

Jay


More information about the vox-tech mailing list