Won't this just match identical records, and not account for his
matching rules? Seems like this is just a programmatic SQL join
On Wed, Jul 06, 2005 at 03:08:38PM -0700, Lango, Trevor M. wrote:
I have two lists, not necessarily of the same length. List #1 has two
columns. List #2 has one column. I would like to do the following:
Scan list #1 line by line. If a match for column #1 in list #1 is found
in list #2, extract the matching lines and put them in a new list (#3).
Otherwise, leave the contents of lists #1 and #2 as they are.
If I expected the contents of the first column of each list to match
exactly (character for character) - this would be a simple task with C++
or the like. However, the contents will not necessarily be perfectly
identical. I do believe they are nearly identical enough though to use
pattern matching via Perl or the like. Personally this is difficult for
me (as a Perl noob), I know how to scan through a file for a
pre-determined pattern - I don't understand how to scan through a file
for a pattern that is essentially given by a line in another file...? I
have not found anything in my reading of Perl documentation that
explains how to read a file and use its contents as an argument for the
pattern to search for in another file (suggestions on excellent Perl doc
sources appreciated also!).
Perl allows you to build up a regex pattern in a scalar, and then use
that scalar to match against a string. For instance:
@words = qw( foo bar );
$pattern = join('|', @words);
$line =~ $pattern; # Does $line match 'foo|bar'?
I'm not sure if this is actually the most useful mechanism for what you
This is what the contents of the lists may look like:
Examples of matching:
TALL0047A TALL047A match
TALL0047A TAL0047A not a match
TALL0047A TAL0470A not a match
The contents will always be one to four alpha characters followed by one
to four numeric characters possibly followed by one or two alpha
A match would be defined as the following criteria being met:
- The last one to four digits being identical (excluding leading zeroes)
- The first one to four letters being identical
It is not entirely clear to me what role the final 'A' character has in
determining a match. However, I would recommend the following alogirthm.
First, you will need to create a comparison function, that will return
an integer less than, greater than, or equal to zero depending on
whether its first argument compares lexically less than, greater than,
or equal to its second argument. This function should match based on
your rules given above, and is meant as an analog for the cmp and <=>
Next, sort both lists. Start with an index into the first element of
list 1 ($i), and an index into the first element of list 2 ($j).
If $list_one[$i] compares equal to $list_two[$j], you have your match:
remove that element from both lists, and append it to the new one.
If $list_one[$i] compares less than $list_two[$j], then advance $i and
If $list_one[$i] compares greater than $list_two[$j], then advance $j
and compare again.
If either $i or $j falls off the end, you are done. This algorithm
has linear complexity-- O(N), where N is the number of elements in the
longer of the two lists. Not counting the complexity involved for
whatever sort algorithm you use. If you use Perl's sort function, you'll
get linearithmic time (N log N).
Test the hell outta your code before relying upon it. :-)