l i n u x - u s e r s - g r o u p - o f - d a v i s
L U G O D
 
Next Meeting:
September 2: Social gathering
Next Installfest:
TBD
Latest News:
Aug. 18: Discounts to "Velocity" in NY; come to tonight's "Photography" talk
Page last updated:
2005 Jul 08 12:38

The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

Report this post as spam:

(Enter your email address)
Re: [vox-tech] Matching Contents of Lists
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [vox-tech] Matching Contents of Lists



Micah J. Cowan wrote:
On Wed, Jul 06, 2005 at 03:08:38PM -0700, Lango, Trevor M.  wrote:

I have two lists, not necessarily of the same length.  List #1 has two
columns.  List #2 has one column.  I would like to do the following:

Scan list #1 line by line.  If a match for column #1 in list #1 is found
in list #2, extract the matching lines and put them in a new list (#3).
Otherwise, leave the contents of lists #1 and #2 as they are.

If I expected the contents of the first column of each list to match
exactly (character for character) - this would be a simple task with C++
or the like.  However, the contents will not necessarily be perfectly
identical.  I do believe they are nearly identical enough though to use
pattern matching via Perl or the like.  Personally this is difficult for
me (as a Perl noob), I know how to scan through a file for a
pre-determined pattern - I don't understand how to scan through a file
for a pattern that is essentially given by a line in another file...?  I
have not found anything in my reading of Perl documentation that
explains how to read a file and use its contents as an argument for the
pattern to search for in another file (suggestions on excellent Perl doc
sources appreciated also!).

Perl allows you to build up a regex pattern in a scalar, and then use
that scalar to match against a string. For instance:

  @words = qw( foo bar );
  $pattern = join('|', @words);
  $line =~ $pattern; # Does $line match 'foo|bar'?

I'm not sure if this is actually the most useful mechanism for what you
want, though.


This is what the contents of the lists may look like:

TALL0047A
TAL0047A
TAL047A
TAL47A
TA0047A
TA047A
TA47A
T0047A
T047A
T47A
T0047
T047
T47

Examples of matching:

TALL0047A    TALL047A    match
TALL0047A    TAL0047A	    not a match
TALL0047A    TAL0470A	    not a match


The contents will always be one to four alpha characters followed by one
to four numeric characters possibly followed by one or two alpha
characters.

A match would be defined as the following criteria being met:

- The last one to four digits being identical (excluding leading zeroes)
- The first one to four letters being identical

It is not entirely clear to me what role the final 'A' character has in
determining a match. However, I would recommend the following alogirthm.

First, you will need to create a comparison function, that will return
an integer less than, greater than, or equal to zero depending on
whether its first argument compares lexically less than, greater than,
or equal to its second argument. This function should match based on
your rules given above, and is meant as an analog for the cmp and <=>
operators.

Next, sort both lists. Start with an index into the first element of
list 1 ($i), and an index into the first element of list 2 ($j).

If $list_one[$i] compares equal to $list_two[$j], you have your match:
remove that element from both lists, and append it to the new one.

If $list_one[$i] compares less than $list_two[$j], then advance $i and
compare again.

If $list_one[$i] compares greater than $list_two[$j], then advance $j
and compare again.

If either $i or $j falls off the end, you are done. This algorithm
has linear complexity-- O(N), where N is the number of elements in the
longer of the two lists. Not counting the complexity involved for
whatever sort algorithm you use. If you use Perl's sort function, you'll
get linearithmic time (N log N).

Test the hell outta your code before relying upon it. :-)

HTH,
Micah
Won't this just match identical records, and not account for his matching rules? Seems like this is just a programmatic SQL join

Jay
_______________________________________________
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech



LinkedIn
LUGOD Group on LinkedIn
Sign up for LUGOD event announcements
Your email address:
facebook
LUGOD Group on Facebook
'Like' LUGOD on Facebook:

Hosting provided by:
Sunset Systems
Sunset Systems offers preconfigured Linux systems, remote system administration and custom software development.

LUGOD: Linux Users' Group of Davis
PO Box 2082, Davis, CA 95617
Contact Us

LUGOD is a 501(c)7 non-profit organization
based in Davis, California
and serving the Sacramento area.
"Linux" is a trademark of Linus Torvalds.

Sponsored in part by:
O'Reilly and Associates
For numerous book donations.