l i n u x - u s e r s - g r o u p - o f - d a v i s
L U G O D
 
Next Meeting:
April 21: Google Glass
Next Installfest:
TBD
Latest News:
Mar. 18: Google Glass at LUGOD's April meeting
Page last updated:
2005 Jul 08 08:49

The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

Report this post as spam:

(Enter your email address)
[vox-tech] Matching Contents of Lists
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[vox-tech] Matching Contents of Lists



I have two lists, not necessarily of the same length.  List #1 has two
columns.  List #2 has one column.  I would like to do the following:

Scan list #1 line by line.  If a match for column #1 in list #1 is found
in list #2, extract the matching lines and put them in a new list (#3).
Otherwise, leave the contents of lists #1 and #2 as they are.

If I expected the contents of the first column of each list to match
exactly (character for character) - this would be a simple task with C++
or the like.  However, the contents will not necessarily be perfectly
identical.  I do believe they are nearly identical enough though to use
pattern matching via Perl or the like.  Personally this is difficult for
me (as a Perl noob), I know how to scan through a file for a
pre-determined pattern - I don't understand how to scan through a file
for a pattern that is essentially given by a line in another file...?  I
have not found anything in my reading of Perl documentation that
explains how to read a file and use its contents as an argument for the
pattern to search for in another file (suggestions on excellent Perl doc
sources appreciated also!).

This is what the contents of the lists may look like:

TALL0047A
TAL0047A
TAL047A
TAL47A
TA0047A
TA047A
TA47A
T0047A
T047A
T47A
T0047
T047
T47

Examples of matching:

TALL0047A    TALL047A    match
TALL0047A    TAL0047A	    not a match
TALL0047A    TAL0470A	    not a match


The contents will always be one to four alpha characters followed by one
to four numeric characters possibly followed by one or two alpha
characters.

A match would be defined as the following criteria being met:

- The last one to four digits being identical (excluding leading zeroes)
- The first one to four letters being identical


It is absolutely imperative that any algorithm used does not produce
false positives - if a line is extracted as a match - it must without a
doubt actually be a match.  It is not so critical if a possible match is
passed up.  The lists will contain thousands or tens of thousands of
entries - just looking for a clever way to automate as much of the
process as possible.  I expect to have to check a portion of the lists
by hand - I would simply appreciate reducing the number of lines that
have to be checked manually.  Perl seems ideal - I'm just not savvy
enough with it (yet!) to make it work.  I know there are some Perl gurus
lurking about in LUGOD so if any of you have a spare moment to lend this
some thought - thanks!



Thank you in advance for any suggestions!

- Trevor

_______________________________________________
vox-tech mailing list
vox-tech@lists.lugod.org
http://lists.lugod.org/mailman/listinfo/vox-tech



LinkedIn
LUGOD Group on LinkedIn
Sign up for LUGOD event announcements
Your email address:
facebook
LUGOD Group on Facebook
'Like' LUGOD on Facebook:

Hosting provided by:
Sunset Systems
Sunset Systems offers preconfigured Linux systems, remote system administration and custom software development.

LUGOD: Linux Users' Group of Davis
PO Box 2082, Davis, CA 95617
Contact Us

LUGOD is a 501(c)7 non-profit organization
based in Davis, California
and serving the Sacramento area.
"Linux" is a trademark of Linus Torvalds.

Sponsored in part by:
O'Reilly and Associates
For numerous book donations.