Linux Users' Group of Davis

The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

Re: [vox-tech] Parsing Html

On Wed, Jun 11, 2003 at 04:00:06PM -0500, Jay Strauss wrote:
> Found HTML::TableContentParser which does some of the heavy lifting for
> playing with it now

> > http://quote.cboe.com/QuoteTable.asp?TICKER=qqq&ALL=2
> >
> > It seems like there would be a cpan thing to read in a string (html), that
> > would let me navigate.  That is, give me the third table, give me the
> > first row, give me the first table data

  Save the html into a file with wget, then feed that file as an argument
to the perl below. If you want the calls and puts broken into separate
arrays or into hashes, it should be easy from here.

  I would do a cleaner example (like pulling the page with LWP, and
storing the data into a hash) if I thought I'd get paid for it.  ;)


ps: if you want to see what each step is doing to the data, put a
"print $_;" line after it and pipe the output into less, so you can see
the null characters clearly.  This is a very simple table; I normally need
to use \00, \01, \02, etc. to mark different chunks of data, so that
after the html is gone I can identify what was what.
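
To make the marker idea a bit more concrete, here is a minimal sketch on a made-up two-row table. The sample html and the helper name mark_and_strip are invented for illustration and are not taken from the CBOE page. It marks header cells with \x01 and data cells with \x00 (the same bytes as \01 and \00), so the two kinds of cell can still be told apart once the tags are gone. The last lines rewrite the marks as visible text, which is a rough substitute for piping the output into less.

```perl
#! /usr/bin/perl -w
use strict;

# Hypothetical helper: swap closing tags for control-character marks,
# then strip every remaining tag.
sub mark_and_strip {
    my ($html) = @_;
    $html =~ s#[\r\n]##g;              # nuke return and newline
    $html =~ s#</th[^>]*>#\x01#gi;     # mark header cell stops
    $html =~ s#</td[^>]*>#\x00#gi;     # mark data cell stops
    $html =~ s#</tr[^>]*>#\n#gi;       # mark table record stops
    $html =~ s#<[^>]*>##g;             # nuke all remaining html
    return $html;
}

# Made-up sample input, only for illustration.
my $sample = "<table><tr><th>Bid</th><th>Ask</th></tr>\n"
           . "<tr><td>1.10</td><td>1.15</td></tr></table>";

my $marked = mark_and_strip($sample);

# Turn the invisible marks into readable tags for inspection.
(my $visible = $marked) =~ s#\x01#<H>#g;
$visible =~ s#\x00#<D>#g;
print $visible;
```

With more kinds of chunk you would just keep going with \x02, \x03 and so on, one mark per kind.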

#! /usr/bin/perl -w

$_ = join '', <>;                    # suck in the html

s#^.*<!--Start Options Table-->##s;  # strip before interest
s#<!--End Options Table-->.*##s;     # strip after interest
s#^.*(<table)#$1#is;                 # fine tune strip before

s#<td[^>]*?>##g;                     # nuke table data starts
s#</td[^>]*?>#\00#g;                 # mark table data stops

s#[\r\n]##g;                         # nuke return and newline
s#\s+# #g;                           # nuke multiple spaces

s#<tr[^>]*?>##g;                     # nuke table record starts
s#</tr[^>]*?>#\n#g;                  # mark table record stops

s#</?[^>]*?>##g;                     # nuke all remaining html
s#^ ##mg;                            # nuke leading spaces

foreach $line (split '\n', $_) {     # work on each table record
  @ray = split "\0", $line;          # split based on data marks
  next if (@ray != 14);              # ignore incomplete rows

  printf "%-23s %-9s %-5s %-5s %-5s %-4s %-8s " .
         "%-24s %-9s %-5s %-5s %-5s %-5s %-8s\n",
    @ray;                            # print the data nicely.
}                                    # end of per-record loop
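
For the "into hashes" variation, something along these lines might work. It is only a sketch: the column names in @cols are my guesses (the script above never names the columns), row_to_hashes is a made-up helper, and the sample record is invented. It assumes the same layout as above, 14 fields per record separated by "\0", with the first seven belonging to the call and the last seven to the put.

```perl
#! /usr/bin/perl -w
use strict;

# Assumed column names; placeholders, not taken from the real page.
my @cols = qw(name last net bid ask vol open_int);

# Hypothetical helper: turn one "\0"-separated record into a call hash
# and a put hash. Returns an empty list for incomplete rows.
sub row_to_hashes {
    my ($line) = @_;
    my $n   = scalar @cols;
    my @ray = split "\0", $line;        # split based on data marks
    return if @ray != 2 * $n;           # ignore incomplete rows

    my (%call, %put);
    @call{@cols} = @ray[0 .. $n - 1];       # first half: the call
    @put{@cols}  = @ray[$n .. 2 * $n - 1];  # second half: the put
    return (\%call, \%put);
}

# Made-up record, only for illustration.
my $line = join "\0",
    "CALL-A", "1.20", "+0.05", "1.15", "1.25", "10", "500",
    "PUT-A",  "0.80", "-0.05", "0.75", "0.85", "20", "300";

my ($call, $put) = row_to_hashes($line);
print "$call->{name} bid=$call->{bid}  $put->{name} bid=$put->{bid}\n";
```

In the real script you would call the helper inside the foreach loop in place of the printf, and push the two hash references onto separate arrays for calls and puts.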

GPG key: http://simons-clan.com/~msimons/gpg/msimons.asc
Fingerprint: 524D A726 77CB 62C9 4D56  8109 E10C 249F B7FA ACBE
