[vox-tech] Parsing Html

Mike Simons vox-tech@lists.lugod.org
Wed, 11 Jun 2003 17:54:31 -0400


On Wed, Jun 11, 2003 at 04:00:06PM -0500, Jay Strauss wrote:
> Found HTML::TableContentParser which does some of the heavy lifting for me
> playing with it now

> > http://quote.cboe.com/QuoteTable.asp?TICKER=qqq&ALL=2
> >
> > It seems like there would be a cpan thing to read in a string (html), then
> > would let me navigate.  That is, give me the third table, give me the
> > first row, give me the first table data

  Save the html into a file with wget, then feed that file as an argument
to the perl below.  If you want the calls and puts broken into separate
arrays or into hashes, it should be easy from here.

  I would do a cleaner example (like pulling the page with LWP, and
storing the data into a hash) if I thought I'd get paid for it.  ;)

    TTFN,
      Mike

ps: if you want to see what each step is doing to the data, put a
"print $_;" line after it and pipe the output into less, so you can see
the null characters clearly.  This is a very simple table; I normally need
to use \00, \01, \02, etc. to mark different chunks of data, so that
after the html is gone I can identify what was what.
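
A minimal sketch of that multi-marker idea, assuming a table that also has
<th> header cells.  The sample markup, values and variable names are made
up for illustration and are not from the CBOE page: header cells are ended
with \01 and data cells with \00, so the two kinds of record can still be
told apart once the tags are gone.

```perl
#! /usr/bin/perl -w
use strict;

# Hypothetical sample table; markup and values are invented for this sketch.
my $html = '<table><tr><th>Strike</th><th>Last</th></tr>'
         . '<tr><td>25.00</td><td>1.15</td></tr></table>';

$_ = $html;
s#<th[^>]*>##gi;                     # nuke header cell starts
s#</th[^>]*>#\01#gi;                 # mark header cell stops with \01
s#<td[^>]*>##gi;                     # nuke data cell starts
s#</td[^>]*>#\00#gi;                 # mark data cell stops with \00
s#</tr[^>]*>#\n#gi;                  # mark record stops
s#<[^>]*>##g;                        # nuke all remaining html

my (@heads, @rows);
foreach my $line (split /\n/, $_) {
  if ($line =~ /\01/) {              # a record made of header cells
    @heads = split /\01/, $line;
  } elsif ($line =~ /\00/) {         # a record made of data cells
    push @rows, [ split /\00/, $line ];
  }
}

print join(' | ', @heads), "\n";
print join(' | ', @$_), "\n" foreach @rows;
```

The markers are written \00 and \01 with a leading zero on purpose: in the
replacement half of s###, a bare \1 would be read as the first capture
group rather than as a control character.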

====
#! /usr/bin/perl -w

$_ = join '', <>;                    # suck in the html

s#^.*<!--Start Options Table-->##s;  # strip before interest
s#<!--End Options Table-->.*##s;     # strip after interest
s#^.*(<table)#$1#is;                 # fine tune strip before

s#<td[^>]*?>##g;                     # nuke table data starts
s#</td[^>]*?>#\00#g;                 # mark table data stops

s#[\r\n]##g;                         # nuke return and newline
s#\s+# #g;                           # nuke multiple spaces

s#<tr[^>]*?>##g;                     # nuke table record starts
s#</tr[^>]*?>#\n#g;                  # mark table record stops

s#</?[^>]*?>##g;                     # nuke all remaining html
s#^ ##mg;                            # nuke leading spaces

foreach $line (split '\n', $_) {     # work on each table record
  @ray = split "\0", $line;          # split based on data marks
  next if (@ray != 14);              # ignore incomplete rows

  printf "%-23s %-9s %-5s %-5s %-5s %-4s %-8s " .
         "%-24s %-9s %-5s %-5s %-5s %-5s %-8s\n",
    @ray;                            # print the data nicely.
}
====
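
Breaking each 14-field record into calls and puts could look something
like the sketch below.  It assumes the first seven fields are the call side
and the last seven the put side, which the printf layout suggests but the
post does not state.  The column names, the split_record helper and the
sample record are placeholders invented here; rename them to match the
real column headers.

```perl
#! /usr/bin/perl -w
use strict;

# Placeholder column names: guesses for illustration, not taken from the page.
my @cols = qw(name last net bid ask vol open_int);

# Turn one 14-field record into two hash references:
# fields 0..6 go into the first hash, fields 7..13 into the second.
sub split_record {
  my @ray = @_;
  return unless @ray == 14;          # ignore incomplete rows
  my (%call, %put);
  @call{@cols} = @ray[0 .. 6];
  @put{@cols}  = @ray[7 .. 13];
  return (\%call, \%put);
}

# Made-up record standing in for one @ray from the loop above.
my @ray = qw(C-name 1.10 +0.05 1.05 1.15 10 200
             P-name 0.90 -0.05 0.85 0.95 20 300);

my ($call, $put) = split_record(@ray);
printf "call %s bid %s / put %s bid %s\n",
  $call->{name}, $call->{bid}, $put->{name}, $put->{bid};
```

Inside the original foreach loop, the same call would replace the printf,
with the returned references pushed onto two arrays (one for calls, one
for puts).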

-- 
GPG key: http://simons-clan.com/~msimons/gpg/msimons.asc
Fingerprint: 524D A726 77CB 62C9 4D56  8109 E10C 249F B7FA ACBE
