[vox-tech] Parsing Html
Mike Simons
vox-tech@lists.lugod.org
Wed, 11 Jun 2003 22:15:13 -0400
On Wed, Jun 11, 2003 at 06:41:00PM -0700, Michael Wenk wrote:
> On Wednesday 11 June 2003 02:54 pm, Mike Simons wrote:
> > On Wed, Jun 11, 2003 at 04:00:06PM -0500, Jay Strauss wrote:
> > > > http://quote.cboe.com/QuoteTable.asp?TICKER=qqq&ALL=2
> >
> > save the html into a file with wget, then feed that as an argument to
> > the perl below... if you want the calls and puts broken into separate
> > arrays or into hashes it should be easy from here.
>
> Silly question, why not open the wget in the open call ? Ie:
>
> open FOOF, "wget -q -O - <url> |";
> while (<FOOF>) {
> $file .= $_;
> }
> $_ = $file;
> ...
>
> I realize using something fun like LWP would be better, but this would be the
> poor mans way of doing it I would think...
Michael,
Good question... here are some reasons I didn't put the fetch in the code.
- I normally develop a script like that via multiple passes, adding
one line of s### junk at a time, sending the output to less. Even
over DSL it takes much longer to develop if the page gets pulled for
every run... since it's not in its final form anyway it's better to have
Jay add the fetching later if he wants.
- I don't really like relying on shell commands to be there, so
since perl can do most anything itself I like using perl stuff if
available (like LWP).
- I already have sent samples of the LWP fetching method to vox-tech
along with code to use a local cache of fetched pages which makes
development much faster, but that code would overwhelm the simple
"how to parse a html table" question Jay asked and I wasn't going
to get any bonus points for including it anyway (*).
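
To make the cache idea in the last point concrete, here is a minimal sketch. It is not the code from the Debian bug list thread; the name cached_get, the MD5-based file naming, and the optional fetcher argument are all assumptions made for illustration. The default fetcher uses get() from LWP::Simple, which returns undef on failure.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper (not from the original thread): return the page body
# for $url, reading it from a file in $cache_dir when one exists and calling
# $fetch->($url) only on a cache miss.
sub cached_get {
    my ($url, $cache_dir, $fetch) = @_;

    # Default fetcher: LWP::Simple, loaded only when it is actually needed.
    $fetch ||= sub {
        require LWP::Simple;
        return LWP::Simple::get($_[0]);
    };

    unless (-d $cache_dir) {
        mkdir $cache_dir or die "cannot create $cache_dir: $!";
    }

    # One cache file per URL, named after the MD5 of the URL (an assumption).
    my $file = File::Spec->catfile($cache_dir, md5_hex($url) . '.html');

    if (-e $file) {
        open my $in, '<', $file or die "cannot read $file: $!";
        local $/;                # slurp mode: read the whole file at once
        my $body = <$in>;
        close $in;
        return $body;
    }

    my $body = $fetch->($url);
    die "fetch failed for '$url'\n" unless defined $body;

    open my $out, '>', $file or die "cannot write $file: $!";
    print {$out} $body;
    close $out or die "cannot close $file: $!";
    return $body;
}
```

With something like this, the parsing script would start with $_ = cached_get($url, 'cache'); and only the first run would touch the network. Deleting the cache file forces a fresh fetch.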
With all that said, putting in a wget much like you mentioned is a great
idea for the final version. If I had to use wget, I would do something
like:
===
open GET, "wget -q -O - '$url' |" or die "wget failed on: '$url'";
$_ = join '', <GET>;
close GET;
===
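
One caveat with a piped open: open only reports whether the child process could be started, and a non-zero exit from wget shows up when the pipe is closed. A slightly more defensive variant might look like the sketch below. The helper name slurp_command is made up for illustration, and the list form of open (Perl 5.8 and later) is used so the URL never passes through a shell.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run a command without a shell and return its
# standard output as one string. Dies if the command cannot be started
# or does not exit cleanly.
sub slurp_command {
    my @cmd = @_;

    # List-form pipe open: the arguments go straight to the program,
    # so quotes or ampersands in a URL need no shell escaping.
    open(my $pipe, '-|', @cmd) or die "cannot run '$cmd[0]': $!";
    my $out = join '', <$pipe>;

    # For a pipe, close() returns false when the command exited non-zero
    # and leaves the wait status in $?.
    close($pipe) or die "'$cmd[0]' failed (wait status $?)\n";
    return $out;
}
```

Assuming wget is on the PATH, the fetch then becomes $_ = slurp_command('wget', '-q', '-O', '-', $url); and a failed download stops the script instead of handing an empty string to the parser.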
TTFN,
Mike
*: The fancy LWP caching code is in the thread about downloading the
Debian Bug List of stupid packages like apt (which have many hundreds
of open bugs).
http://simons-clan.com/~msimons/debian/howto/dldb/
-- 
GPG key: http://simons-clan.com/~msimons/gpg/msimons.asc
Fingerprint: 524D A726 77CB 62C9 4D56 8109 E10C 249F B7FA ACBE