[vox-tech] Suggestions for cleaning up repetitive HTML tags?
Bill Kendrick
nbs at sonic.net
Wed Aug 18 10:48:58 PDT 2010
I've come across some documents that are formatted in
such a way that, when converted to HTML, they come out
something like this:
<font face="Arial">And</font> <font face="Arial">then</font>
<font face="Arial">they</font> <font face="Arial">looked</font>
or even worse:
<font face="Arial">A</font><font face="Arial">n</font><font
face="Arial">d</font>
...
I've come up with a way, using PHP's DOMDocument system, to
scrub a file clean of these, but it's very slow, and the cleanup
is basically something that could be done on a stream of text
(rather than having to worry about the document's structure).
I'm thinking of writing something in PHP or C to clean stuff
like this up, but am wondering if anyone else has any experience
and suggestions?
(And yes, I've used "htmltidy", but while that can merge _nested_
styles, e.g., "<font face="Arial"><font size=+1>" gets
combined into its own CSS style, e.g., "<span class="c123">",
it doesn't seem to be able to merge _consecutive_ styles,
as shown in the examples above. :^/ )
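
To make the "stream of text" idea concrete, here is a minimal sketch in
Python (not the PHP DOMDocument approach mentioned above). It is only an
illustration under a few assumptions: a closing </font> is merged with the
next opening <font ...> only when nothing but whitespace sits between them
and the opening tag's text is identical after collapsing whitespace. The
names FONT_TAG, normalize and merge_consecutive_fonts are made up for this
example, and it works on a whole string rather than a true stream.

```python
import re

# Matches any opening <font ...> tag or closing </font> tag.
# The capturing group makes re.split() keep the tags in its result.
FONT_TAG = re.compile(r'(<font\b[^>]*>|</font\s*>)', re.IGNORECASE)


def normalize(tag):
    """Collapse runs of whitespace (including newlines) inside a tag."""
    return ' '.join(tag.split())


def merge_consecutive_fonts(html):
    """Merge back-to-back <font> elements that have identical opening tags.

    Assumption: tags are compared as text, so attribute order and letter
    case must match for two tags to be treated as the same style.
    """
    # Even indices hold plain text, odd indices hold font tags.
    parts = FONT_TAG.split(html)
    out = []
    current = None  # normalized form of the most recent unclosed opening tag
    i = 0
    while i < len(parts):
        part = parts[i]
        if i % 2 == 0:
            out.append(part)                      # plain text, copy through
        elif part.startswith('</'):
            gap = parts[i + 1]                    # text after this closing tag
            nxt = parts[i + 2] if i + 2 < len(parts) else None
            if (current is not None
                    and nxt is not None
                    and not nxt.startswith('</')
                    and gap.strip() == ''
                    and normalize(nxt) == current):
                # Drop "</font>" and the identical "<font ...>" that follows,
                # keeping only the whitespace that was between them.
                out.append(gap)
                i += 3
                continue
            out.append(part)
            current = None
        else:
            out.append(part)
            current = normalize(part)
        i += 1
    return ''.join(out)


if __name__ == '__main__':
    sample = ('<font face="Arial">A</font><font face="Arial">n</font><font\n'
              'face="Arial">d</font>')
    print(merge_consecutive_fonts(sample))
```

Because it never builds a document tree, a single pass like this should be
much cheaper than walking a DOM, at the cost of being deliberately
conservative: runs separated by punctuation, other tags, or differently
written attributes are left alone.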
--
-bill!
Sent from my computer