[vox-tech] Suggestions for cleaning up repetitive HTML tags?

Bill Kendrick nbs at sonic.net
Wed Aug 18 10:48:58 PDT 2010


I've come across some documents that are formatted in
such a way that, when converted to HTML, they come out
something like this:

  <font face="Arial">And</font> <font face="Arial">then</font>
  <font face="Arial">they</font> <font face="Arial">looked</font>

or even worse:

  <font face="Arial">A</font><font face="Arial">n</font><font
  face="Arial">d</font>
  ...


I've come up with a way, using PHP's DOMDocument classes, to
scrub these out of a file, but it's very slow, and the job is
basically something that could be done on a stream of text
(without having to worry about the document's structure).

I'm thinking of writing something in PHP or C to clean stuff
like this up, but I'm wondering whether anyone else has any
experience or suggestions.
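
To make the "stream of text" idea concrete, here is a minimal sketch,
written in Python only for brevity (the same regular-expression approach
should carry over to PHP or C). It is an illustration, not the
DOMDocument code mentioned above. The function names are made up for
this example, and it assumes lowercase <font> tags, no ">" characters
inside attribute values, and that only text (no other tags) sits inside
each <font> element being merged.

```python
import re

# Any tag; used to collapse line breaks and runs of spaces inside tags,
# so that '<font\n  face="Arial">' compares equal to '<font face="Arial">'.
TAG = re.compile(r"<[^>]+>")

# A text-only <font ...> element, optional whitespace, then another
# <font ...> opening tag with exactly the same attribute string (\1).
ADJACENT = re.compile(r"<font\b([^>]*)>([^<]*)</font>(\s*)<font\b\1>")


def normalize_tags(html):
    """Collapse whitespace inside every tag to single spaces."""
    return TAG.sub(lambda m: " ".join(m.group(0).split()), html)


def merge_consecutive_fonts(html):
    """Merge runs of adjacent <font> elements that share identical attributes."""
    html = normalize_tags(html)
    while True:
        # Drop the '</font>' + '<font ...>' pair between two matching
        # neighbours, keeping the text and the whitespace between them.
        html, count = ADJACENT.subn(r"<font\1>\2\3", html)
        if count == 0:
            return html


if __name__ == "__main__":
    sample = '<font face="Arial">And</font> <font face="Arial">then</font>'
    print(merge_consecutive_fonts(sample))
    # prints: <font face="Arial">And then</font>
```

Each pass merges neighbouring pairs, so the loop repeats until nothing
is left to merge. Elements with different attributes (say, a different
face) are left alone, as is any <font> element that contains other tags.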

(And yes, I've tried "htmltidy", but while that can merge _nested_
styles, e.g., "<font face="Arial"><font size=+1>" gets
combined into a single CSS style such as "<span class="c123">",
it doesn't seem to be able to merge _consecutive_ styles,
as shown in the examples above. :^/ )


-- 
-bill!
Sent from my computer

