[vox-tech] Suggestions for cleaning up repetitive HTML tags?

Wed Aug 18 20:40:52 PDT 2010

On Wed, 2010-08-18 at 16:23 -0700, Bill Kendrick wrote:
> On Wed, Aug 18, 2010 at 01:29:14PM -0500, Chanoch (Ken) Bloom wrote:
> > Consider writing a SAX filter that just drops the offending <font> and
> > </font>.
> 
> Well, we want the style info to remain... there's just no reason in
> the world for the document to specify it over and over again on
> a per-word or per-character(!) basis. :)

The filter doesn't have to drop all of them, only the redundant ones.
I'm not sure how easy or hard it will be to model the state you need to
do this correctly, though. That depends on the structure of your data.
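
To give a feel for the state involved, here is a rough sketch of such a
filter. It is in Python rather than Ruby, using only the standard
library's xml.sax, and it assumes the input is well-formed XML and that
only a </font> followed immediately by a <font> with identical
attributes counts as redundant. The names FontMerger and merge_fonts
are made up for illustration; your data may well need more state than
this.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class FontMerger(XMLFilterBase):
    """Drop a </font><font ...> pair when both tags carry the same attributes."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._open = []       # attributes of the <font> elements currently open
        self._pending = None  # attributes of a </font> that is being held back

    def _flush(self):
        # Something other than a matching <font> arrived: emit the held close tag.
        if self._pending is not None:
            super().endElement("font")
            self._pending = None

    def startElement(self, name, attrs):
        if name == "font":
            wanted = dict(attrs.items())
            if self._pending == wanted:
                # Same style reopened right away: cancel both the close and the open.
                self._pending = None
                self._open.append(wanted)
                return
            self._flush()
            self._open.append(wanted)
        else:
            self._flush()
        super().startElement(name, attrs)

    def endElement(self, name):
        self._flush()
        if name == "font":
            # Hold the close tag until we see what comes next.
            self._pending = self._open.pop()
            return
        super().endElement(name)

    def characters(self, content):
        self._flush()
        super().characters(content)

    def endDocument(self):
        self._flush()
        super().endDocument()


def merge_fonts(xml_text):
    """Run xml_text through FontMerger and return the rewritten XML as a string."""
    out = io.StringIO()
    merger = FontMerger(xml.sax.make_parser())
    merger.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    merger.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


SAMPLE = ('<p><font face="Arial">Hello</font><font face="Arial">, world</font> '
          '<font face="Arial">again</font><font face="Times">!</font></p>')
print(merge_fonts(SAMPLE))
```

Note that the space between "world" and "again" is enough to stop a
merge here, since any character data flushes the held close tag.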

> 
> > Also consider using XPath, like my following example in Ruby (using the
> > Nokogiri XML library)
> 
> Ooooh.  Thanks, I'll poke at this.  (I know there's some XPath stuff
> in PHP that I know nothing about, since I've only spoken to it about
> XML via its DOMDocument stuff, so far.)

Note that part of the challenge here is dealing with the spaces that
come between the font tags -- deciding which ones should be included in
the newly-merged font tags, and which shouldn't. So for both techniques,
experimentation is in order.
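
One possible policy for those spaces, sketched in Python with the
standard library's xml.etree.ElementTree (not the Ruby/Nokogiri code
from my earlier mail): merge two sibling <font> elements with identical
attributes when the text between them is empty or whitespace-only, and
pull that whitespace into the merged element. The function name and the
absorb_whitespace switch are hypothetical, just something to experiment
with.

```python
import xml.etree.ElementTree as ET


def merge_sibling_fonts(parent, absorb_whitespace=True):
    """Merge runs of sibling <font> elements with equal attributes, in place."""
    i = 0
    while i + 1 < len(parent):
        a, b = parent[i], parent[i + 1]
        gap = a.tail or ""  # the text sitting between </font> and the next <font>
        mergeable = (
            a.tag == b.tag == "font"
            and a.attrib == b.attrib
            and (gap == "" or (absorb_whitespace and gap.isspace()))
        )
        if not mergeable:
            i += 1
            continue
        # Move the gap and b's leading text to the end of a's content.
        joined = gap + (b.text or "")
        if len(a):
            a[-1].tail = (a[-1].tail or "") + joined
        else:
            a.text = (a.text or "") + joined
        a.extend(list(b))   # carry over any child elements of b
        a.tail = b.tail
        del parent[i + 1]   # b is now folded into a; re-check a against its new neighbour
    for child in list(parent):
        merge_sibling_fonts(child, absorb_whitespace)
    return parent


SOURCE = ('<p><font size="2">Hello</font> <font size="2">world</font>, '
          '<font size="2">bye</font></p>')

root = merge_sibling_fonts(ET.fromstring(SOURCE))
print(ET.tostring(root, encoding="unicode"))
# <p><font size="2">Hello world</font>, <font size="2">bye</font></p>
```

With absorb_whitespace=False the sample comes back unchanged, which is
the other end of the trade-off.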

--Ken