[vox-tech] Suggestions for cleaning up repetitive HTML tags?

Wed Aug 18 11:29:14 PDT 2010

On Wed, 2010-08-18 at 10:48 -0700, Bill Kendrick wrote:
> I've come across some documents that are formatted in
> such a way that, when converted to HTML, they come out
> something like this:
> 
>   <font face="Arial">And</font> <font face="Arial">then</font>
>   <font face="Arial">they</font> <font face="Arial">looked</font>
> 
> or even worse:
> 
>   <font face="Arial">A</font><font face="Arial">n</font><font
>   face="Arial">d</font>
>   ...
> 
> 
> I've come up with a way, using PHP's DOMDocument system, to
> scrape a file clear of these, but it's very slow, and it's
> basically something that can be done on a stream of text
> (rather than having to worry about the document's structure).
> 
> I'm thinking of writing something in PHP or C to clean stuff
> like this up, but am wondering if anyone else has any experience
> and suggestions?
> 
> (And yes, I've used "htmltidy", but while that can merge _nested_
> styles, e.g., a "<font face="Arial"><font size=+1>" get
> combined into its own CSS style, e.g., "<span class="c123">",
> it doesn't seem to be able to merge _consecutive_ styles,
> as shown in the examples above. :^/ )

Consider writing a SAX filter that just drops the offending <font> and
</font>.

Also consider using XPath, as in the following Ruby example (using the
Nokogiri XML library):

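require 'nokogiri'

# Merge each run of sibling <font> elements into the first <font> of its parent.
def reform xml
  xml.xpath('//font[1]').each do |x|
    # Text to absorb: sibling text nodes plus the text inside later <font> siblings.
    textnodes=x.xpath('(following-sibling::text() | following-sibling::font/text())')
    # Use the raw text (y.content); y.to_s is serialized XML and would be escaped twice.
    x.content=x.content+textnodes.map{|y| y.content}.join
    textnodes.unlink
    x.xpath('following-sibling::font').unlink
  end
  xml
end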

xml=Nokogiri::XML('<test><font face="Arial">And</font> <font face="Arial">then</font></test>')
puts reform(xml).to_xml

xml=Nokogiri::XML('<test><font face="Arial">And</font> <font face="Arial">then</font> <b>More</b></test>')
puts reform(xml).to_xml

xml=Nokogiri::XML('<test><font face="Arial">And</font> <font face="Arial">then</font> More</test>')
puts reform(xml).to_xml

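# That last example probably does the wrong thing: the trailing " More"
# gets pulled into the <font>. To fix that, you might want the following
# more complicated version of the XPath, which only absorbs a text node
# when the node right after it is a <font>.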

def reform xml
  xml.xpath('//font[1]').each do |x|
    # Only absorb text nodes that are immediately followed by another <font>.
    textnodes=x.xpath('(following-sibling::text()[following-sibling::node()[1][self::font]] | following-sibling::font/text())')
    # Use the raw text (y.content); y.to_s is serialized XML and would be escaped twice.
    x.content=x.content+textnodes.map{|y| y.content}.join
    textnodes.unlink
    x.xpath('following-sibling::font').unlink
  end
  xml
end

# More hackage may be necessary depending on the exact structure of your data.