[vox-tech] Suggestions for cleaning up repetitive HTML tags?
Chanoch (Ken) Bloom
kbloom at gmail.com
Wed Aug 18 11:29:14 PDT 2010
On Wed, 2010-08-18 at 10:48 -0700, Bill Kendrick wrote:
> I've come across some documents that are formatted in
> such a way that, when converted to HTML, they come out
> something like this:
>
> <font face="Arial">And</font> <font face="Arial">then</font>
> <font face="Arial">they</font> <font face="Arial">looked</font>
>
> or even worse:
>
> <font face="Arial">A</font><font face="Arial">n</font><font
> face="Arial">d</font>
> ...
>
>
> I've come up with a way, using PHP's DOMDocument system, to
> scrape a file clear of these, but it's very slow, and it's
> basically something that can be done on a stream of text
> (rather than having to worry about the document's structure).
>
> I'm thinking of writing something in PHP or C to clean stuff
> like this up, but am wondering if anyone else has any experience
> and suggestions?
>
> (And yes, I've used "htmltidy", but while that can merge _nested_
> styles, e.g., a "<font face="Arial"><font size=+1>" get
> combined into its own CSS style, e.g., "<span class="c123">",
> it doesn't seem to be able to merge _consecutive_ styles,
> as shown in the examples above. :^/ )
Consider writing a SAX filter that just drops the offending <font> and
</font> tags and passes everything else through unchanged.
Also consider using XPath, as in the following example in Ruby (using the
Nokogiri XML library):
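require 'nokogiri'

# Merge each run of sibling <font> elements into the first one.
def reform xml
  xml.xpath('//font[1]').each do |x|
    # The text between the later <font> siblings, plus the text inside them
    textnodes=x.xpath('(following-sibling::text() | following-sibling::font/text())')
    # Use y.content (not y.to_s) so entities are not escaped twice
    x.content=x.content+textnodes.map{|y| y.content}.join
    textnodes.unlink
    x.xpath('following-sibling::font').unlink
  end
  xml
end
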
xml=Nokogiri::XML('<test><font face="Arial">And</font> <font face="Arial">then</font></test>')
puts reform(xml).to_xml
xml=Nokogiri::XML('<test><font face="Arial">And</font> <font face="Arial">then</font> <b>More</b></test>')
puts reform(xml).to_xml
xml=Nokogiri::XML('<test><font face="Arial">And</font> <font face="Arial">then</font> More</test>')
puts reform(xml).to_xml

# That last example probably does the wrong thing: the trailing " More"
# gets pulled into the <font> element. To fix that, you might want the
# following more complicated version of the XPath, which only picks up
# text nodes that are immediately followed by another <font>.
def reform xml
  xml.xpath('//font[1]').each do |x|
    textnodes=x.xpath('(following-sibling::text()[following-sibling::node()[1][self::font]] | following-sibling::font/text())')
    # Use y.content (not y.to_s) so entities are not escaped twice
    x.content=x.content+textnodes.map{|y| y.content}.join
    textnodes.unlink
    x.xpath('following-sibling::font').unlink
  end
  xml
end
# More hackage may be necessary depending on the exact structure of your data.
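
Since the original question mentions that this could be done on a plain stream of text, here is a rough sketch of that route as well. It is only an illustration: ADJACENT_FONTS and merge_adjacent_fonts are made-up names, and it assumes the <font> elements are not nested and that the attribute text of neighbouring tags matches character for character (so a tag wrapped across lines would not be merged with an unwrapped one).

```ruby
# Matches: an opening <font ...> tag, its font-free content, the closing
# </font>, optional whitespace, and another opening <font ...> tag whose
# attribute text is identical to the first (backreference \2).
ADJACENT_FONTS = %r{(<font\b([^>]*)>(?:(?!</?font\b).)*?)</font>(\s*)<font\b\2>}m

# Hypothetical helper: repeatedly fuse adjacent identical <font> elements.
def merge_adjacent_fonts(html)
  result = html.dup
  loop do
    # Drop the "</font>...<font ...>" pair but keep the whitespace between.
    # gsub! returns nil once nothing was replaced, which ends the loop.
    break unless result.gsub!(ADJACENT_FONTS, '\1\3')
  end
  result
end

puts merge_adjacent_fonts('<font face="Arial">And</font> <font face="Arial">then</font>')
puts merge_adjacent_fonts('<font face="Arial">A</font><font face="Arial">n</font><font face="Arial">d</font>')
puts merge_adjacent_fonts('<font face="Arial">a</font><font face="Times">b</font>')
```

Each pass can only fuse a tag with its immediate neighbour, which is why the substitution is repeated until the text stops changing. Unlike the XPath version, this leaves neighbouring fonts with different attributes alone, as in the third call.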