Fwd: Re: [vox-tech] How to tell if a pdf is text or image?

Jan W jcwynholds at yahoo.com
Fri Mar 23 10:09:37 PDT 2007


Hi Alex:

We do similar things here where I work.

We have been through this a few times.

The problem is that some PDFs are font info plus text (with script like
"draw a 12pt arial font with text 'hello world' at 30,30").  Some PDFs
are images with 'hidden' text behind regions with no font info (text
'hello world' occupies 30,30 through 90,30).

So by going after the font, you might miss some of the hidden text
guys...

Here, we finally decided that running pdftotext (from poppler) and
checking whether there is any output is the easiest/fastest way (without
pulling apart the PDF with custom code).  If pdftotext exits with 0
status and has no output, then we decide that the PDF is image-based
(kinda kludgy, but it works for us)...
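
For what it's worth, here is a rough Python sketch of that check. It is
not our actual script, just an illustration. The function names and the
'text'/'image'/'error' labels are made up for this example. It assumes
pdftotext is on the PATH, and it treats whitespace-only output as "no
output", since pdftotext usually still writes a form feed at the end of
each page.

```python
import subprocess


def classify_pdftotext_result(returncode, output):
    """Map a pdftotext exit status and its stdout to 'text', 'image' or 'error'.

    Hypothetical helper for this sketch, not part of poppler.
    """
    if returncode != 0:
        # pdftotext failed (damaged file, missing permissions, ...), so we
        # cannot say anything about the content.
        return "error"
    # strip() removes spaces, newlines and form feeds; anything left over
    # is extracted text.
    return "text" if output.strip() else "image"


def classify_pdf(path, run=subprocess.run):
    """Run pdftotext on `path` and classify the result.

    `run` is injectable only so the logic can be tried without pdftotext.
    """
    # "-" as the output file makes pdftotext write to stdout.
    result = run(["pdftotext", path, "-"], capture_output=True)
    return classify_pdftotext_result(result.returncode, result.stdout)
```

Calling classify_pdf("scan.pdf") on a real file would then sort it into
one of the three buckets; the "error" bucket is worth checking by hand
rather than sending straight to OCR.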

This comes from my shaky knowledge of PDFs, so good luck and HTHO.

:)

jan


> >Well, I don't actually need the text, I just need to know if it is text.
> >The idea is that once I separate them, all the ones that are images can
> >then be ocr corrected to text versions.
> >So my idea was either a yes/no answer or to say something like, if the
> >document is more than 20% (arbitrary) text consider it text.


<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><
"The most potent weapon in the hands of the oppressor is the 
mind of the oppressed."
-- Steven Biko
("White Racism and Black Consciousness", in I Write What I Like)
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><


 

