Fwd: Re: [vox-tech] How to tell if a pdf is text or image?

Ken Herron kherron+lugod at fmailbox.com
Thu Mar 22 19:09:51 PDT 2007


> Well, I don't actually need the text, I just need to know if it is text.
> The idea is that once I separate them, all the ones that are images can 
> then be ocr corrected to text versions.
> So my idea was either a yes/no answer or to say something like, if the 
> document is more than 20%(arbitrary) text consider it text.

PDF is a page description format descended from PostScript. You can look 
at the raw PDF with a text editor and you'll see plain-text PDF operators 
interspersed with possibly binary data. Strictly speaking, the only sure 
way to tell what a PDF draws is to interpret it. But in practice, PDF 
code is all machine-written, and you could probably learn to distinguish 
font-using PDFs from pure-image PDFs by examining the raw PDF file.

You could look for font resources, which show up as /Font entries in 
the raw file. A document consisting only of scanned page images probably 
won't reference any fonts. Or, if the scanned-paper PDFs are all made by 
a particular program, you might be able to identify particular PDF 
operator sequences that it uses.
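
A rough sketch of the raw-file scan in Python, untested against real 
files. The function names and the three-way text/image/unknown result 
are my own invention for illustration. It assumes font dictionaries 
appear uncompressed as "/Font" and scanned pages as "/Subtype /Image". 
Files written with compressed object streams (PDF 1.5 and later) can 
hide both, and a scan that already carries an OCR text layer will have 
fonts, so treat the answer as a hint only.

```python
import re
import sys

# Heuristic markers in the raw bytes. These are assumptions for
# illustration, not an exhaustive list of what a PDF writer may emit.
FONT_RE = re.compile(rb"/Font\b")                # /Type /Font or a /Font resource dict
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")   # image XObjects

def classify_pdf_bytes(data: bytes) -> str:
    """Return 'text', 'image', or 'unknown' from a raw byte scan."""
    if FONT_RE.search(data):
        return "text"       # some font is referenced somewhere
    if IMAGE_RE.search(data):
        return "image"      # images but no visible font references
    return "unknown"        # neither marker visible (possibly compressed)

def classify_pdf_file(path: str) -> str:
    """Read a file as bytes and classify it."""
    with open(path, "rb") as f:
        return classify_pdf_bytes(f.read())

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{path}: {classify_pdf_file(path)}")
```

Run it with the PDF paths as arguments. Anything that comes back 
"unknown" would need a real PDF parser to decide.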

