[vox-tech] How to tell if a pdf is text or image?

Alex Mandel tech_dev at wildintellect.com
Tue Mar 20 22:36:25 PDT 2007


hajhouse wrote:
> On 2007-03-20, Alex Mandel wrote:
>> Anyone know a way to tap into a pdf programmatically to tell if it 
>> contains text vs was scanned as an image?
>>
>> I basically just want to sort a directory with many thousands of pdfs.
>> I figured there must be something in the header or in the file info that
>> either says that it's an image or it has text, or, to be more complicated,
>> gives you a quick percentage of how much of the document is text, which I
>> could use to set a sort threshold.
>>
>> Alternately if it can be done more easily on a ps file there's no reason 
>> why I can't do a pdf2ps on it and then decide how to sort.
>> It's really a one time deal so I'll take the overhead on that operation.
> 
> What about converting the PDF files to postscript then running ps2ascii?
> 
> 
> 
Well, I don't actually need the text; I just need to know whether it is text.
The idea is that once I separate them, all the ones that are images can
then be converted to text versions with OCR.
So my idea was either a yes/no answer or a threshold, something like: if the
document is more than 20% (arbitrary) text, consider it text.
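
A rough sketch of the threshold idea in Python. It swaps in a simpler
measure than a true "percent text": the fraction of pages that yield any
extractable text. It assumes pdftotext (from xpdf/poppler) is installed and
that it separates pages with form feed characters; the function names and the
0.20 default are only illustrative.

```python
import subprocess


def extract_text(path):
    """Run pdftotext on one file and return its output ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout


def text_page_fraction(extracted_text):
    """Fraction of pages that contain any non-whitespace characters."""
    # Assumption: pages are separated by form feeds (\f), as pdftotext does.
    pages = extracted_text.split("\f")
    # Each page ends with a form feed, so the last chunk is normally empty.
    if pages and pages[-1].strip() == "":
        pages = pages[:-1]
    if not pages:
        return 0.0
    pages_with_text = sum(1 for page in pages if page.strip())
    return pages_with_text / len(pages)


def is_text_pdf(extracted_text, threshold=0.20):
    """True if more than `threshold` of the pages have extractable text."""
    return text_page_fraction(extracted_text) > threshold
```

Calling is_text_pdf(extract_text(path)) on each file would give the yes/no
answer; the threshold is arbitrary and would need tuning on real files.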

So far pdffonts tells me what fonts a file has, and if it's an image I get
nothing after the header lines. So that might work if I write a program
that saves the pdffonts output to a temp file and checks whether it's longer
than just the headers.
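
A minimal sketch of that check in Python, assuming pdffonts (from
xpdf/poppler) is on the PATH and prints exactly two header lines (column
titles and a dashed rule) before any font rows. It reads the output directly
instead of using a temp file; the function names are made up for
illustration.

```python
import subprocess
import sys

# Assumption: pdffonts prints a column-title line and a dashed rule first.
HEADER_LINES = 2


def font_rows(pdffonts_output):
    """Return the non-blank lines that follow the header lines."""
    lines = [line for line in pdffonts_output.splitlines() if line.strip()]
    return lines[HEADER_LINES:]


def has_fonts(pdffonts_output):
    """True if at least one font is listed, i.e. the PDF likely has text."""
    return len(font_rows(pdffonts_output)) > 0


def pdf_has_fonts(path):
    """Run pdffonts on one file and check whether any font rows came back."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return has_fonts(result.stdout)


if __name__ == "__main__":
    # Print "text" or "image" next to each PDF path given on the command line.
    for pdf_path in sys.argv[1:]:
        label = "text" if pdf_has_fonts(pdf_path) else "image"
        print(label, pdf_path)
```

The printed labels could then drive moving files into two directories. This
only works as a sort key because the scans have no OCR layer; a scanned PDF
with hidden OCR text would list fonts too.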

I guess I should clarify: when I say image, I'm talking about PDFs that
were made by scanning a document straight to TIFF with no OCR. I know
none of them have pictures, since it's all legal docs at a law firm.

Alex

