[vox-tech] How to tell if a pdf is text or image?

Alex Mandel tech_dev at wildintellect.com
Tue Mar 20 19:30:26 PDT 2007


Does anyone know a way to inspect a PDF programmatically to tell whether
it contains real text or was scanned as an image?

I basically just want to sort a directory with many thousands of PDFs.
I figured there must be something in the header or in the file info that
either says whether it's an image or has text or, to be more complicated,
gives you a quick percentage of how much of the document is text, which
I could use to set a sort threshold.
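
One way the percentage idea could be sketched, assuming the pdftotext
command-line tool (from xpdf or poppler) is installed: extract the text,
split it on the form feed that pdftotext writes at the end of each page,
and count the pages that yield more than a handful of characters. The
function names, the 20-character cutoff, and the 0.5 threshold below are
illustrative assumptions, not anything defined by the PDF format. Note
that a scan carrying a hidden OCR text layer would still come out as
"text" here.

```python
import subprocess
import sys
from pathlib import Path

# Assumed cutoff: a page with fewer non-space characters counts as image-only.
MIN_CHARS = 20


def page_texts(pdf_path):
    """Return the text of each page, as extracted by the pdftotext tool."""
    out = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True,
        check=True,
    ).stdout.decode("utf-8", errors="replace")
    pages = out.split("\f")
    # Each page ends with a form feed, so the last piece is an empty leftover.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def text_fraction(pages, min_chars=MIN_CHARS):
    """Fraction of pages with at least min_chars non-whitespace characters."""
    if not pages:
        return 0.0
    with_text = sum(1 for p in pages if len("".join(p.split())) >= min_chars)
    return with_text / len(pages)


def classify(fraction, threshold=0.5):
    """Label a document 'text' or 'image' using a hypothetical threshold."""
    return "text" if fraction >= threshold else "image"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print one tab-separated line per PDF: name, label, text fraction.
    for pdf in sorted(Path(sys.argv[1]).glob("*.pdf")):
        try:
            frac = text_fraction(page_texts(pdf))
        except (subprocess.CalledProcessError, OSError) as err:
            print(f"{pdf.name}\terror\t{err}")
            continue
        print(f"{pdf.name}\t{classify(frac)}\t{frac:.2f}")
```

Run with the directory as the only argument, the output can be piped
through sort or used to move files into separate directories. Counting
pages rather than bytes keeps the measure independent of image size.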

Alternatively, if it can be done more easily on a PS file, there's no
reason why I can't run pdf2ps on it and then decide how to sort.
It's really a one-time deal, so I'll take the overhead of that operation.

Alex

