[vox-tech] How to tell if a pdf is text or image?
Alex Mandel
tech_dev at wildintellect.com
Tue Mar 20 19:30:26 PDT 2007
Does anyone know a way to inspect a PDF programmatically to tell whether
it contains text or was scanned as an image?
I basically just want to sort a directory with many thousands of PDFs.
I figured there must be something in the header or in the file info that
either says whether it's an image or has text, or, to be more complicated,
gives a quick percentage of how much of the document is text, which I
could use to set a sort threshold.
Alternatively, if it can be done more easily on a PS file, there's no
reason why I can't run pdf2ps on it first and then decide how to sort.
It's really a one-time deal, so I'll take the overhead on that operation.
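To make the threshold idea concrete, here is a rough sketch rather than a definitive answer: run each file through pdftotext (from xpdf/poppler, assumed to be installed and on PATH) and count how many characters come out. The function names and the 100-character cutoff are made up for illustration, and a scan that carries an OCR text layer would still be counted as text.

```python
import subprocess
from pathlib import Path


def extracted_chars(pdf_path):
    """Return the number of non-whitespace bytes pdftotext extracts."""
    # Assumption: the pdftotext CLI is installed; "-" writes the text to stdout.
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    return len(b"".join(result.stdout.split()))


def looks_like_text(char_count, min_chars=100):
    """Hypothetical threshold: call a PDF 'text' if enough characters came out."""
    return char_count >= min_chars


def sort_directory(directory, min_chars=100):
    """Split the PDFs in a directory into 'text' and 'image' lists."""
    groups = {"text": [], "image": []}
    for pdf in sorted(Path(directory).glob("*.pdf")):
        try:
            count = extracted_chars(pdf)
        except subprocess.CalledProcessError:
            count = 0  # unreadable or encrypted file: file it with the images
        key = "text" if looks_like_text(count, min_chars) else "image"
        groups[key].append(pdf.name)
    return groups
```

Calling sort_directory("/path/to/pdfs") would return the two lists of file names, and the cutoff can be tuned after spot-checking a few files from each group.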
Alex