[vox-tech] How to tell if a pdf is text or image?
Alex Mandel
tech_dev at wildintellect.com
Tue Mar 20 22:36:25 PDT 2007
hajhouse wrote:
> On 2007-03-20, Alex Mandel wrote:
>> Anyone know a way to tap into a pdf programmatically to tell if it
>> contains text vs was scanned as an image?
>>
>> I basically just want to sort a directory with many thousands of pdfs.
>> I figured there must be something in the header or in the file info that
>> either says that it's an image or it has text, or, to be more complicated,
>> gives you a quick percentage of how much of the document is text, which I
>> could use to set a sort threshold.
>>
>> Alternately if it can be done more easily on a ps file there's no reason
>> why I can't do a pdf2ps on it and then decide how to sort.
>> It's really a one time deal so I'll take the overhead on that operation.
>
> What about converting the PDF files to postscript then running ps2ascii?
Well, I don't actually need the text; I just need to know whether it is text.
The idea is that once I separate them, all the ones that are images can
then be run through OCR to produce text versions.
So my idea was either a yes/no answer or a rule like: if the
document is more than 20% (arbitrary) text, consider it text.
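
One rough way to put a number on "how much of the document is text" is sketched below. It assumes that pdftotext (from xpdf) is installed, that it separates pages with form feeds (its default unless -nopgbrk is given), and that "percent text" can be approximated as the fraction of pages with any extractable text. That definition and the function names text_page_fraction and looks_like_text are made up for illustration, not something from the tools themselves.

```python
import subprocess


def text_page_fraction(pdftotext_output: str) -> float:
    """Return the fraction of pages containing any non-whitespace text.

    Assumption: pages are separated by form feed characters, which is
    pdftotext's default behaviour when -nopgbrk is not given.
    """
    pages = pdftotext_output.split("\f")
    # A trailing form feed leaves one empty string at the end; drop it.
    if len(pages) > 1 and pages[-1].strip() == "":
        pages.pop()
    with_text = sum(1 for page in pages if page.strip())
    return with_text / len(pages)


def looks_like_text(pdf_path: str, threshold: float = 0.20) -> bool:
    """Hypothetical helper: True if more than `threshold` of pages have text."""
    # "-" as the output file sends the extracted text to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return text_page_fraction(result.stdout) > threshold
```

The 0.20 default is just the arbitrary 20% from above; a pure scan with no OCR layer should come out at 0.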
So far pdffonts tells me what fonts I have, and if it's an image I get
nothing after the header lines. So that might work if I write a program
that saves the pdffonts output to a temp file and checks whether it's
longer than just the headers.
I guess I should clarify: when I say image, I'm talking about PDFs that
were made by scanning a document straight to TIFF with no OCR. I know
none of them have pictures, since it's all legal docs at a law firm.
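
The pdffonts check described above could look roughly like the sketch below. It assumes pdffonts prints exactly two header lines (the column titles and a dashed rule) before any font rows, which matches what I see here but is worth checking against your version. The names lists_fonts, pdf_has_fonts and split_directory are hypothetical. Capturing stdout directly avoids the temp file.

```python
import subprocess
from pathlib import Path

# Assumption: pdffonts prints a column-title line and a dashed rule
# before listing any fonts.
HEADER_LINES = 2


def lists_fonts(pdffonts_output: str) -> bool:
    """True if the output has at least one line beyond the headers."""
    lines = [line for line in pdffonts_output.splitlines() if line.strip()]
    return len(lines) > HEADER_LINES


def pdf_has_fonts(pdf_path: str) -> bool:
    """Run pdffonts on one file and report whether any fonts are listed."""
    result = subprocess.run(
        ["pdffonts", pdf_path],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return lists_fonts(result.stdout)


def split_directory(directory: str):
    """Partition the PDFs in a directory into (text, scanned) lists."""
    text_pdfs, scanned_pdfs = [], []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        if pdf_has_fonts(str(pdf)):
            text_pdfs.append(pdf)
        else:
            scanned_pdfs.append(pdf)
    return text_pdfs, scanned_pdfs
```

Calling split_directory on the folder of PDFs would give one list to leave alone and one list to send through OCR.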
Alex