[vox-tech] How to tell if a pdf is text or image?

hajhouse hajhouse at houseag.com
Tue Mar 20 21:31:15 PDT 2007


On 2007-03-20, Alex Mandel wrote:
> Anyone know a way to tap into a pdf programmatically to tell if it
> contains text vs was scanned as an image?
>
> I basically just want to sort a directory with many thousands of pdfs.
> I figured there must be something in the header or in the file info that
> either says that it's an image or it has text, or to be more complicated
> gives you a quick percentage of document is text, which I could use to
> set a sort threshold.
>
> Alternately if it can be done more easily on a ps file there's no reason
> why I can't do a pdf2ps on it and then decide how to sort.
> It's really a one time deal so I'll take the overhead on that operation.

What about converting the PDF files to postscript then running ps2ascii?
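
Something along these lines might do it. This is only a rough sketch, assuming Ghostscript's pdf2ps and ps2ascii are both installed and on the PATH. The function names (classify, extract_text) and the 50-character threshold are made up for illustration; you would want to tune the threshold against a few files you already know are scans.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def classify(extracted_text, min_chars=50):
    """Label a document 'text' if enough non-whitespace characters came out.

    min_chars is an arbitrary illustrative threshold, not a recommended value.
    """
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return "text" if visible >= min_chars else "image"


def extract_text(pdf_path):
    """Run pdf2ps, then ps2ascii, and return whatever ps2ascii prints."""
    with tempfile.TemporaryDirectory() as tmp:
        ps_path = Path(tmp) / "out.ps"
        # pdf2ps input.pdf output.ps
        subprocess.run(["pdf2ps", str(pdf_path), str(ps_path)], check=True)
        # With only an input file given, ps2ascii writes to stdout.
        result = subprocess.run(
            ["ps2ascii", str(ps_path)],
            check=True,
            capture_output=True,
            text=True,
            errors="replace",
        )
        return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print one "label<TAB>path" line per PDF in the given directory.
    for pdf in sorted(Path(sys.argv[1]).glob("*.pdf")):
        try:
            label = classify(extract_text(pdf))
        except (subprocess.CalledProcessError, OSError):
            label = "error"
        print(f"{label}\t{pdf}")
```

Run it with the directory as the only argument and pipe the output through sort to group the files by label. A scanned PDF that went through OCR and carries a hidden text layer would probably come out as "text", so spot-check the results.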

-- 
Henry House
+1 530 753 3361 ext. 13
Please don't send me HTML mail! My mail system frequently rejects it.
The unintelligible text that may follow is a digital signature.
See <http://hajhouse.org/pgp> to find out how to use it.
My OpenPGP key: <http://hajhouse.org/hajhouse.asc>.
