[vox-tech] How to tell if a pdf is text or image?
hajhouse
hajhouse at houseag.com
Tue Mar 20 21:31:15 PDT 2007
On 2007-03-20, Alex Mandel wrote:
> Does anyone know a way to inspect a PDF programmatically to tell whether it
> contains text or was scanned as an image?
>
> I basically just want to sort a directory with many thousands of PDFs.
> I figured there must be something in the header or in the file info that
> either says whether it's an image or text or, to be more complicated,
> gives a quick percentage of how much of the document is text, which I
> could use to set a sort threshold.
>
> Alternatively, if it can be done more easily on a PS file, there's no
> reason I can't run pdf2ps on it and then decide how to sort.
> It's really a one-time deal, so I'll take the overhead of that operation.
What about converting the PDF files to PostScript and then running ps2ascii
on the result? A scanned document should yield little or no text.
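
A rough sketch of that idea in Python, in case it helps. It assumes the
Ghostscript helper scripts pdf2ps and ps2ascii are on the PATH. The function
names and the 50-character threshold are made up for illustration, so tune
the threshold against a handful of files you have already checked by hand.
How much text ps2ascii recovers can vary with the fonts in the file, so treat
the result as a first pass rather than a definitive answer.

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical cutoff: fewer non-blank characters than this counts as "image".
MIN_TEXT_CHARS = 50


def extract_text(pdf_path):
    """Run pdf2ps, then ps2ascii, and return whatever text comes out."""
    with tempfile.TemporaryDirectory() as tmp:
        ps_path = Path(tmp) / "out.ps"
        # pdf2ps input.pdf output.ps
        subprocess.run(["pdf2ps", str(pdf_path), str(ps_path)], check=True)
        # Given only an input file, ps2ascii writes the text to stdout.
        result = subprocess.run(
            ["ps2ascii", str(ps_path)], check=True, capture_output=True
        )
    return result.stdout.decode("utf-8", errors="replace")


def count_text_chars(text):
    """Count characters that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def classify(text, min_chars=MIN_TEXT_CHARS):
    """Label extracted text as 'text' or 'image' using the threshold."""
    return "text" if count_text_chars(text) >= min_chars else "image"


def classify_directory(directory):
    """Yield (label, path) for every *.pdf directly inside directory."""
    for pdf in sorted(Path(directory).glob("*.pdf")):
        try:
            label = classify(extract_text(pdf))
        except (subprocess.CalledProcessError, OSError):
            label = "error"  # conversion failed or the tools are missing
        yield label, pdf
```

Calling classify_directory("/path/to/pdfs") and printing each pair gives a
listing you can sort or use to move files. Since count_text_chars returns a
raw count, dividing it by the page count would get closer to the "percentage
of text" idea from the original question.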
--
Henry House
+1 530 753 3361 ext. 13
Please don't send me HTML mail! My mail system frequently rejects it.
The unintelligible text that may follow is a digital signature.
See <http://hajhouse.org/pgp> to find out how to use it.
My OpenPGP key: <http://hajhouse.org/hajhouse.asc>.