[vox-tech] How to tell if a pdf is text or image?
Ken Bloom
kbloom at gmail.com
Tue Mar 20 21:17:06 PDT 2007
On Tuesday 20 March 2007 21:30, Alex Mandel wrote:
> Anyone know a way to tap into a pdf programmatically to tell if it
> contains text vs was scanned as an image?
>
> I basically just want to sort a directory with many thousands of
> pdfs. I figured there must be something in the header or in the file
> info that either says that it's an image or it has text, or, to be
> more complicated, gives you a quick percentage of the document that
> is text, which I could use to set a sort threshold.
>
> Alternately if it can be done more easily on a ps file there's no
> reason why I can't do a pdf2ps on it and then decide how to sort.
> It's really a one time deal so I'll take the overhead on that
> operation.
I think pdfedit (http://pdfedit.petricek.net/) can tell you what you want
to know, but it looks insanely hard to get a quick percentage out of it
for one document, let alone thousands.
At any rate, maybe someone else on vox-tech will find it useful to know
about.
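For what it's worth, one rough way to approximate the "percentage of text"
idea is to run each file through pdftotext (from xpdf/poppler) and count how
many pages yield any real text. Below is a minimal sketch in Python, assuming
pdftotext is installed and on the PATH. The function names, the 20-character
cutoff, and the 0.5 threshold are made up for illustration, and a scanned PDF
that carries an OCR text layer would be counted as text by this heuristic.

```python
import subprocess
import sys
from pathlib import Path


def text_page_fraction(extracted, min_chars=20):
    """Return the fraction of pages that contain at least min_chars
    non-whitespace characters of extracted text."""
    # pdftotext ends each page with a form feed, so splitting on it gives
    # one chunk per page plus an empty chunk after the last form feed.
    pages = extracted.split("\f")
    if pages and pages[-1].strip() == "":
        pages = pages[:-1]
    if not pages:
        return 0.0
    with_text = sum(1 for p in pages if len("".join(p.split())) >= min_chars)
    return with_text / len(pages)


def extract_text(pdf_path):
    """Run pdftotext on one file; "-" sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    # Treat unreadable or broken files as having no text.
    return result.stdout if result.returncode == 0 else ""


def classify(fraction, threshold=0.5):
    """Hypothetical sort rule: mostly-text pages means a "text" PDF."""
    return "text" if fraction >= threshold else "image"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python sort_pdfs.py /path/to/pdfs   (script name is arbitrary)
    for pdf in sorted(Path(sys.argv[1]).glob("*.pdf")):
        frac = text_page_fraction(extract_text(pdf))
        print(f"{classify(frac)}\t{frac:.2f}\t{pdf.name}")
```

The output is one tab-separated line per file (label, fraction, name), which
should be easy to feed into sort or a small mv loop. The threshold is the knob
to tune after eyeballing a few known scanned and known text files.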
--Ken
--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/