[vox-tech] How to tell if a pdf is text or image?

Ken Bloom kbloom at gmail.com
Tue Mar 20 21:17:06 PDT 2007


On Tuesday 20 March 2007 21:30, Alex Mandel wrote:
> Anyone know a way to tap into a pdf programmatically to tell if it
> contains text vs was scanned as an image?
>
> I basically just want to sort a directory with many thousands of
> pdfs. I figured there must be something in the header or in the file
> info that says either that it's an image or that it has text, or, to
> be more complicated, gives a quick percentage of the document that is
> text, which I could use to set a sort threshold.
>
> Alternately if it can be done more easily on a ps file there's no
> reason why I can't do a pdf2ps on it and then decide how to sort.
> It's really a one time deal so I'll take the overhead on that
> operation.

I think pdfedit (http://pdfedit.petricek.net/) can tell you what you
want to know, but getting a quick text percentage out of it looks
insanely hard for one document, let alone thousands.

At any rate, maybe someone else on vox-tech will find it useful to know 
about.
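
On the sorting question itself, a cruder route that does not involve
pdfedit might be to run each file through pdftotext (from xpdf or
poppler, assumed to be installed and on the PATH) and count how much
text comes out per page. The sketch below is only that, a sketch: the
function names and the threshold of 50 characters per page are made up
for illustration, and a scanned PDF that carries an OCR text layer will
still be counted as text.

```python
import subprocess
from pathlib import Path


def text_chars_per_page(text: str) -> float:
    """Average number of non-whitespace characters per page.

    pdftotext ends each page with a form feed ("\f"), so the page
    count is estimated from the form feeds in its output.
    """
    pages = max(text.count("\f"), 1)
    chars = sum(1 for c in text if not c.isspace())
    return chars / pages


def looks_like_text(text: str, threshold: float = 50.0) -> bool:
    """True if the extracted text is dense enough to call the PDF 'text'.

    The default threshold is an arbitrary guess; tune it on a sample.
    """
    return text_chars_per_page(text) >= threshold


def extract_text(pdf_path: Path) -> str:
    """Run 'pdftotext FILE -', which writes the extracted text to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        check=False,  # an unreadable PDF just yields little or no text
    )
    return result.stdout.decode("utf-8", errors="replace")


def classify_directory(directory: str) -> dict:
    """Map every *.pdf path in the directory to 'text' or 'image'."""
    labels = {}
    for pdf in sorted(Path(directory).glob("*.pdf")):
        labels[pdf] = "text" if looks_like_text(extract_text(pdf)) else "image"
    return labels
```

Calling classify_directory("/path/to/pdfs") would give a dictionary you
could then use to move files into two directories. The value returned
by text_chars_per_page could also stand in for the "percentage" as a
rough sort key.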

--Ken

-- 
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

