Fwd: Re: [vox-tech] How to tell if a pdf is text or image?
tech_dev at wildintellect.com
Thu Mar 22 14:04:38 PDT 2007
Gandalf posted this from a non-subscribed address:
----- Forwarded message from vox-tech-bounces at lists.lugod.org -----
The attached message has been automatically discarded.
Date: Wed, 21 Mar 2007 07:54:22 -0700 (PDT)
From: Gandalf Parker <gandalf at community.net>
Subject: Re: [vox-tech] How to tell if a pdf is text or image?
To: lugod's technical discussion forum <vox-tech at lists.lugod.org>
On Tue, 20 Mar 2007, Alex Mandel wrote:
>Well, I don't actually need the text, I just need to know if it is text.
>The idea is that once I separate them, all the ones that are images can
>then be OCR corrected to text versions.
>So my idea was either a yes/no answer or to say something like, if the
>document is more than 20% (arbitrary) text consider it text.
Try typing
identify
If you have ImageMagick installed, it will give you plenty of information,
and you can turn up the verbose setting. Then grep for the one line that
tells you what you need to know. I've never known it not to give enough
info, and it's very fast.
Or you might consider installing the ImageMagick tools if you don't have
them. There are a ton of very useful options (although it quickly loses you
in graphics jargon if you try really fancy things). There is a great
thumbnail webpage generator in it, which might also speed up the process
for you.
Gandalf Parker
I did give that a shot, but it only gives me info about the PDF as an image. It can't tell me anything about the fonts embedded in the file.
Alex
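
For what it's worth, here is a rough sketch of one way to check for embedded fonts and apply the 20% idea. It is not from the thread itself. It assumes the `pdffonts` and `pdftotext` command-line tools (from xpdf or poppler) are installed, that `pdffonts` prints two header lines followed by one row per font, and that `pdftotext` separates pages with form feed characters. The function names and the reading of "20% text" as "at least 20% of pages contain extractable text" are my own assumptions.

```python
import subprocess


def count_fonts(pdffonts_output: str) -> int:
    """Count font rows in pdffonts output.

    Assumes two header lines (column names, then dashes),
    followed by one line per font.
    """
    lines = pdffonts_output.splitlines()
    return sum(1 for line in lines[2:] if line.strip())


def text_page_fraction(pdftotext_output: str) -> float:
    """Fraction of pages that contain any non-whitespace text.

    Assumes pages are separated by form feeds, with a trailing
    form feed after the last page.
    """
    pages = pdftotext_output.split("\f")
    # Drop the empty chunk left after the final form feed.
    if pages and not pages[-1].strip():
        pages = pages[:-1]
    if not pages:
        return 0.0
    return sum(1 for page in pages if page.strip()) / len(pages)


def looks_like_text(path: str, threshold: float = 0.2) -> bool:
    """Hypothetical yes/no check; threshold is the arbitrary 20%."""
    fonts = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    ).stdout
    if count_fonts(fonts) == 0:
        # No fonts at all: almost certainly scanned images.
        return False
    text = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    ).stdout
    return text_page_fraction(text) >= threshold
```

Calling `looks_like_text("some.pdf")` on each file would give the yes/no answer. Note that a scanned PDF that already carries a hidden OCR text layer would probably count as text under this check, so treat the result as a first pass only.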