Fwd: Re: [vox-tech] How to tell if a pdf is text or image?

Thu Mar 22 14:04:38 PDT 2007

Gandalf posted this from a non-subscribed address:

----- Forwarded message from vox-tech-bounces at lists.lugod.org -----

Date: Wed, 21 Mar 2007 07:54:22 -0700 (PDT)
From: Gandalf Parker <gandalf at community.net>
Subject: Re: [vox-tech] How to tell if a pdf is text or image?
To: lugod's technical discussion forum <vox-tech at lists.lugod.org>

On Tue, 20 Mar 2007, Alex Mandel wrote:

>Well, I don't actually need the text, I just need to know if it is text.
>The idea is that once I separate them, all the ones that are images can 
>then be ocr corrected to text versions.
>So my idea was either a yes/no answer or to say something like, if the 
>document is more than 20%(arbitrary) text consider it text.

Try typing
identify

If you have ImageMagick installed, it will give you plenty of information,
and you can turn up the verbosity with -verbose. Then grep for the one line
that tells you what you need to know. I've never known it not to give
enough info, and it's very fast.
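
A minimal sketch of the "run identify, then grep" idea, assuming ImageMagick's identify is on the PATH (reading PDFs usually also requires Ghostscript). The message does not say which line of the output matters, so the keyword is left as a parameter; the helper names filter_lines and identify_grep are made up for illustration.

```python
import subprocess


def filter_lines(verbose_output, keyword):
    """Keep only the lines that contain keyword (case-insensitive), like grep -i."""
    needle = keyword.lower()
    return [line.strip() for line in verbose_output.splitlines()
            if needle in line.lower()]


def identify_grep(path, keyword):
    """Run `identify -verbose` on path and return the matching lines.

    Assumes the identify binary is installed; raises CalledProcessError
    if identify exits with an error.
    """
    result = subprocess.run(["identify", "-verbose", path],
                            capture_output=True, text=True, check=True)
    return filter_lines(result.stdout, keyword)


# Example call (needs a real file on disk):
#   for line in identify_grep("scan.pdf", "format"):
#       print(line)
```

Keeping the filtering separate from the subprocess call makes it easy to try different keywords against saved output without re-running identify each time.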

Or you might consider installing the ImageMagick tools if you don't have
them. There are a ton of very useful options (although it quickly loses you
in graphics jargon if you try really fancy things). There is a great
thumbnail webpage generator in it which might also speed up the process
for you.

Gandalf  Parker

I did give that a shot, but it only gives me info about the PDF as an image. It can't tell me anything about the fonts embedded in the file.

Alex
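
One possible way to get at the font information, sketched here as an assumption rather than something proposed in the thread: the pdffonts utility (shipped with Xpdf and Poppler) lists the fonts a PDF uses. The sketch assumes pdffonts is installed and that its output starts with a header row followed by a row of dashes; a file with no rows after that is probably image-only. The helper names count_fonts and looks_like_text_pdf are hypothetical. Note that a scan with a hidden OCR text layer would also list fonts, so this is only a rough yes/no test.

```python
import subprocess


def count_fonts(pdffonts_output):
    """Count the font rows that follow the dashed separator line.

    Returns 0 if there is no separator line or nothing after it.
    """
    lines = [line for line in pdffonts_output.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        if line.startswith("---"):
            return len(lines) - index - 1
    return 0


def looks_like_text_pdf(path):
    """Rough yes/no: does the PDF reference at least one font?

    Assumes the pdffonts binary is on the PATH.
    """
    result = subprocess.run(["pdffonts", path],
                            capture_output=True, text=True, check=True)
    return count_fonts(result.stdout) > 0


# Example call (needs a real file on disk):
#   print(looks_like_text_pdf("report.pdf"))
```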