[vox-tech] How to tell if a pdf is text or image?

hajhouse hajhouse at houseag.com
Tue Mar 20 23:15:36 PDT 2007


[...]
> Well, I don't actually need the text, I just need to know if it is text.
> The idea is that once I separate them, all the ones that are images can 
> then be run through OCR and converted to text versions.
> So my idea was either a yes/no answer or to say something like, if the 
> document is more than 20% (arbitrary) text, consider it text.

Right, but presumably you wouldn't get any text from ps2ascii if the
document were just images, which makes it an effective test for the
scenario you have described. That said, I think your idea of using
pdffonts would be equally effective and possibly easier to use in a script.
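
A rough sketch of the ps2ascii test in Python, in case it helps. It
assumes Ghostscript's ps2ascii is on your PATH and writes the extracted
text to stdout when given only an input file; the function names are
just ones I made up for illustration.

```python
import subprocess


def has_extractable_text(extracted: str) -> bool:
    """Return True if the extractor output holds any non-whitespace text."""
    # strip() also removes form feeds and newlines, which an image-only
    # document may still produce as page separators.
    return bool(extracted.strip())


def pdf_has_text(path: str) -> bool:
    """Run ps2ascii on a PDF and report whether any text came out.

    Assumption: ps2ascii is installed and prints the text to stdout.
    """
    result = subprocess.run(
        ["ps2ascii", path],
        capture_output=True,
        text=True,
        check=True,  # raise if ps2ascii itself fails
    )
    return has_extractable_text(result.stdout)
```

Calling pdf_has_text("scan.pdf") on each file would then give the yes/no
answer; anything that returns False goes in the OCR pile.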

> So far pdffonts tells me what fonts I have, and if it's an image I get 
> nothing after the header lines. So that might work if I write a program 
> that saves the pdffonts output to a temp file and checks whether it's
> longer than just the headers.
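
You may not even need the temp file. Here is a minimal sketch of that
check in Python, assuming pdffonts is on your PATH and prints exactly two
header lines (the column names and a dashed rule) before any font
entries. The function names and the HEADER_LINES constant are my own,
not part of pdffonts.

```python
import subprocess

# Assumption: pdffonts prints a column-name line and a dashed rule
# before the first font entry.
HEADER_LINES = 2


def lists_fonts(pdffonts_output: str) -> bool:
    """Return True if the output has at least one line after the headers."""
    lines = [ln for ln in pdffonts_output.splitlines() if ln.strip()]
    return len(lines) > HEADER_LINES


def pdf_lists_fonts(path: str) -> bool:
    """Run pdffonts on a PDF and report whether it lists any fonts."""
    result = subprocess.run(
        ["pdffonts", path],
        capture_output=True,
        text=True,
        check=True,  # raise if pdffonts itself fails
    )
    return lists_fonts(result.stdout)
```

A file for which pdf_lists_fonts() returns False has no fonts at all, so
by your reasoning it would be an image-only document.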


-- 
Henry House
+1 530 753 3361 ext. 13
Please don't send me HTML mail! My mail system frequently rejects it.
The unintelligible text that may follow is a digital signature.
See <http://hajhouse.org/pgp> to find out how to use it.
My OpenPGP key: <http://hajhouse.org/hajhouse.asc>.
