[vox-tech] Re: How to tell if a pdf is text or image?

hajhouse hajhouse at houseag.com
Fri Mar 23 03:47:16 PDT 2007


[...]
> PDF is a scripting language. You can look at the raw PDF with a text 
> editor and you'll see plain text PDF operators interspersed with 
> possibly binary data. In principle PDF is a programming language and the 
> only way to tell what it produces is to run it. But in practice, PDF 
> code is all machine-written, and you could probably learn to distinguish 
> font-using PDFs from pure-image PDFs by examining the raw PDF file.
> 
> You could look for the font embedding operators. A document consisting 
> only of scanned page images probably won't have any fonts embedded in 
> it. Or, if the scanned-paper PDFs are all made by a particular program, 
> you might be able to identify particular PDF operator sequences that it 
> uses.

In that vein, I ask: does Alex need a general method that works for any
possible PDF file, or do the PDF files in question belong to a limited
set created by one or a few programs?
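
If it is the limited case, the font-spotting idea quoted above could be
sketched roughly as follows. This is only a heuristic under stated
assumptions, not a tested tool: it assumes the font and image dictionaries
are stored uncompressed in the raw file (it cannot see into compressed
object streams), and the function names and the three labels are my own
invention for illustration. A scanned PDF that also carries an OCR text
layer would be reported as having fonts.

```python
import re

# /Type /Font marks a font dictionary; the \b keeps /FontDescriptor from matching.
FONT_RE = re.compile(rb"/Type\s*/Font\b")
# /Subtype /Image marks an image XObject (e.g. a scanned page).
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def classify_pdf_bytes(data: bytes) -> str:
    """Guess from raw PDF bytes; labels are hypothetical, not standard terms."""
    if FONT_RE.search(data):
        return "fonts"          # at least one font dictionary is visible
    if IMAGE_RE.search(data):
        return "images-only"    # images but no visible font dictionaries
    return "unknown"            # neither found (possibly compressed objects)


def classify_pdf(path: str) -> str:
    """Read a file in binary mode and classify its contents."""
    with open(path, "rb") as f:
        return classify_pdf_bytes(f.read())
```

Running classify_pdf() over a few files known to be scans and a few known
to be text would show quickly whether the heuristic holds for the programs
that produced them.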

-- 
Henry House
+1 530 753 3361 ext. 13
Please don't send me HTML mail! My mail system frequently rejects it.
The unintelligible text that may follow is a digital signature.
See <http://hajhouse.org/pgp> to find out how to use it.
My OpenPGP key: <http://hajhouse.org/hajhouse.asc>.
