[vox-tech] Re: How to tell if a pdf is text or image?
hajhouse
hajhouse at houseag.com
Fri Mar 23 03:47:16 PDT 2007
[...]
> PDF is a scripting language. You can look at the raw PDF with a text
> editor and you'll see plain text PDF operators interspersed with
> possibly binary data. In principle PDF is a programming language and the
> only way to tell what it produces is to run it. But in practice, PDF
> code is all machine-written, and you could probably learn to distinguish
> font-using PDFs from pure-image PDFs by examining the raw PDF file.
>
> You could look for the font embedding operators. A document consisting
> only of scanned page images probably won't have any fonts embedded in
> it. Or, if the scanned-paper PDFs are all made by a particular program,
> you might be able to identify particular PDF operator sequences that it
> uses.
In that vein, I ask: is Alex's question about a general method that must
work for any possible PDF file, or do the PDF files in this particular
problem come from a limited set created by one or a few programs?
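
For the limited-set case, the raw-file inspection suggested in the quoted
text could be sketched roughly as below. This is only a heuristic under
stated assumptions, not a general method: it assumes the dictionaries that
declare fonts (/Type /Font) and image XObjects (/Subtype /Image) are stored
uncompressed, which is not guaranteed (newer PDFs may pack them into
compressed object streams). The function names classify_pdf_bytes and
classify_pdf are made up for illustration.

```python
import re

# Name tokens as they typically appear in uncompressed PDF dictionaries.
# \b keeps "/Font" from matching longer names such as "/FontDescriptor".
FONT_RE = re.compile(rb"/Type\s*/Font\b")
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def classify_pdf_bytes(data: bytes) -> str:
    """Guess whether raw PDF bytes describe text, images, or both.

    Heuristic only: compressed object streams can hide these markers,
    in which case the result is "unknown".
    """
    has_font = FONT_RE.search(data) is not None
    has_image = IMAGE_RE.search(data) is not None
    if has_font and has_image:
        return "mixed"
    if has_font:
        return "text"
    if has_image:
        return "image"
    return "unknown"


def classify_pdf(path: str) -> str:
    """Read a file from disk and classify it with classify_pdf_bytes."""
    with open(path, "rb") as f:
        return classify_pdf_bytes(f.read())
```

A result of "mixed" would still need a closer look, since a scanned page
with an OCR text layer also declares fonts.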
--
Henry House
+1 530 753 3361 ext. 13
Please don't send me HTML mail! My mail system frequently rejects it.
The unintelligible text that may follow is a digital signature.
See <http://hajhouse.org/pgp> to find out how to use it.
My OpenPGP key: <http://hajhouse.org/hajhouse.asc>.