[vox-tech] OCR notes
hajhouse
hajhouse at houseag.com
Wed Apr 11 10:50:39 PDT 2007
On 2007-04-11, Dylan Beaudette wrote:
> Hi everyone,
>
> I am about to embark on an exciting adventure into the land of optical
> character recognition, processing nearly 1,000 documents and extracting
> numbers from them. I am interested in any anecdotal wisdom regarding:
>
> 1. efficient scanning parameters:
> DPI
> color / BW / grayscale
B&W, as high DPI as feasible.
> 2. pre-processing steps one might do with imagemagick
Clipping off borders is recommended.
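A minimal sketch of what border clipping amounts to, assuming the scan
has already been loaded as a row-major grid of pixel values. The names
crop_border and margin are hypothetical and are not part of any tool
mentioned in this thread.

```python
def crop_border(pixels, margin):
    """Return a copy of a row-major pixel grid with `margin` pixels
    removed from each of the four edges."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    if 2 * margin >= height or 2 * margin >= width:
        raise ValueError("margin would leave no image")
    # Slice away the top/bottom rows, then the left/right columns.
    return [row[margin:width - margin]
            for row in pixels[margin:height - margin]]


if __name__ == "__main__":
    page = [[0] * 6 for _ in range(6)]
    page[2][3] = 255
    cropped = crop_border(page, 1)
    print(len(cropped), len(cropped[0]))
```

With ImageMagick itself, the -shave option removes a fixed number of
pixels from each edge, which is the same operation done on the file.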
> 3. any filtering that one might do to get ready for the OCR
Make sure there are no handwritten notes, post-it pieces, or other
miscellaneous cruft on the documents before scanning them. If the paper
is colored or there are ghost images (such as the back-side printing
showing through thin paper), scan in grayscale and then carefully reduce
to B&W with an appropriate hand-picked threshold. I think I used
pnmremap to do that the last time that need came up for me.
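The grayscale-to-B&W reduction described above can be sketched as
follows. This is only an illustration, assuming 8-bit grayscale values
(0 = black, 255 = white) held in a row-major grid; to_bilevel and
histogram are hypothetical helper names, and the threshold of 128 in the
example is arbitrary, not a recommendation.

```python
def histogram(gray):
    """Count how often each 0..255 gray level occurs; looking at the
    counts helps in picking a threshold by hand."""
    counts = [0] * 256
    for row in gray:
        for value in row:
            counts[value] += 1
    return counts


def to_bilevel(gray, threshold):
    """Map every pixel darker than `threshold` to black (0) and every
    other pixel to white (255)."""
    return [[0 if value < threshold else 255 for value in row]
            for row in gray]


if __name__ == "__main__":
    # 20 and 30 stand for ink; 180 and 200 for faint show-through from
    # the back side; 245 and 250 for clean paper.
    scan = [[250, 200, 30],
            [245, 180, 20]]
    for row in to_bilevel(scan, 128):
        print(row)
```

A threshold placed between the ink values and the show-through values
turns the ghost image white while keeping the print black, which is why
it has to be picked per batch of paper rather than fixed once.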
> I plan to use Google's new OCR project, ocropus, which currently uses
> the 'tesseract' engine. Naive attempts to OCR these documents are
> resulting in marginal accuracy, so any help is appreciated. Vertical and
> horizontal lines on the original documents are confusing the OCR, so
> removing them might be a start. I have thought about extracting each
> 'cell' of data with imagemagick, and then running the resulting
> mini-images through the OCR... that might be a last resort though...
Neat. I've never tried that. The only OCR engine I've successfully used
is gocr, which was pretty decent and worked out of the box with minimal
tweaking. I tried Clara but it seemed unstable and I gave up before I
could figure out how to make it work.
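As for the vertical and horizontal lines in the quoted message, one
possible approach (untested on real scans, and not something ocropus,
tesseract, or gocr is known here to provide) is to whiten any straight
run of black pixels that is much longer than a character stroke. The
sketch below assumes a bilevel row-major grid with 0 = black and
255 = white; long_run_mask, transpose, erase_ruled_lines, and min_run
are hypothetical names.

```python
def long_run_mask(grid, min_run, black=0):
    """Mark pixels that belong to a horizontal run of at least
    `min_run` consecutive black pixels."""
    mask = [[False] * len(row) for row in grid]
    for y, row in enumerate(grid):
        start = None
        # Iterate one past the end so a run touching the edge is closed.
        for x in range(len(row) + 1):
            is_black = x < len(row) and row[x] == black
            if is_black and start is None:
                start = x
            elif not is_black and start is not None:
                if x - start >= min_run:
                    for i in range(start, x):
                        mask[y][i] = True
                start = None
    return mask


def transpose(grid):
    """Swap rows and columns so vertical runs can be scanned as rows."""
    return [list(column) for column in zip(*grid)]


def erase_ruled_lines(grid, min_run, black=0, white=255):
    """Whiten long horizontal and vertical black runs. Both masks are
    computed from the original grid, so erasing one line does not
    shorten a line that crosses it."""
    horizontal = long_run_mask(grid, min_run, black)
    vertical = transpose(long_run_mask(transpose(grid), min_run, black))
    return [[white if horizontal[y][x] or vertical[y][x] else value
             for x, value in enumerate(row)]
            for y, row in enumerate(grid)]
```

Characters that touch a ruled line will lose the pixels they share with
it, so min_run should be chosen well above the widest glyph stroke.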
-- 
Henry House
+1 530 753 3361 ext. 13
Please don't send me HTML mail! My mail system frequently rejects it.
The unintelligible text that may follow is a digital signature.
See <http://hajhouse.org/pgp> to find out how to use it.
My OpenPGP key: <http://hajhouse.org/hajhouse.asc>.