[vox-tech] OCR notes

hajhouse hajhouse at houseag.com
Wed Apr 11 10:50:39 PDT 2007


On 2007-04-11, Dylan Beaudette wrote:
> Hi everyone,
>
> I am about to embark on an exciting adventure into the land of optical
> character recognition, processing nearly 1,000 documents and extracting
> numbers from them. I am interested in any anecdotal wisdom regarding:
>
> 1. efficient scanning parameters:
> DPI
> color / BW / grayscale

B&W, as high DPI as feasible.

> 2. pre-processing steps one might do with imagemagick

Clipping off borders is recommended.
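
As a rough illustration of what clipping a border amounts to, here is a minimal sketch in plain Python. It assumes the page is already loaded as a list of pixel rows; the function name clip_border and the fixed margin are hypothetical, and a real run would more likely use an image tool on the scanned files.

```python
def clip_border(image, margin):
    """Return a copy of `image` with `margin` pixels removed from every edge.

    `image` is a list of equal-length rows of pixel values.
    """
    if margin < 0:
        raise ValueError("margin must be non-negative")
    height = len(image)
    width = len(image[0]) if image else 0
    if 2 * margin >= height or 2 * margin >= width:
        raise ValueError("margin would remove the whole image")
    return [row[margin:width - margin] for row in image[margin:height - margin]]


# Hypothetical 4x5 page: 1 = black scanner-edge shadow or ink, 0 = white paper.
page = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
]
clipped = clip_border(page, margin=1)
```

A fixed margin only works if every page sits in the same place on the scanner bed; otherwise the margin has to be chosen per page.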

> 3. any filtering that one might do to get ready for the OCR

Make sure there are no handwritten notes, post-it pieces, or other
miscellaneous cruft on the documents before scanning them. If the paper
is colored or there are ghost images (such as the back-side printing
showing through thin paper), scan in grayscale and then carefully reduce
to B&W with an appropriate hand-picked threshold. I think I used
pnmremap to do that the last time that need came up for me.
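
To make the grayscale-to-B&W step concrete, here is a minimal sketch in plain Python. It is not the pnmremap invocation, only the underlying idea; the function name threshold, the sample pixel values, and the cutoffs are all made up for illustration.

```python
def threshold(gray, cutoff):
    """Reduce a grayscale image to 1-bit: 1 = black ink, 0 = white paper.

    `gray` is a list of rows of values from 0 (black) to 255 (white).
    Pixels darker than `cutoff` become ink; everything else becomes paper.
    """
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


# Hypothetical 3x4 scan. The column of values near 200 stands in for
# back-side printing showing faintly through thin paper.
scan = [
    [250, 30, 200, 245],
    [240, 25, 210, 250],
    [235, 20, 195, 240],
]

bw = threshold(scan, cutoff=128)        # ghost column is dropped
too_high = threshold(scan, cutoff=220)  # ghost column survives as ink
```

The cutoff is the hand-picked part: too high and the show-through turns into black specks, too low and thin strokes of the real text disappear.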

> I plan to use Google's new OCR project, ocropus, which currently uses
> the 'tesseract' engine. Naive attempts to OCR these documents are
> resulting in marginal accuracy, so any help is appreciated. Vertical and
> horizontal lines on the original documents are confusing the OCR, so
> removing them might be a start. I have thought about extracting each
> 'cell' of data with imagemagick, and then running the resulting
> mini-images through the OCR... that might be a last resort though...

Neat. I've never tried that. The only OCR engine I've successfully used
is gocr, which was pretty decent and worked out of the box with minimal
tweaking. I tried Clara but it seemed unstable and I gave up before I
could figure out how to make it work.
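
On the ruled lines mentioned in the quoted message, one possible approach is to blank any pixel row or column that is almost entirely black. The following is an untested-on-real-scans sketch in plain Python; the function name remove_ruled_lines and the 0.8 fill ratio are assumptions, not something taken from ocropus or tesseract.

```python
def remove_ruled_lines(bw, min_fill=0.8):
    """Blank rows and columns that are mostly ink in a 1-bit image.

    `bw` is a list of equal-length rows where 1 = black and 0 = white.
    A row or column counts as a ruled line when at least `min_fill`
    of its pixels are black.
    """
    height = len(bw)
    width = len(bw[0])
    full_rows = {y for y in range(height) if sum(bw[y]) >= min_fill * width}
    full_cols = {
        x for x in range(width)
        if sum(bw[y][x] for y in range(height)) >= min_fill * height
    }
    return [
        [0 if (y in full_rows or x in full_cols) else bw[y][x]
         for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 5x6 table cell: a rule along the top and the left edge,
# plus a small three-pixel "glyph" inside.
table = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
cleaned = remove_ruled_lines(table)
```

This only works on scans that are straight; a skewed page spreads each rule over several rows and none of them reaches the fill ratio. It also erases any part of a digit that touches a rule.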


-- 
Henry House
+1 530 753 3361 ext. 13
Please don't send me HTML mail! My mail system frequently rejects it.
The unintelligible text that may follow is a digital signature.
See <http://hajhouse.org/pgp> to find out how to use it.
My OpenPGP key: <http://hajhouse.org/hajhouse.asc>.
