[vox-tech] OCR notes
Dylan Beaudette
dylan.beaudette at gmail.com
Wed Apr 11 09:44:11 PDT 2007
Hi everyone,
I am about to embark on an exciting adventure into the land of optical
character recognition, processing nearly 1,000 documents and extracting
numbers from them. I am interested in any anecdotal wisdom regarding:
1. efficient scanning parameters:
   DPI
   color / BW / grayscale
2. pre-processing steps one might do with imagemagick
3. any filtering that one might do to get ready for the OCR
I plan to use Google's new OCR project, ocropus, which currently uses
the 'tesseract' engine. Naive attempts to OCR these documents are resulting in
marginal accuracy, so any help is appreciated. Vertical and horizontal lines
on the original documents are confusing the OCR, so removing them might be a
start. I have thought about extracting each 'cell' of data with imagemagick,
and then running the resulting mini-images through the OCR... that might be a
last resort though...
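
To make the line-removal idea concrete, here is a rough sketch in plain
Python. It assumes the page has already been thresholded to a binary image
(rows of 0/1, with 1 = ink) and that the ruling lines are straight and
axis-aligned after deskewing; the function name and the min_len parameter
are made up for illustration and are not part of ocropus, tesseract, or
imagemagick. Any unbroken run of ink at least min_len pixels long is
treated as a ruling line and erased, on the guess that no character stroke
is that long.

```python
def remove_long_runs(img, min_len):
    """Return a copy of binary image `img` (list of rows of 0/1, 1 = ink)
    with every horizontal or vertical ink run of length >= min_len erased."""
    h = len(img)
    w = len(img[0]) if h else 0
    erase = [[False] * w for _ in range(h)]

    # Mark long horizontal runs, row by row.
    for y in range(h):
        x = 0
        while x < w:
            if img[y][x]:
                start = x
                while x < w and img[y][x]:
                    x += 1
                if x - start >= min_len:
                    for i in range(start, x):
                        erase[y][i] = True
            else:
                x += 1

    # Mark long vertical runs, column by column. Runs are measured on the
    # original image, so a horizontal line crossing a vertical one does not
    # break either run.
    for x in range(w):
        y = 0
        while y < h:
            if img[y][x]:
                start = y
                while y < h and img[y][x]:
                    y += 1
                if y - start >= min_len:
                    for i in range(start, y):
                        erase[i][x] = True
            else:
                y += 1

    # Build the cleaned copy; the input is left untouched.
    return [[0 if erase[y][x] else img[y][x] for x in range(w)]
            for y in range(h)]


# Tiny example: a table border along the top and left edges, plus a
# three-pixel "glyph" that should survive.
page = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
cleaned = remove_long_runs(page, min_len=5)
```

On a real scan min_len would need to be tuned against the DPI (longer than
the tallest or widest character, shorter than the shortest cell border),
and slightly skewed or broken lines would slip through, so this is only a
starting point.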
thanks!
--
Dylan Beaudette
Soils and Biogeochemistry Graduate Group
University of California at Davis
530.754.7341