[vox-tech] OCR notes

Wed Apr 11 09:44:11 PDT 2007

Hi everyone,

I am about to embark on an exciting adventure into the land of optical 
character recognition, processing nearly 1,000 documents and extracting 
numbers from them. I am interested in any anecdotal wisdom regarding:

1. efficient scanning parameters:
DPI
color / BW / grayscale

2. pre-processing steps one might do with imagemagick

3. any filtering that one might do to get ready for the OCR

I plan to use Google's new OCR project, ocropus, which currently uses 
the 'tesseract' engine. Naive attempts to OCR these documents are resulting in 
marginal accuracy, so any help is appreciated. Vertical and horizontal lines 
on the original documents are confusing the OCR, so removing them might be a 
start. I have thought about extracting each 'cell' of data with imagemagick, 
and then running the resulting mini-images through the OCR... that might be a 
last resort though...
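
To make the line-removal idea concrete, here is a minimal sketch in plain 
Python. It assumes the scan has already been thresholded to a binary image 
(1 = dark ink, 0 = white paper) and that the page is not skewed. The function 
name remove_long_runs and the min_len parameter are hypothetical, made up 
for illustration; this is not how imagemagick, ocropus, or tesseract handle 
it. The idea is just that table rules are much longer than any stroke of a 
digit, so any very long run of dark pixels is probably a rule.

```python
def remove_long_runs(img, min_len):
    """Return a copy of img with long dark runs turned white.

    img is a list of rows; each pixel is 1 (dark ink) or 0 (white paper).
    Any horizontal or vertical run of dark pixels at least min_len long
    is assumed to be a ruling line and is erased. min_len is a
    hypothetical tuning knob: pick it longer than the tallest or widest
    character stroke on the page.
    """
    h = len(img)
    w = len(img[0]) if h else 0
    erase = [[False] * w for _ in range(h)]

    # Horizontal pass: mark long dark runs within each row.
    for y in range(h):
        x = 0
        while x < w:
            if img[y][x]:
                start = x
                while x < w and img[y][x]:
                    x += 1
                if x - start >= min_len:
                    for i in range(start, x):
                        erase[y][i] = True
            else:
                x += 1

    # Vertical pass: mark long dark runs within each column.
    # Both passes read the original image, so a crossing of two
    # rules is still detected in each direction.
    for x in range(w):
        y = 0
        while y < h:
            if img[y][x]:
                start = y
                while y < h and img[y][x]:
                    y += 1
                if y - start >= min_len:
                    for i in range(start, y):
                        erase[i][x] = True
            else:
                y += 1

    # Build the cleaned copy; the input image is left untouched.
    return [[0 if erase[y][x] else img[y][x] for x in range(w)]
            for y in range(h)]
```

One caveat with this approach: where a digit touches or crosses a rule, the 
pixels it shares with the rule are erased too, so characters sitting on a 
line may come out slightly clipped.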

thanks!

-- 
Dylan Beaudette
Soils and Biogeochemistry Graduate Group
University of California at Davis
530.754.7341