The following is an archive of a post made to our 'vox-tech mailing list' by one of its subscribers.

Re: [vox-tech] OCR notes

On 2007-04-11, Dylan Beaudette wrote:
> Hi everyone,
> I am about to embark on an exciting adventure into the land of optical
> character recognition, processing nearly 1,000 documents and extracting 
> numbers from them. I am interested in any anecdotal wisdom regarding:
> 1. efficient scanning parameters:
> color / BW / grayscale

B&W, as high DPI as feasible.

> 2. pre-processing steps one might do with imagemagick

Clipping off borders is recommended.
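
To illustrate what clipping borders amounts to, here is a minimal sketch in
plain Python rather than an ImageMagick recipe. It assumes the scan is already
loaded as a list of pixel rows and that the border is the same width on every
edge; the function name crop_border and the sample data are hypothetical.

```python
def crop_border(image, margin):
    """Drop `margin` pixels from every edge of a list-of-rows image."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    height = len(image)
    width = len(image[0]) if image else 0
    if 2 * margin >= height or 2 * margin >= width:
        raise ValueError("margin leaves no pixels")
    # Slice away the top/bottom rows, then the left/right columns of each row.
    return [row[margin:width - margin] for row in image[margin:height - margin]]


# Hypothetical 4x4 scan whose outermost ring of pixels is scanner border.
scan = [
    [0, 0, 0, 0],
    [0, 7, 8, 0],
    [0, 9, 6, 0],
    [0, 0, 0, 0],
]
cropped = crop_border(scan, 1)
```

A real scan rarely has an even border, so the margin would likely need to be
chosen per edge after looking at a few sample pages.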

> 3. any filtering that one might do to get ready for the OCR

Make sure there are no handwritten notes, post-it pieces, or other
miscellaneous cruft on the documents before scanning them. If the paper
is colored or there are ghost images (such as the back-side printing
showing through thin paper), scan in grayscale and then carefully reduce
to B&W with an appropriate hand-picked threshold. I think I used
pnmremap to do that the last time that need came up for me.
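
As a rough sketch of the grayscale-to-B&W reduction described above, the
following plain Python shows the idea only; it is not the pnmremap invocation.
It assumes 8-bit gray values (0 = black, 255 = white) held in a list of rows.
The function name threshold_to_bw, the sample pixel values, and the cutoff of
128 are all hypothetical.

```python
def threshold_to_bw(gray, cutoff):
    """Return a 1-bit image: 1 = black ink, 0 = white paper.

    gray is a list of rows of ints in 0..255 (0 = black).
    cutoff is the hand-picked level; pixels darker than it become ink.
    """
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


# Hypothetical 2x4 scan: dark text (about 30), faint show-through from the
# back side (about 180), and blank paper (about 240).
page = [
    [30, 180, 240, 25],
    [240, 175, 35, 238],
]

# A cutoff between the text and the ghost image keeps only the real text.
bw = threshold_to_bw(page, cutoff=128)

# A cutoff set too high turns the show-through into ink as well.
too_high = threshold_to_bw(page, cutoff=200)
```

The point of picking the cutoff by hand is visible in the second call: the
right value depends on how dark the ghost image is compared with the text.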

> I plan to use Google's new OCR project, ocropus, which currently uses 
> the 'tesseract' engine. Naive attempts to OCR these documents are resulting in
> marginal accuracy, so any help is appreciated. Vertical and horizontal lines 
> on the original documents are confusing the OCR, so removing them might be a 
> start. I have thought about extracting each 'cell' of data with imagemagick, 
> and then running the resulting mini-images through the OCR... that might be a
> last resort though...

Neat. I've never tried that. The only OCR engine I've successfully used
is gocr, which was pretty decent and worked out of the box with minimal
tweaking. I tried Clara but it seemed unstable and I gave up before I
could figure out how to make it work.
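
On the ruled lines mentioned in the question, one possible approach is to
blank any pixel row or column that is almost entirely ink. The sketch below is
plain Python and untested on real scans. It assumes a B&W image stored as a
list of rows of 0/1 values (1 = ink), lines that run the full width or height
of the image, and a page that is not skewed. The function name
remove_ruled_lines and the 0.8 fraction are hypothetical.

```python
def remove_ruled_lines(bw, min_fraction=0.8):
    """Blank rows and columns whose ink fraction is at least min_fraction.

    bw is a non-empty list of equal-length rows of 0/1 values (1 = ink).
    """
    height = len(bw)
    width = len(bw[0])
    # Rows that are mostly ink are treated as horizontal rules.
    line_rows = {y for y, row in enumerate(bw)
                 if sum(row) >= min_fraction * width}
    # Columns that are mostly ink are treated as vertical rules.
    line_cols = {x for x in range(width)
                 if sum(row[x] for row in bw) >= min_fraction * height}
    return [
        [0 if (y in line_rows or x in line_cols) else px
         for x, px in enumerate(row)]
        for y, row in enumerate(bw)
    ]


# Hypothetical 5x5 table fragment: rules on the top, bottom, and left edges,
# plus two pixels of "text" inside the cell.
table = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
cleaned = remove_ruled_lines(table)
```

This also erases any part of a character that touches a rule, so it trades
line noise for possibly clipped glyphs.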

Henry House
+1 530 753 3361 ext. 13
Please don't send me HTML mail! My mail system frequently rejects it.
The unintelligible text that may follow is a digital signature.
See <http://hajhouse.org/pgp> to find out how to use it.
My OpenPGP key: <http://hajhouse.org/hajhouse.asc>.
