sk-spell

Slovak language support in Open Source programs

Sometimes people wonder: why can't tesseract-ocr recognize my image correctly, when I can read every letter easily? Well, the problem is that they look at the image through their own eyes. You need to see it through tesseract-ocr's eye.

And here is the point: tesseract-ocr works on binary images. You can pass it a colour or grayscale image too, but tesseract will convert it to a binary image before recognition.
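To get a feeling for what that conversion can do, here is a minimal sketch of a global Otsu threshold in plain Python. It is only an illustration: the Tesseract documentation describes its internal binarization as Otsu-based, but this is not Tesseract's code, and the tiny `image` below (bright paper on the left, the same paper in shadow on the right) is made up for the example.

```python
def otsu_threshold(pixels):
    """Pick the 8-bit gray level that best separates dark from bright pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of their gray values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor).
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image):
    """Return a 0/1 image: 1 = white (background), 0 = black (ink)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p > t else 0 for p in row] for row in image]


# Hypothetical 2x8 grayscale image: columns 0-3 are well lit
# (paper 220, ink 60), columns 4-7 are in shadow (paper 100, ink 30).
image = [
    [220, 60, 220, 220, 100, 30, 100, 100],
    [220, 60, 220, 220, 100, 30, 100, 100],
]

for row in binarize(image):
    print("".join("." if p else "#" for p in row))
```

With this input the whole shadowed half turns black, so the letter that a human still reads easily is simply gone from the binary image.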

A programmer can use the API function GetThresholdedImage or the (deprecated) DumpPGM to investigate possible problems with the input image.

A user can set the config option:

tessedit_write_images T

which makes tesseract write the image it actually uses to “tessinput.tif”.
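If you prefer to set the option without editing a config file, a small Python wrapper around the command line could look like the sketch below. It assumes a `tesseract` executable on the PATH whose command line accepts `-c VAR=VALUE`; the function names and the file names `page.png` and `page` are made up for the example.

```python
import subprocess


def build_dump_command(image_path, output_base):
    """Build a tesseract command line that also writes the thresholded image."""
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        "tessedit_write_images=T",  # same option as in the config line above
    ]


def dump_thresholded_image(image_path, output_base):
    """Run tesseract; assumes the 'tesseract' binary is installed and on PATH."""
    subprocess.run(build_dump_command(image_path, output_base), check=True)


# Usage (hypothetical file names):
#   dump_thresholded_image("page.png", "page")
print(" ".join(build_dump_command("page.png", "page")))
```

After the run, look for the dumped image in the current working directory. The exact file name may differ between Tesseract versions.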

Here is my real experience: on the left is the original (input) image and on the right is the binary image dumped from tesseract-ocr:

through tesseract-ocr eye: original file on left, binary image used for OCR on right

Based on this output, it is clear that I need to do “a little” preprocessing before OCR (or training). I decided to use the Leptonica library for this. It comes with a good example program named livre_adapt.c, which shows how to normalize a document image for uneven illumination. Here is the result:
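The idea behind that kind of normalization can be sketched in a few lines of plain Python. This is not a port of livre_adapt.c or of any Leptonica function, just a simplified illustration: the image is split into tiles, the brightest pixel of each tile is taken as the local background, and every pixel is rescaled so that the background lands on a common value. The names `normalize_background` and `threshold`, the tile size, the target value 200 and the tiny test image are all assumptions made for the example.

```python
def normalize_background(image, tile=4, target=200):
    """Rescale each tile so its brightest pixel (assumed paper) maps to target."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            ys = range(ty, min(ty + tile, height))
            xs = range(tx, min(tx + tile, width))
            # Crude background estimate: the brightest pixel in the tile.
            background = max(image[y][x] for y in ys for x in xs)
            background = max(background, 1)  # avoid division by zero
            for y in ys:
                for x in xs:
                    out[y][x] = min(255, image[y][x] * target // background)
    return out


def threshold(image, t=128):
    """Fixed global threshold: 1 = white (background), 0 = black (ink)."""
    return [[1 if p > t else 0 for p in row] for row in image]


# Hypothetical 2x8 grayscale image: columns 0-3 are well lit
# (paper 220, ink 60), columns 4-7 are in shadow (paper 100, ink 30).
image = [
    [220, 60, 220, 220, 100, 30, 100, 100],
    [220, 60, 220, 220, 100, 30, 100, 100],
]

before = threshold(image)
after = threshold(normalize_background(image))
for b, a in zip(before, after):
    print("".join("." if p else "#" for p in b), "->",
          "".join("." if p else "#" for p in a))
```

Without normalization the shadowed half is all black. After it, only the ink pixels stay black. A tile that contains no paper at all would fool this crude estimate, which is one reason to reach for a real library such as Leptonica instead.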

through tesseract-ocr eye: original file on left, binary image used for OCR on right

If you are interested, you can download through_tesseract-ocr_eye.tar.gz, which contains my images and two source files with simple examples of how to dump the binary image and how to preprocess images with Leptonica.


© sk-spell project