Sometimes people wonder: why can't tesseract-ocr recognize my image correctly? I can read all the letters easily. Well, the problem is that they see the image through their own eyes, and you need to see it through tesseract-ocr's eyes.
And here is the point: tesseract-ocr works on binary images. You can pass it a colour or grayscale image too, but tesseract will convert it to a binary image first.
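To get a feeling for what this conversion does, here is a minimal sketch in Python of a global Otsu threshold, a classic binarization method. As far as I know tesseract's default binarization is based on Otsu's method, but treat that as an assumption: this is not tesseract's actual code, and the function names and sample pixel values are made up for illustration.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_bright = total - weight_dark
        if weight_bright == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_bright = (sum_all - sum_dark) / weight_bright
        between_var = weight_dark * weight_bright * (mean_dark - mean_bright) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image, threshold):
    """Map each pixel to 1 (white) if it is above the threshold, else 0 (black)."""
    return [[1 if p > threshold else 0 for p in row] for row in image]


# Hypothetical one-row "page": the left half is well lit (paper 200, ink 80),
# the right half lies in shadow (paper 90, ink 20).
page = [[200, 200, 80, 200, 90, 90, 20, 90]]
t = otsu_threshold([p for row in page for p in row])
print(t)                  # 90
print(binarize(page, t))  # [[1, 1, 0, 1, 0, 0, 0, 0]]
```

In this toy example the whole shadowed half turns black, so the ink pixel there is no longer distinguishable from the paper. A human reader has no trouble with the gray version, but a single global threshold does.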
You can use a config option to make tesseract-ocr dump this binary image.
Here is my real-world experience: on the left is the original (input) image, and on the right is the binary image dumped by tesseract-ocr:
Based on this output, it is clear that I need to do “a little” preprocessing before OCR (or training). I decided to use the Leptonica library for this. There is a good example program named livre_adapt.c, which shows how to normalize a document image for uneven illumination. Here is the result:
If you are interested, you can download through_tesseract-ocr_eye.tar.gz, which contains my images and two source files with simple examples of how to dump the binary image and how to preprocess images with Leptonica.
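If you just want the idea behind normalizing for uneven illumination without reading C code, here is a rough sketch in plain Python. It is not Leptonica's algorithm and not what livre_adapt.c does line by line. It only assumes that the brightest pixel in each small tile is paper, and rescales the tile so that this paper value lands on a common target. The function names, the tile size, the target value of 200, and the sample pixels are all my own choices for illustration.

```python
def normalize_background(image, tile=4, target=200):
    """Rescale each tile so its estimated background maps to `target`.

    The background of a tile is estimated as its brightest pixel, which is
    a crude assumption that only holds if every tile contains some paper.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = range(top, min(top + tile, height))
            cols = range(left, min(left + tile, width))
            background = max(image[y][x] for y in rows for x in cols)
            if background == 0:
                continue  # completely black tile, nothing to rescale
            for y in rows:
                for x in cols:
                    out[y][x] = min(255, image[y][x] * target // background)
    return out


def to_binary(image, threshold):
    """Map each pixel to 1 (white) if it is above the threshold, else 0 (black)."""
    return [[1 if p > threshold else 0 for p in row] for row in image]


# Hypothetical one-row "page": well-lit left half, shadowed right half.
page = [[200, 200, 80, 200, 90, 90, 20, 90]]
flat = normalize_background(page)
print(flat)                  # [[200, 200, 80, 200, 200, 200, 44, 200]]
print(to_binary(flat, 128))  # [[1, 1, 0, 1, 1, 1, 0, 1]]
```

After normalization the paper has the same brightness in both halves, so even a fixed threshold keeps both ink pixels and nothing else. Real images need a smoother background estimate than one maximum per tile, which is where a library like Leptonica earns its keep.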