sk-spell

Slovak language support in open source programs

tesseract training for Slovak Fraktur script   

last change: 24 April 2011


Over the last few weeks I faced an interesting challenge: Zlatý fond (the Slovak counterpart of Project Gutenberg) planned to digitize the short work „Nepi pálenku“, which was printed in Fraktur script. Because I had helped them digitize a few Fraktur pages last year, they asked whether I could help with this work as well.

Zlatý fond uses FineReader, which works great on „standard“ Slovak script but fails on Fraktur. Last year's experience showed that this kind of text must be checked letter by letter. Another problem is that there is no dictionary for such old Slovak, so you cannot use a dedicated spellchecker to improve the result. For this reason I decided to check each page in QT box editor and to extend its features for this kind of task.

Here is part of my experience…

Image quality matters…

The work was part of a very old book that had perhaps not been opened for a hundred years… It could not be scanned; the contributor could only take pictures with a camera. So my input was JPEG files with a resolution of 180 DPI and 16.7 million colours (24 bits per pixel) :-(. I started to make box files with the help of the German Fraktur language data:

tesseract P1050004-Kopie.jpg P1050004-Kopie -l deu-frak batch.nochop makebox

The results were unsatisfactory on most pages. There was a lot of garbage, many wrongly segmented letters, missing boxes… I found that I needed to visualize the boxes to see the result of the „makebox“ command and any improvement after I adjusted the image. This is the story behind the „Show boxes“ button in QT box editor.
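To give an idea of what drawing the boxes involves, here is a minimal Python sketch. It is not the code of QT box editor. It assumes the tesseract 3.x box file layout (`symbol left bottom right top page`, with the origin in the bottom-left corner of the image); the function names, the sample line and the image height are made up for illustration. Most drawing toolkits put the origin in the top-left corner, so the vertical coordinate has to be flipped:

```python
def parse_box_line(line):
    """Split one box file line into (symbol, left, bottom, right, top)."""
    parts = line.split()
    symbol = parts[0]
    left, bottom, right, top = (int(v) for v in parts[1:5])
    return symbol, left, bottom, right, top


def box_to_rect(box, image_height):
    """Convert a box (origin bottom-left) to (x, y, width, height)
    with the origin at the top-left, as most drawing APIs expect."""
    _, left, bottom, right, top = box
    return left, image_height - top, right - left, top - bottom


# Hypothetical sample line and image height, for illustration only.
box = parse_box_line("N 10 20 30 50 0")
print(box_to_rect(box, image_height=100))
```

The resulting rectangle can then be drawn over the page image with whatever toolkit is at hand.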

After several tests I found that these steps can help to improve box creation:

  1. increase DPI to 400
  2. sharpen image
  3. decrease color depth to 256 grayscale
  4. decrease color depth to 128 grayscale
  5. decrease color depth to 16 grayscale
  6. save as png ;-)

This technique is not a cure for all problems: on some images it worked better, on others not so well. I did not reduce the colour depth to 2 colours because then details important for (human) readability were lost.
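The arithmetic behind the resampling and the colour depth steps can be sketched in a few lines of Python. This is only an illustration on single pixel values, not the image editor workflow used for the pages; the function names and the sample pixel are assumptions, the gray conversion uses the common BT.601 luma weights, and sharpening is left out.

```python
def upscale_factor(src_dpi, dst_dpi):
    """Factor by which width and height grow when resampling."""
    return dst_dpi / src_dpi


def to_gray(r, g, b):
    """24-bit RGB pixel to one 0..255 gray value (BT.601 luma weights)."""
    return (299 * r + 587 * g + 114 * b) // 1000


def quantize(gray, levels):
    """Snap a 0..255 gray value to one of `levels` evenly spaced values."""
    step = 255 / (levels - 1)
    return round(round(gray / step) * step)


# 180 DPI photos resampled to 400 DPI grow by a bit more than a factor of 2.
print(upscale_factor(180, 400))

# Hypothetical pixel on the soft edge of a letter stroke.
stroke_edge = to_gray(120, 110, 100)
for levels in (256, 128, 16, 2):
    print(levels, quantize(stroke_edge, levels))
```

With only 2 levels every pixel becomes either pure black or pure white, so the soft edges of thin Fraktur strokes are either swallowed or dropped, which matches the loss of readability described above.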

Spacing matters…

Tesseract has trouble making correct boxes around letters when there is too little space between lines of text or between letters. So splitting and joining symbols were very frequent commands.

Another problem was that the boxes were not always in order. The reason was quite clear: because the page was warped, tesseract was not able to find the baseline correctly. I extracted the final text from the box file (to create a dictionary ;-)), so I needed the boxes to be in order. For this reason I implemented a feature to move rows/symbols within the box file.
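For pages that are not badly warped, the reordering and the text extraction could in principle be automated. The Python sketch below is an assumption about how that might look, not the feature implemented in QT box editor: it groups boxes into rows by vertical position, sorts each row from left to right and inserts a space wherever the horizontal gap exceeds a threshold. The function names and the `word_gap` threshold are hypothetical, and boxes are `(symbol, left, bottom, right, top)` tuples with the origin at the bottom-left.

```python
def group_rows(boxes):
    """Group boxes into rows (top to bottom), each sorted left to right."""
    rows = []
    # Visit boxes from the top of the page down (larger y is higher up).
    for box in sorted(boxes, key=lambda b: -(b[2] + b[4]) / 2):
        center = (box[2] + box[4]) / 2
        for row in rows:
            # Same row if the box center lies within the row's vertical extent.
            if row["bottom"] <= center <= row["top"]:
                row["boxes"].append(box)
                row["bottom"] = min(row["bottom"], box[2])
                row["top"] = max(row["top"], box[4])
                break
        else:
            rows.append({"bottom": box[2], "top": box[4], "boxes": [box]})
    return [sorted(row["boxes"], key=lambda b: b[1]) for row in rows]


def rows_to_text(rows, word_gap):
    """Join symbols row by row; a gap wider than word_gap becomes a space."""
    lines = []
    for row in rows:
        text = ""
        prev_right = None
        for symbol, left, _bottom, right, _top in row:
            if prev_right is not None and left - prev_right > word_gap:
                text += " "
            text += symbol
            prev_right = right
        lines.append(text)
    return "\n".join(lines)


# Hypothetical boxes in scrambled order, for illustration only.
boxes = [("b", 30, 100, 40, 120), ("c", 10, 60, 20, 80),
         ("a", 10, 100, 20, 120), ("d", 50, 60, 60, 80)]
print(rows_to_text(group_rows(boxes), word_gap=15))
```

On a strongly warped page, or where lines almost touch, the row grouping will merge or split lines incorrectly, which is exactly the situation where moving rows and symbols by hand is still needed.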

These points remind me how important it is to create training images in line with the instructions.

I have an idea to „recreate“ the image based on my box file corrections: to increase the space between letters, words and lines… To straighten the text lines (onto the baseline)… At the moment this is not urgent for me. If this feature is interesting to you and you can code it, feel free to send me a patch…
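As a starting point for anyone who wants to try, the spacing half of the idea can be sketched in Python as a pure coordinate mapping. Everything here is hypothetical (the function name, the parameters, the uniform extra gaps); it only computes where each corrected box would move, and it does not copy any pixels or attempt to straighten the baseline. Rows are given top to bottom, each row left to right, as `(symbol, left, bottom, right, top)` tuples with the origin at the bottom-left.

```python
def respace(rows, extra_x, extra_y):
    """Return (old_box, new_box) pairs with extra space between
    symbols (extra_x pixels) and between rows (extra_y pixels)."""
    mapping = []
    row_count = len(rows)
    for r, row in enumerate(rows):
        # Rows higher on the page move up more; the lowest row stays put.
        dy = (row_count - 1 - r) * extra_y
        for i, (symbol, left, bottom, right, top) in enumerate(row):
            # Each symbol is pushed right by the gaps added before it.
            dx = i * extra_x
            old_box = (symbol, left, bottom, right, top)
            new_box = (symbol, left + dx, bottom + dy, right + dx, top + dy)
            mapping.append((old_box, new_box))
    return mapping


# Hypothetical rows, for illustration only.
rows = [[("a", 10, 100, 20, 120), ("b", 30, 100, 40, 120)],
        [("c", 10, 60, 20, 80)]]
for old_box, new_box in respace(rows, extra_x=5, extra_y=10):
    print(old_box, "->", new_box)
```

The target image would have to be larger than the source to hold the shifted boxes, and the pixels inside each old box would then be copied to the new position.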


© sk-spell project
