sk-spell

Slovak language support in open source programs

tesseract-ocr-en: what is eng.traineddata?   

last modified: 19 April 2010


eng.traineddata is a language data file. It can be created with the combine_tessdata tool, which lives in the training directory of the tesseract source tree.
So if you want to create a new language file for German, try this command:

$ training/combine_tessdata /usr/src/tesseract-2.04/tessdata/deu.

I got the following output:

TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 84
Offset for type 2 is -1
Offset for type 3 is 610
Offset for type 4 is 906240
Offset for type 5 is 906889
Offset for type 6 is -1
Offset for type 7 is 949106
Offset for type 8 is -1
Offset for type 9 is 2112106

strace [1] revealed that, to create deu.traineddata, combine_tessdata expects 10 input files in /usr/src/tesseract-2.04/tessdata/.

An offset of “-1” in the output above means that the corresponding file does not exist. So eng.traineddata is a package of files: eng.config, eng.unicharset, eng.unicharambigs, eng.inttemp, eng.pffmtable, eng.normproto, eng.punc-dawg, eng.word-dawg, eng.number-dawg and eng.freq-dawg.
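
The reported offsets can be matched against the file list to see which components were actually packed. The Python sketch below does this for the deu output shown earlier. It assumes that type N corresponds to the N-th file in the order listed above; the tool output does not say so, and the function names are made up for illustration.

```python
import re

# Suffixes in the order the text lists them. Mapping type N to the N-th
# entry is an assumption, not something the tool output states.
SUFFIXES = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
]

LINE_RE = re.compile(r"Offset for type (\d+) is (-?\d+)")


def parse_offsets(output):
    """Return {type_number: offset} parsed from the printed output."""
    return {int(t): int(off) for t, off in LINE_RE.findall(output)}


def describe(output, lang="deu"):
    """Split the components into (present, missing) lists of file names."""
    present, missing = [], []
    for type_no, offset in sorted(parse_offsets(output).items()):
        name = f"{lang}.{SUFFIXES[type_no]}"
        # An offset of -1 marks a component that was not found.
        (missing if offset == -1 else present).append(name)
    return present, missing


# The output quoted in this article.
SAMPLE = """\
Offset for type 0 is -1
Offset for type 1 is 84
Offset for type 2 is -1
Offset for type 3 is 610
Offset for type 4 is 906240
Offset for type 5 is 906889
Offset for type 6 is -1
Offset for type 7 is 949106
Offset for type 8 is -1
Offset for type 9 is 2112106
"""

if __name__ == "__main__":
    present, missing = describe(SAMPLE)
    print("present:", ", ".join(present))
    print("missing:", ", ".join(missing))
```

Under that assumption, the German data set here lacks the config, unicharambigs, punc-dawg and number-dawg files, while the other six were packed.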

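The first 84 bytes are the header of the language file. Bytes 0x0004 to 0x0053 hold the offsets of the individual files; ten offsets in 80 bytes works out to 8 bytes (64 bits) per offset. This also matches the output above, where the first packed file starts at offset 84.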
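
A header like this could be read with Python's struct module. This is a minimal sketch, not the tesseract implementation. It assumes little-endian byte order, that the first four bytes hold a 32-bit entry count, and that ten 8-byte signed offsets follow. The layout is inferred from the 84-byte header size, and the function names are hypothetical.

```python
import struct

NUM_ENTRIES = 10  # ten component slots, as in the combine_tessdata output

# Assumed layout: little-endian, one 4-byte integer followed by ten
# 8-byte signed offsets. "<" disables padding, so the size is 4 + 80.
HEADER_FMT = "<i" + "q" * NUM_ENTRIES
HEADER_SIZE = struct.calcsize(HEADER_FMT)


def read_header(data):
    """Unpack (count, offsets) from the first HEADER_SIZE bytes of data."""
    count, *offsets = struct.unpack_from(HEADER_FMT, data, 0)
    return count, offsets


def build_header(offsets):
    """Pack an offset list into header bytes (used here only for the demo)."""
    return struct.pack(HEADER_FMT, len(offsets), *offsets)


if __name__ == "__main__":
    # Offsets taken from the output quoted in this article.
    sample = [-1, 84, -1, 610, 906240, 906889, -1, 949106, -1, 2112106]
    header = build_header(sample)
    count, offsets = read_header(header)
    print("header size:", HEADER_SIZE)
    print("entries:", count)
    for type_no, offset in enumerate(offsets):
        print(f"type {type_no}: {offset}")
```

To try it on a real file, read the first 84 bytes of deu.traineddata in binary mode and pass them to read_header. If the printed offsets do not match what combine_tessdata reported, the assumed layout is wrong for that tesseract version.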

[1] $ strace training/combine_tessdata \
   /usr/src/tesseract-2.04/tessdata/deu. 2>&1 | grep open | grep deu

© sk-spell project
