The Tesseract-ocr 3.02 code has been available for some time, but there is no information about changes to its training process. In my experience there are some changes. Here are my notes.
The expected filenames/suffixes for creating a ‘traineddata’ file are defined in ccutil/tessdatamanager.h. Short descriptions of these components can be found in the manual page of combine_tessdata.
|offset||filename||type of file||created by||description|
|0||config||text||user||(Optional) Language-specific overrides to default config variables.|
|1||unicharset||text||unicharset_extractor||(Required) The list of symbols that Tesseract recognizes, with properties.|
|2||unicharambigs||text||user||(Optional) This file contains information on pairs of recognized symbols which are often confused.|
|3||inttemp||binary||mftraining||(Required) Character shape templates for each unichar.|
|4||pffmtable||binary/text||mftraining||(Required) The number of features expected for each unichar.|
|5||normproto||text||cntraining||(Required) Character normalization prototypes.|
|6||punc-dawg||dawg||wordlist2dawg||(Optional) A dawg made from punctuation patterns found around words. The “word” part is replaced by a single space.|
|7||word-dawg||dawg||wordlist2dawg||(Optional) A dawg made from dictionary words from the language.|
|8||number-dawg||dawg||wordlist2dawg||(Optional) A dawg made from tokens which originally contained digits. Each digit is replaced by a space character.|
|9||freq-dawg||dawg||wordlist2dawg||(Optional) A dawg made from the most frequent words which would have gone into word-dawg.|
|10||fixed-length-dawgs||dawg||wordlist2dawg||(Optional) Several dawgs of different fixed lengths — useful for languages like Chinese.|
|11||cube-unicharset||text||unknown||(Optional) A unicharset for cube, if cube was trained on a different set of symbols.|
|12||cube-word-dawg||dawg||wordlist2dawg||(Optional) A word dawg for cube’s alternate unicharset. Not needed if Cube was trained with Tesseract’s unicharset.|
|13||shapetable||binary||shapeclustering||(Optional) When present, a shapetable is an extra layer between the character classifier and the word recognizer that allows the character classifier to return a collection of unichar ids and fonts instead of a single unichar-id and font.|
|14||bigram-dawg||dawg||wordlist2dawg||(Optional) A dawg of word bigrams where the words are separated by a space and each digit is replaced by a ?.|
For cube-word-dawg (present in eng, fra) I needed to use cube-unicharset.
cube-unicharset looks like unicharset_extractor v3.00 output.
I did not find unambig-dawg or params-training-mode in any language data file, and there is no description for them.
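The component table can be turned into a small check to run before combine_tessdata. This is only a minimal Python sketch: the required/optional flags are copied from the table above, while the function name check_components and the assumption that all component files sit in one directory as <lang>.<suffix> are mine, for illustration.

```python
import os

# Suffix -> required flag, copied from the component table above.
COMPONENTS = {
    "config": False,
    "unicharset": True,
    "unicharambigs": False,
    "inttemp": True,
    "pffmtable": True,
    "normproto": True,
    "punc-dawg": False,
    "word-dawg": False,
    "number-dawg": False,
    "freq-dawg": False,
    "fixed-length-dawgs": False,
    "cube-unicharset": False,
    "cube-word-dawg": False,
    "shapetable": False,
    "bigram-dawg": False,
}


def check_components(directory, lang):
    """Return (present, missing_required) suffix lists for language `lang`.

    Hypothetical helper: it only looks for files named <lang>.<suffix>
    in `directory`; it does not inspect their contents.
    """
    present = [
        suffix
        for suffix in COMPONENTS
        if os.path.isfile(os.path.join(directory, f"{lang}.{suffix}"))
    ]
    missing_required = [
        suffix
        for suffix, required in COMPONENTS.items()
        if required and suffix not in present
    ]
    return present, missing_required
```

Calling check_components(".", "mic") in the training directory would list which of the files are already in place and which required ones are still missing.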
Let's assume we are training the language “mic” with only one font, “nice”, and the input image “mic.nice.exp1.tif”. Here are the steps we need to take:
echo nice 0 0 0 0 0 >> font_properties – this adds information about the font to the file font_properties
tesseract mic.nice.exp1.tif mic.nice.exp1 batch.nochop makebox – this creates the file mic.nice.exp1.box, which needs to be checked/edited (e.g. in QT Box Editor)
tesseract mic.nice.exp1.tif mic.nice.exp1 nobatch box.train – this creates the files ‘mic.nice.exp1.tr’ and ‘mic.nice.exp1.txt’
unicharset_extractor mic.nice.exp1.box – this creates the file unicharset
shapeclustering -F font_properties -U unicharset mic.nice.exp1.tr – this creates the file shapetable
mftraining -F font_properties -U unicharset mic.nice.exp1.tr – this creates the files ‘pffmtable’ and ‘inttemp’
cntraining mic.nice.exp1.tr – this creates the file normproto
mv unicharset mic.unicharset
mv shapetable mic.shapetable
mv normproto mic.normproto
mv pffmtable mic.pffmtable
mv inttemp mic.inttemp
wordlist2dawg punc_wordlist mic.punc-dawg mic.unicharset
wordlist2dawg words_wordlist mic.word-dawg mic.unicharset
wordlist2dawg number_wordlist mic.number-dawg mic.unicharset
wordlist2dawg frequent_wordlist mic.freq-dawg mic.unicharset
wordlist2dawg bigram_wordlist mic.bigram-dawg mic.unicharset
combine_tessdata mic. – this creates the language data file mic.traineddata that can be used by tesseract for OCR.
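To repeat these steps for another language or font, the command lines can be generated rather than typed. The following is a minimal Python sketch that only builds the argument lists and runs nothing. The function name training_commands and its parameters are hypothetical; the tools, arguments and wordlist file names are the ones from the steps above. The font_properties step is left out because it relies on shell redirection, and the box file still has to be checked by hand before the box.train step.

```python
import shlex

# (wordlist file, dawg suffix) pairs as used in the steps above.
DAWGS = [
    ("punc_wordlist", "punc-dawg"),
    ("words_wordlist", "word-dawg"),
    ("number_wordlist", "number-dawg"),
    ("frequent_wordlist", "freq-dawg"),
    ("bigram_wordlist", "bigram-dawg"),
]


def training_commands(lang, font, exp=1):
    """Build the training command lines for one language and one font.

    Hypothetical helper: returns a list of argv lists, executes nothing.
    """
    base = f"{lang}.{font}.exp{exp}"
    tr_file = f"{base}.tr"
    commands = [
        ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"],
        # The .box file must be checked/edited manually before this step.
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", f"{base}.box"],
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset", tr_file],
        ["mftraining", "-F", "font_properties", "-U", "unicharset", tr_file],
        ["cntraining", tr_file],
    ]
    # Prefix the generated files with the language name.
    for name in ("unicharset", "shapetable", "normproto", "pffmtable", "inttemp"):
        commands.append(["mv", name, f"{lang}.{name}"])
    for wordlist, suffix in DAWGS:
        commands.append(
            ["wordlist2dawg", wordlist, f"{lang}.{suffix}", f"{lang}.unicharset"]
        )
    commands.append(["combine_tessdata", f"{lang}."])
    return commands


if __name__ == "__main__":
    for argv in training_commands("mic", "nice"):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to compare them with the manual steps before running anything.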
I only tested with Latin script, so there could be other findings for e.g. Cyrillic or Asian writing systems…
Several language files in 3.02 include (optional) config files. They suggest a few ways to improve OCR (in case of custom training):
enable_new_segsearch 1 is used in deu, ell, fra, chi_sim, chi_tra, ita, jpn, kor, nld, rus, spa, tha, vie
enable_new_segsearch 0 is used in eng and ben
classify_misfit_junk_penalty 0.125 is used in vie, hin, ben and has this comment: “Add a penalty for non-alphanumerics that are vertically badly positioned”.
language_model_ngram_on 1 is used in ell, chi_sim, chi_tra, jpn, tha, vie
tessedit_load_sublangs eng is used in bin, mal, tel
tessedit_ocr_engine_mode (1 or 2) is used in ara and hin (see OcrEngineMode enum in third_party/tesseract/ccmain/tesseractclass.h).
Ara, hin, kor, chi_tra, chi_sim and jpn have more complex configs. They contain groups of parameters for the new segmentation search, for turning off dictionary-based penalties, for blob filtering thresholds and for forcing word segmentation to reduce the length of blob sequences. IMO these can also be useful when tuning non-Asian languages.
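To try such settings in custom training, a config component (e.g. mic.config, following the <lang>.<suffix> naming used above) can be written before running combine_tessdata. The following is a minimal Python sketch. It assumes one "name value" pair per line and "#" comment lines, which is how I understand Tesseract config files; the function names are hypothetical. The two values shown are simply taken from the list above, and whether they help for a given language is untested.

```python
def render_config(settings):
    """Render a dict of variable -> value as config text, one pair per line."""
    return "".join(f"{name} {value}\n" for name, value in settings.items())


def parse_config(text):
    """Parse config text into a dict of variable -> value (as strings).

    Assumption: blank lines and lines starting with '#' are ignored.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1] if len(parts) > 1 else ""
    return settings


if __name__ == "__main__":
    # Values copied from the configs listed above; not a recommendation.
    example = {
        "enable_new_segsearch": 1,
        "classify_misfit_junk_penalty": 0.125,
    }
    with open("mic.config", "w", encoding="utf-8") as handle:
        handle.write(render_config(example))
```

parse_config can also be used to compare the config files extracted from several existing traineddata files.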
unicharset_extractor does not fill in several pieces of information in unicharset:
After investigating the available traineddata files I found that ‘glyph_metrics’, ‘script’ and ‘direction’ are the same for a given unichar regardless of language, so it is possible to correct this information with a script. ‘direction’ could also be derived from ICU’s enum UCharDirection.
‘mirror’ seems to be related to ‘other_case’: e.g. if “i” has ‘other_case’ = 26 and ‘mirror’ = 15 then “I” has ‘other_case’ = 15 and ‘mirror’ = 26. This should be possible to fix.
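The missing ‘other_case’, ‘direction’ and ‘mirror’ values could be derived roughly as follows. This is only a Python sketch under several assumptions. The numeric codes are the ICU UCharDirection values as I remember them and should be checked against ICU’s uchar.h. ‘mirror’ is set to the unichar’s own id, which matches the “i”/“I” example above but probably not paired punctuation such as brackets. The function names and the plain list of unichars in id order are hypothetical, and reading or writing the unicharset file itself is not covered.

```python
import unicodedata

# Bidi class abbreviation -> numeric value of ICU's UCharDirection enum
# (assumed from memory; verify against uchar.h before relying on it).
ICU_DIRECTION = {
    "L": 0, "R": 1, "EN": 2, "ES": 3, "ET": 4, "AN": 5, "CS": 6,
    "B": 7, "S": 8, "WS": 9, "ON": 10, "LRE": 11, "LRO": 12,
    "AL": 13, "RLE": 14, "RLO": 15, "PDF": 16, "NSM": 17, "BN": 18,
}


def icu_direction(unichar):
    """Return the assumed UCharDirection code for the first code point.

    Returns None when Python's Unicode database has no bidi class for it.
    """
    return ICU_DIRECTION.get(unicodedata.bidirectional(unichar[0]))


def fill_properties(unichars):
    """Return (unichar, other_case, direction, mirror) per unichar.

    `unichars` is a list of unichar strings in id order. other_case is the
    id of the case-swapped unichar if it is in the list, else the own id.
    mirror is simply the own id (assumption, see the note above).
    """
    ids = {unichar: index for index, unichar in enumerate(unichars)}
    rows = []
    for index, unichar in enumerate(unichars):
        other_case = ids.get(unichar.swapcase(), index)
        rows.append((unichar, other_case, icu_direction(unichar), index))
    return rows


if __name__ == "__main__":
    for row in fill_properties(["a", "I", "i", "1"]):
        print(row)
```

Since ‘glyph_metrics’ and ‘script’ appear to be the same across languages, they could instead be copied from the unicharset of an existing traineddata file, which this sketch does not attempt.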
For shapeclustering and mftraining you can add the option -X xheights. I tried using an xheights file but I did not find any difference in the shapeclustering and mftraining outputs…
strace shows that these tools also look for the file mic.nice.exp1.fontinfo, whose structure is not documented at the moment.