sk-spell

Slovak language support in open source programs

First notes on tesseract-ocr 3.02 training

Last change: 4 June 2012

The tesseract-ocr 3.02 code has been available for some time, but there is no information about changes to its training process. In my experience the process has changed in several places; these are my notes.

Expected files

The filenames/suffixes expected when creating a ‘traineddata’ file are defined in ccutil/tessdatamanager.h. Short descriptions of these components can be found in the combine_tessdata manual page.

offset	filename	type of file	created by	description
0	config	text	user	(Optional) Language-specific overrides to default config variables.
1	unicharset	text	unicharset_extractor	(Required) The list of symbols that Tesseract recognizes, with properties.
2	unicharambigs	text	user	(Optional) This file contains information on pairs of recognized symbols which are often confused.
3	inttemp	binary	mftraining	(Required) Character shape templates for each unichar.
4	pffmtable	binary/text	mftraining	(Required) The number of features expected for each unichar.
5	normproto	text	cntraining	(Required) Character normalization prototypes
6	punc-dawg	dawg	wordlist2dawg	(Optional) A dawg made from punctuation patterns found around words. The “word” part is replaced by a single space.
7	word-dawg	dawg	wordlist2dawg	(Optional) A dawg made from dictionary words from the language.
8	number-dawg	dawg	wordlist2dawg	(Optional) A dawg made from tokens which originally contained digits. Each digit is replaced by a space character.
9	freq-dawg	dawg	wordlist2dawg	(Optional) A dawg made from the most frequent words which would have gone into word-dawg.
10	fixed-length-dawgs	dawg	wordlist2dawg	(Optional) Several dawgs of different fixed lengths — useful for languages like Chinese.
11	cube-unicharset	text	unknown	(Optional) A unicharset for cube, if cube was trained on a different set of symbols.
12	cube-word-dawg	dawg	wordlist2dawg	(Optional) A word dawg for cube’s alternate unicharset. Not needed if Cube was trained with Tesseract’s unicharset.
13	shapetable	binary	shapeclustering	(Optional) When present, a shapetable is an extra layer between the character classifier and the word recognizer that allows the character classifier to return a collection of unichar ids and fonts instead of a single unichar-id and font.
14	bigram-dawg	dawg	wordlist2dawg	(Optional) A dawg of word bigrams where the words are separated by a space and each digit is replaced by a ?.
15	unambig-dawg	dawg	wordlist2dawg	(Optional)
16	params-training-model	unknown	unknown	(Optional)

I was able to create wordlists from the dawg files with the dawg2wordlist tool, except for fixed-length-dawgs (present in chi_sim, chi_tra, jpn).

For cube-word-dawg (present in eng, fra) I needed to use cube-unicharset.

cube-unicharset looks like unicharset_extractor v3.00 output.

I did not find unambig-dawg or params-training-model in any language data file, and neither is described anywhere.
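The dawg inspection described above can be sketched roughly as follows. This is only a sketch: `run`, `dump_dawgs` and `DRY_RUN` are names made up for this example, the argument order of dawg2wordlist (unicharset, dawg, output wordlist) is taken from its manual page as I understand it, and not every language ships every component, so some commands will fail on a real run. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: unpack a traineddata file and dump its DAWGs as word lists.
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

dump_dawgs() {
    lang=$1
    # Split e.g. eng.traineddata into eng.unicharset, eng.word-dawg, ...
    run combine_tessdata -u "$lang.traineddata" "$lang."
    for dawg in punc-dawg word-dawg number-dawg freq-dawg bigram-dawg; do
        run dawg2wordlist "$lang.unicharset" "$lang.$dawg" "$lang.$dawg.txt"
    done
    # cube-word-dawg has to be decoded against cube's own unicharset.
    run dawg2wordlist "$lang.cube-unicharset" "$lang.cube-word-dawg" \
        "$lang.cube-word-dawg.txt"
}

dump_dawgs eng
```

Run it with DRY_RUN=0 in a directory that contains eng.traineddata to actually extract the files.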

Programs

tesseract – main program; used for OCR, box creating, and training
unicharset_extractor – extract unicharset from Tesseract boxfiles
shapeclustering – shape clustering training for Tesseract
mftraining – feature training for Tesseract
cntraining – character normalization training for Tesseract
wordlist2dawg – convert a wordlist to a DAWG for Tesseract
dawg2wordlist – convert a Tesseract DAWG to a wordlist
combine_tessdata – combine/extract/overwrite Tesseract data
ambiguous_words – generate sets of words Tesseract is likely to find ambiguous
classifier_tester – tests a character classifier on data as formatted for training, but doesn’t have to be the same as the training data.

Training process

Let's assume we are training language “mic” with a single font “nice” and the input image “mic.nice.exp1.tif”. Here are the steps we need to take:

echo nice 0 0 0 0 0 >> font_properties – this will add information about font to file font_properties
tesseract mic.nice.exp1.tif mic.nice.exp1 batch.nochop makebox – this creates the file mic.nice.exp1.box, which needs to be checked and edited (e.g. in QT Box Editor)
tesseract mic.nice.exp1.tif mic.nice.exp1 nobatch box.train – this will create files ‘mic.nice.exp1.tr’ and ‘mic.nice.exp1.txt’
unicharset_extractor mic.nice.exp1.box – this will create file unicharset
shapeclustering -F font_properties -U unicharset mic.nice.exp1.tr – this will create file shapetable
mftraining -F font_properties -U unicharset mic.nice.exp1.tr – this will create files ‘pffmtable’ and ‘inttemp’
cntraining mic.nice.exp1.tr – this will create file normproto
rename filenames:
- mv unicharset mic.unicharset
- mv shapetable mic.shapetable
- mv normproto mic.normproto
- mv pffmtable mic.pffmtable
- mv inttemp mic.inttemp
create dictionaries (optional):
- wordlist2dawg punc_wordlist mic.punc-dawg mic.unicharset
- wordlist2dawg words_wordlist mic.word-dawg mic.unicharset
- wordlist2dawg number_wordlist mic.number-dawg mic.unicharset
- wordlist2dawg frequent_wordlist mic.freq-dawg mic.unicharset
- wordlist2dawg bigram_wordlist mic.bigram-dawg mic.unicharset
combine_tessdata mic. – creates the language data file mic.traineddata that can be used by tesseract for OCR.
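The steps after box editing could be collected into one script, roughly like this. It is a sketch under assumptions: `run`, `train` and `DRY_RUN` are helper names invented for this example, font_properties is expected to contain the line for “nice” already, mic.nice.exp1.box must have been corrected by hand, and the optional dictionaries are left out. By default it only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch of the training steps for language "mic" and font "nice".
LANG_CODE=mic
BASE="$LANG_CODE.nice.exp1"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@" || exit 1    # stop at the first failing step
    fi
}

train() {
    run tesseract "$BASE.tif" "$BASE" nobatch box.train
    run unicharset_extractor "$BASE.box"
    run shapeclustering -F font_properties -U unicharset "$BASE.tr"
    run mftraining -F font_properties -U unicharset "$BASE.tr"
    run cntraining "$BASE.tr"
    # Prefix the generated files with the language code.
    for f in unicharset shapetable normproto pffmtable inttemp; do
        run mv "$f" "$LANG_CODE.$f"
    done
    run combine_tessdata "$LANG_CODE."
}

train
```

Printing first makes it easy to compare the command list with the steps above before running it with DRY_RUN=0.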

Comments

I tested only with Latin script, so other writing systems (e.g. Cyrillic or Asian scripts) may lead to different findings…

Several language files in 3.02 include (optional) config files. They suggest a few settings that could improve OCR with custom training:

enable_new_segsearch 1 is used in deu, ell, fra, chi_sim, chi_tra, ita, jpn, kor, nld, rus, spa, tha, vie
enable_new_segsearch 0 is used in eng and ben
classify_misfit_junk_penalty 0.125 is used in vie, hin, ben and has this comment: “Add a penalty for non-alphanumerics that are vertically badly positioned”.
language_model_ngram_on 1 is used in ell, chi_sim, chi_tra, jpn, tha, vie
tessedit_load_sublangs eng is used in bin, mal, tel
tessedit_ocr_engine_mode (1 or 2) is used in ara and hin (see OcrEngineMode enum in third_party/tesseract/ccmain/tesseractclass.h).

Ara, hin, kor, chi_tra, chi_sim and jpn have more complex configs. They contain groups of parameters for the new segmentation search, for turning off dictionary-based penalties, for blob filtering thresholds, and for forcing word segmentation to reduce the length of blob sequences. IMO these can also be useful when tuning non-Asian languages.
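For a custom training, a config component might be written like this. It is just a sketch for the hypothetical language “mic”: the two settings are picked from the shipped configs listed above, and whether they help for a given language has to be tested.

```shell
#!/bin/sh
# Sketch: write an optional config component for the language "mic".
# One "name value" pair per line, as in the shipped 3.02 config files.
cat > mic.config <<'EOF'
enable_new_segsearch 1
language_model_ngram_on 1
EOF

# combine_tessdata mic.   # would then pack mic.config with the other mic.* files
```

The file name follows the ‘config’ suffix from the table of expected files, so combine_tessdata should pick it up along with the other mic.* components.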

unicharset_extractor leaves several fields in unicharset unfilled:

‘glyph_metrics’ is always ’0,255,0,255,0,32767,0,32767,0,32767’ in my output
‘script’ is always NULL
‘direction’ and ‘mirror’ are always 0.

After examining the available traineddata files I found that ‘glyph_metrics’, ‘script’ and ‘direction’ are the same for a given unichar regardless of language, so this information can be corrected with a script. ‘direction’ could also be derived from ICU’s enum UCharDirection.

‘mirror’ seems to be related to ‘other_case’: e.g. if “i” has ‘other_case’ = 26 and ‘mirror’ = 15, then “I” has ‘other_case’ = 15 and ‘mirror’ = 26. This should be possible to fix.
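Before writing a script that fixes these fields, it may help to count how many entries still carry the placeholder values. The sketch below does this for ‘glyph_metrics’ only. It assumes the 3.02 field order (unichar, properties, glyph_metrics, script, other_case, direction, mirror, normed form) with the entry count on the first line; `count_default_metrics` is a name made up for this example and the three-entry sample is invented just to exercise it.

```shell
#!/bin/sh
# Sketch: count unicharset entries that still carry the placeholder
# glyph_metrics written by unicharset_extractor.
DEFAULT_METRICS="0,255,0,255,0,32767,0,32767,0,32767"

count_default_metrics() {
    # Skip line 1 (entry count); field 3 is assumed to be glyph_metrics.
    awk -v d="$DEFAULT_METRICS" \
        'NR > 1 && $3 == d { n++ } END { print n + 0 }' "$1"
}

# Made-up sample, not taken from any real traineddata file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
3
NULL 0 NULL 0
a 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 2 0 0 a
A 5 0,255,0,255,0,32767,0,32767,0,32767 NULL 1 0 0 A
EOF

count_default_metrics "$sample"
```

The same function can be pointed at mic.unicharset after running unicharset_extractor.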

For shapeclustering and mftraining you can add the option -X xheights. I tried using an xheights file but found no difference in the shapeclustering and mftraining outputs…

strace shows that these tools also look for the file mic.nice.exp1.fontinfo, whose structure is not documented at the moment.