Slovak language support in Open Source programs

tesseract-ocr-en: creating dictionaries

last change: 18 May 2010


dawg files

As explained in my article “test – what is eng.traineddata?”, tesseract 3.00 expects several dawg (Directed Acyclic Word Graph) dictionaries.

These files are created from simple UTF-8 text files (one word per line) by the program wordlist2dawg. Its arguments are the input word list, the output dawg file and, as the last one, the unicharset file. So for Slovak I run:

$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg number \
   slk.number-dawg slk.unicharset
$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg punc \
   slk.punc-dawg slk.unicharset
$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg word_list \
   slk.word-dawg slk.unicharset
$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg frequency_list \
   slk.freq-dawg slk.unicharset

A dictionary helps to improve the OCR result. For example, in some fonts it is difficult for OCR software to distinguish between “l” and “1”. In such cases the dictionary can help: the OCR result will be “all” rather than “a11” (provided that “all” is in the dictionary and “a11” is not).
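The idea can be illustrated with a toy shell function. This is only a sketch of the principle, not how tesseract decides internally: pick_dictionary_word is a hypothetical helper, and it only knows the single confusion “1” versus “l”.

```shell
# Hypothetical helper (not part of tesseract): prefer a dictionary word
# over a raw OCR guess. $1 is the guess, $2 a one-word-per-line word list.
pick_dictionary_word() {
    guess=$1
    word_list=$2
    # Candidate spelling with every digit "1" read as the letter "l".
    candidate=$(printf '%s' "$guess" | tr '1' 'l')
    if grep -Fxq -- "$guess" "$word_list"; then
        # The raw guess is already a dictionary word.
        printf '%s\n' "$guess"
    elif grep -Fxq -- "$candidate" "$word_list"; then
        # Only the corrected spelling is a dictionary word.
        printf '%s\n' "$candidate"
    else
        # No dictionary evidence either way: keep the raw guess.
        printf '%s\n' "$guess"
    fi
}
```

Called as `pick_dictionary_word a11 word_list`, it prints the dictionary word only when the word list contains it; otherwise the guess is returned unchanged.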

In tesseract 3.00 the dawg dictionaries are optional files (with version 2.04 you must have the dictionary files, otherwise tesseract does not work).

If you decide to create a dictionary, there must be at least one word in the input file. An input file can easily be created from Wikipedia. Other good sources are spellcheckers, translation dictionaries or other open linguistic projects, but pay attention to the license conditions of the data.
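A minimal sketch of how such an input file could be prepared from plain text, assuming the text has already been extracted into a UTF-8 file. The function name make_word_list is hypothetical. It removes ASCII punctuation and digits only, so words with an apostrophe or hyphen lose that character, and the result should be reviewed by hand.

```shell
# Hypothetical helper: turn a plain UTF-8 text file into a sorted,
# de-duplicated list with one word per line.
make_word_list() {
    # 1. Put every whitespace-separated token on its own line.
    # 2. Delete ASCII punctuation and digits (UTF-8 letters are kept,
    #    because the C locale treats their bytes as neither).
    # 3. Drop lines that became empty.
    # 4. Sort and remove duplicates.
    LC_ALL=C tr -s '[:space:]' '\n' < "$1" \
        | LC_ALL=C tr -d '[:punct:][:digit:]' \
        | grep -v '^$' \
        | LC_ALL=C sort -u
}
```

For example, `make_word_list corpus.txt > word_list` would produce a file in the one-word-per-line form described above (corpus.txt is a placeholder name).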

If you need to turn off a dawg file or to increase the debug verbosity for a lang.traineddata file, you can use the following variables:

variable                              default  comment
global_load_punc_dawg                 true     Load dawg with punctuation patterns.
global_load_number_dawg               true     Load dawg with number patterns.
global_load_freq_dawg                 true     Load frequent word dawg.
global_load_system_dawg               true     Load system word dawg.
global_tessdata_manager_debug_level   0        Debug level for TessdataManager functions.

ambiguity file – lang.unicharambigs

According to Training Tesseract 2.04, this file is created manually. It represents the intrinsic ambiguity between characters or sets of characters. It is an optional file (i.e. you can skip it when creating lang.traineddata).

Here is an example of a few lines from eng.unicharambigs:

2	' '	1	"	1
2	` ’	1	"	1
2	’ `	1	"	1
2	‘ ‘	1	“	1
2	‘ ’	1	"	1
2	’ ‘	1	"	1
2	’ ’	1	”	1
2	, ,	1	„	1
1	m	2	r n	0
2	r n	1	m	0
1	m	2	i n	0

For tesseract 3.00 there are some changes:

There are several rules for this file:

  1. all characters used in the second and fourth columns must be present in the lang.unicharset file
  2. a tab (\t) is the separator between columns
  3. a space is the separator between characters within the second and fourth columns
  4. each line (including the last line!) must end with a (Unix?) end-of-line, i.e. you must press “ENTER”; otherwise combine_tessdata will produce an error (last_char == '\n':Error:Assert failed:in file tessdatamanager.cpp, line 92) (updated on 18.05.2010)
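Rules 2 to 4 can be checked before running combine_tessdata. The following is a rough sketch rather than an official tool: check_unicharambigs is a hypothetical helper, and it assumes, based on the sample lines above, that every line has five columns and that the first and third columns give the number of characters in the second and fourth. Rule 1 (presence in lang.unicharset) is not checked.

```shell
# Hypothetical helper (not part of tesseract): basic format checks
# for a lang.unicharambigs file.
check_unicharambigs() {
    file=$1

    # Rule 4: the very last byte must be a newline. Command substitution
    # strips a trailing newline, so a non-empty result means it is missing.
    if [ -n "$(tail -c 1 "$file")" ]; then
        echo "$file: last line does not end with a newline" >&2
        return 1
    fi

    # Rules 2 and 3: tabs separate columns, spaces separate characters.
    # Assumption: five columns per line, with columns 1 and 3 holding the
    # character counts of columns 2 and 4.
    awk -F '\t' '
        NF != 5 {
            printf "line %d: expected 5 tab-separated columns, found %d\n", NR, NF
            bad = 1
            next
        }
        split($2, a, " ") != $1 || split($4, b, " ") != $3 {
            printf "line %d: character count does not match column contents\n", NR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$file" >&2
}
```

Running `check_unicharambigs slk.unicharambigs` returns a non-zero status and prints the offending line numbers if a check fails. If the file starts with a header line, that line would have to be skipped first.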

If you are interested in the development of lang.unicharambigs, please have a look at the unicharambigs files extracted from the tesseract 3.00 lang.traineddata files. Files for the following languages are present in this package:


© sk-spell project
