Slovak language support in Open Source programs

tesseract-ocr-en: can I use my data for 2.04?

last change: 19 April 2010


first info

Based on my tests, it looks like only a few files are missing and I can reuse most of the data I trained for tesseract 2.04.

So I created these files for Slovak:

When I analysed the existing language files for tesseract 3.00, I found that an xxx.config file is not present in any of them. So I believe it can be skipped for the moment.

xxx.unicharambigs is present in only a few language files (deu, ell, eng, fra, ita, nld, rus, spa). Judging by its content, it looks like a new version of DangAmbigs with a version line and an additional column:

2       ' '     1       "       1
2       ` '     1       "       1
2       ' `     1       "       1
2       ‘ '     1       "       1
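
The column layout above can be sanity-checked with a short shell sketch. It assumes, based only on the sample lines, that the columns are tab-separated, that column 1 is the number of space-separated tokens in column 2, and that column 3 is the number of tokens in column 4. The function name check_ambigs is made up for this illustration, and a single-field line is skipped on the guess that it is the version line.

```shell
# Hypothetical helper: report lines whose declared counts do not match the
# number of space-separated tokens. Assumes tab-separated columns.
check_ambigs() {
    awk -F '\t' '
        NF <= 1 { next }    # assumed version line or blank line: skip
        {
            n_src = split($2, src, " ")
            n_dst = split($4, dst, " ")
            if (NF != 5 || n_src != $1 || n_dst != $3) {
                printf "line %d looks malformed: %s\n", NR, $0
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Running check_ambigs slk.unicharambigs prints nothing and returns 0 when every line matches this assumed layout.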

For a first test:

xxx.punc-dawg (punctuation dictionary?) and xxx.number-dawg (number dictionary?) look like further Directed Acyclic Word Graph dictionaries. It is enough if each contains one word (based on information from DangAmbigs). For the first test I ignored them (numbers and punctuation are in my old slk.word-dawg).

slk.user-words is not used by combine_tessdata.

combining & installation of trained data

The following command produced slk.traineddata without problems:

$ training/combine_tessdata /Projekty/tesseract/tesseract-slovak3/slk.

TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 84
Offset for type 2 is 2155
Offset for type 3 is 2965
Offset for type 4 is 1203323
Offset for type 5 is 1204434
Offset for type 6 is -1
Offset for type 7 is 1228239
Offset for type 8 is -1
Offset for type 9 is 8997547
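
To see at a glance which components did not make it into the combined file, the log above can be filtered. The sketch below assumes that an offset of -1 marks a component that combine_tessdata did not find; this seems plausible because three types show -1 and three files (config, punc-dawg, number-dawg) were skipped, but it is not confirmed here. The name missing_types is invented for this example.

```shell
# Hypothetical helper: read combine_tessdata output on stdin and print the
# type numbers whose offset is -1 (assumed to mean "component not included").
missing_types() {
    awk '$1 == "Offset" && $NF == -1 { print $4 }'
}

# Usage (path as in the command above; 2>&1 in case the log goes to stderr):
# training/combine_tessdata /Projekty/tesseract/tesseract-slovak3/slk. 2>&1 | missing_types
```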

Then I installed it:

$ sudo cp -f /Projekty/tesseract/tesseract-slovak3/slk.traineddata \

tests and troubleshooting

The first test revealed that something was wrong:

$ /usr/local/bin/tesseract eurotext.tif eurotext -l slk

unicharset_size > 0:Error:Assert failed:in file dawg.cpp, line 140
Segmentation fault

So I decided to create all the slk dawg files with tesseract 3.00 (until then I had used files created with tesseract 2.04). I found that the new version of wordlist2dawg (located in the training directory) needs more arguments than the version in 2.04:

Usage: training/wordlist2dawg [-t] word_list_file dawg_file unicharset_file

So I split the old slk.word_list into slk.number, slk.punc and slk.word_list, and created new dictionaries:

$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg number \
   slk.number-dawg slk.unicharset
$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg punc \
   slk.punc-dawg slk.unicharset
$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg word_list \
   slk.word-dawg slk.unicharset
$ /usr/src/tesseract-ocr-r319/training/wordlist2dawg frequency_list \
   slk.freq-dawg slk.unicharset
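
One way to do the split mentioned above is by pattern. This is only a sketch: the criteria (digits-only lines are numbers, punctuation-only lines are punctuation, everything else is a word) are an assumption made for this illustration, not necessarily how the original lists were split, and split_wordlist is a made-up name. The output file names follow the ones used in the commands above.

```shell
# Hypothetical helper: split one word list (one entry per line) into three files.
# Usage: split_wordlist old_list out_dir
split_wordlist() {
    src=$1
    dest=$2
    # LC_ALL=C keeps the character classes ASCII-only, so accented Slovak
    # letters are never mistaken for punctuation.
    LC_ALL=C grep -E  '^[0-9]+$'                "$src" > "$dest/number"
    LC_ALL=C grep -E  '^[[:punct:]]+$'          "$src" > "$dest/punc"
    LC_ALL=C grep -Ev '^([0-9]+|[[:punct:]]+)$' "$src" > "$dest/word_list"
    return 0   # grep exits non-zero when a list ends up empty; not an error here
}
```

Entries that mix digits and letters stay in word_list under these rules; adjust the patterns if the number dictionary should also hold forms such as dates or decimals.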

After this change I got a new error:

Tesseract Open Source OCR Engine with Leptonica
index >= 0 && index < size_used_:Error:Assert failed:in file ../ccutil/genericvector.h, line 215
Segmentation fault

After a few checks I found that there was some problem with slk.unicharambigs, so I simply removed it.

Then I created and installed slk.traineddata once again. This time tesseract worked with my slk.traineddata.


© projekt sk-spell
