sk-spell

Slovak language support in Open Source programs

tesseract-ocr-en: Clustering and Compute the Character Set   

last modified: 28 April 2010

back to tesseract-ocr-en

The Training Tesseract wiki page instructs you to run the mftraining and cntraining programs after creating the training files.

mftraining – first attempt

I ran it as described on the wiki:

$ /usr/local/bin/mftraining *.tr

It produced the following message:

Failed to load unicharset from file unicharset
Building unicharset for mftraining from scratch…
Reading slk.arial.001.tr …
slk.arial.001 has no defined properties.
Reading slk.times.001.tr …
slk.times.001 has no defined properties.

strace revealed that mftraining tried to open the unicharset file before clustering the training files.
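A minimal sketch of how such a check could look. The exact strace invocation I used is not recorded here, so the options in the comment and the helper name missing_opens are assumptions, not a transcript:

```shell
# Hypothetical helper: keep only failed open()/openat() calls from strace output.
# strace prints a failed lookup as "... = -1 ENOENT (No such file or directory)".
missing_opens() {
    grep -E 'open(at)?\(.*ENOENT'
}

# Assumed usage (strace writes its trace to stderr, hence 2>&1):
#   strace -f -e trace=file /usr/local/bin/mftraining *.tr 2>&1 | missing_opens
```

A line mentioning unicharset in the filtered output would point to the missing file.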

So we need to run unicharset_extractor first.

unicharset_extractor

$ /usr/local/bin/unicharset_extractor *.box

This command printed a short message:

Extracting unicharset from slk.arial.001.box
Extracting unicharset from slk.times.001.box
Wrote unicharset file ./unicharset.

unicharset_extractor created one new file – unicharset.

mftraining – second attempt

Then I ran mftraining again, with the following output:

Reading slk.arial.001.tr …
slk.arial.001 has no defined properties.
Reading slk.times.001.tr …
slk.times.001 has no defined properties.


Warning: no protos/configs for / in CreateIntTemplates()
Error: no configs for class / in mftraining
Writing Merged Microfeat …Done!

mftraining created these files: inttemp, pffmtable, Microfeat and mfunicharset.

cntraining

$ /usr/local/bin/cntraining *.tr

The command showed this message:

Reading slk.arial.001.tr …
Reading slk.times.001.tr …
Clustering …

Writing normproto …

cntraining created just one new file – normproto.
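The order that worked for me (unicharset_extractor, then mftraining, then cntraining) can be sketched as one shell function. This is only a sketch: the function name run_training and the TESS_BIN variable are my own inventions, and instead of trusting exit codes (which I have not verified for these tools) it checks for the output files named on this page.

```shell
# Hypothetical wrapper: run the three tools in the order that worked above.
# TESS_BIN is an assumed variable pointing at the directory with the binaries.
run_training() {
    bin=${TESS_BIN:-/usr/local/bin}

    # 1. Build the character set first, otherwise mftraining cannot load it.
    "$bin/unicharset_extractor" *.box
    [ -f unicharset ] || { echo "unicharset was not created" >&2; return 1; }

    # 2. Cluster the shape features.
    "$bin/mftraining" *.tr
    [ -f inttemp ] && [ -f pffmtable ] || { echo "inttemp/pffmtable missing" >&2; return 1; }

    # 3. Cluster the character normalization features.
    "$bin/cntraining" *.tr
    [ -f normproto ] || { echo "normproto was not created" >&2; return 1; }
}
```

Run it in the directory that holds the .box and .tr files. Note that leftover files from an earlier run would also satisfy the checks, so start from a clean directory.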

renaming

At the end of training, I need to rename the output of the previous steps (slk is my language code):

$ mv unicharset slk.unicharset
$ mv inttemp slk.inttemp
$ mv pffmtable slk.pffmtable
$ mv normproto slk.normproto

Note that the mfunicharset and Microfeat files are not part of the final language data, so they are not renamed.
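The four mv commands can also be written as a loop, which is handy when training more than one language. A minimal sketch; rename_outputs is a hypothetical helper name, and it checks that all four files exist before moving any of them:

```shell
# Hypothetical helper: prefix the four training outputs with a language code.
rename_outputs() {
    lang=$1
    # Refuse to start if any expected file is missing, to avoid a partial rename.
    for f in unicharset inttemp pffmtable normproto; do
        [ -f "$f" ] || { echo "missing: $f" >&2; return 1; }
    done
    for f in unicharset inttemp pffmtable normproto; do
        mv "$f" "$lang.$f"
    done
}
```

Calling rename_outputs slk in the training directory has the same effect as the commands above.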

© sk-spell project
