Update of language files for tesseract-ocr 3.04   

Last change: 30 June 2015

There was a huge update of the tesseract-ocr language files on 24.06.2015: 98 traineddata files were updated or uploaded for the first time. At the moment 105 languages or language variants are supported (plus 2 special modules, osd and equ). The corresponding source training data were committed to the langdata repository.

The traineddata files for ara, eng, hin, kor, osd and equ were NOT updated due to regressions. The other regressions are mostly fixed, with some dramatic improvements, particularly for Indic languages (for example, around 20% for kan).

The cube files were not updated, because cube is a dead end and will be removed once the new classifier is implemented.

Language files

Language files are located in the separate tessdata repository. The total file size is 1.2 GiB at the moment. If you plan to clone the whole repository, expect it to need much more space, because git keeps the entire history (i.e. previous versions of the files) in the local copy.

Therefore it is more efficient to download only the language files you need. I created a small program, get_tessdata.cpp, that downloads and installs the desired file for you. It is not smart (e.g. there could be problems with a proxy, and you cannot choose the file version), but I hope it helps. Usage is simple (after compilation ;-) ), e.g.:

sudo ./get_tessdata -f fra.traineddata

Feel free to improve or replace it ;-) and share it on the tesseract-ocr user forum.
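
For readers who would rather not compile anything, here is a minimal Python sketch of the same idea: fetch one file and place it in the tessdata directory. It is not a port of get_tessdata.cpp. The base URL (including the branch name), the default install directory and the function names tessdata_url and fetch_tessdata are all assumptions made for illustration; files compatible with 3.04 may live under a different branch or tag.

```python
import os
import shutil
import urllib.request

# ASSUMPTION: location of the raw files in the tessdata repository.
# The post does not give a URL; change the branch or tag as needed.
BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/main/"

# ASSUMPTION: a common tessdata location on Linux; yours may differ.
DEFAULT_DIR = "/usr/share/tesseract-ocr/tessdata"


def tessdata_url(filename, base_url=BASE_URL):
    """Build the download URL for one file, e.g. 'fra.traineddata'."""
    if not filename or "/" in filename:
        raise ValueError("expected a bare file name, got %r" % (filename,))
    return base_url.rstrip("/") + "/" + filename


def fetch_tessdata(filename, target_dir=DEFAULT_DIR, base_url=BASE_URL):
    """Download one file into target_dir and return the local path."""
    os.makedirs(target_dir, exist_ok=True)
    destination = os.path.join(target_dir, filename)
    partial = destination + ".part"
    # Write to a temporary name first so that an interrupted download
    # never leaves a truncated .traineddata file behind.
    with urllib.request.urlopen(tessdata_url(filename, base_url)) as response:
        with open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
    os.replace(partial, destination)
    return destination
```

Calling fetch_tessdata("fra.traineddata") would then roughly correspond to the command above; writing into a system directory needs the same privileges as the sudo call. To try it without root, pass a target_dir of your own.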

Here is the list of files available in the tessdata repository as of 29.06.2015:

lang code lang name file size link
afr Afrikaans 5.0 MiB afr.traineddata
amh Amharic 2.8 MiB amh.traineddata
ara Arabic 99.5 KiB ara.cube.bigrams
ara Arabic 4.0 B ara.cube.fold
ara Arabic 241.0 B ara.cube.lm
ara Arabic 820.7 KiB ara.cube.nn
ara Arabic 251.0 B ara.cube.params
ara Arabic 19.1 MiB ara.cube.size
ara Arabic 1.2 MiB ara.cube.word-freq
ara Arabic 6.0 MiB ara.traineddata
asm Assamese 15.1 MiB asm.traineddata
aze Azerbaijani 6.3 MiB aze.traineddata
aze_cyrl Azerbaijani – Cyrillic 2.7 MiB aze_cyrl.traineddata
bel Belarusian 6.5 MiB bel.traineddata
ben Bengali 14.8 MiB ben.traineddata
bod Tibetan 24.1 MiB bod.traineddata
bos Bosnian 5.2 MiB bos.traineddata
bul Bulgarian 5.7 MiB bul.traineddata
cat Catalan; Valencian 5.1 MiB cat.traineddata
ceb Cebuano 1.6 MiB ceb.traineddata
ces Czech 11.3 MiB ces.traineddata
chi_sim Chinese – Simplified 40.1 MiB chi_sim.traineddata
chi_tra Chinese – Traditional 54.1 MiB chi_tra.traineddata
chr Cherokee 1.0 MiB chr.traineddata
cym Welsh 3.6 MiB cym.traineddata
dan Danish 7.0 MiB dan.traineddata
dan_frak Danish – Fraktur 1.5 MiB dan_frak.traineddata
deu German 12.7 MiB deu.traineddata
deu_frak German – Fraktur 1.9 MiB deu_frak.traineddata
dzo Dzongkha 3.2 MiB dzo.traineddata
ell Greek, Modern (1453-) 5.2 MiB ell.traineddata
eng English 167.9 KiB eng.cube.bigrams
eng English 38.0 B eng.cube.fold
eng English 181.0 B eng.cube.lm
eng English 837.2 KiB eng.cube.nn
eng English 254.0 B eng.cube.params
eng English 12.4 MiB eng.cube.size
eng English 2.3 MiB eng.cube.word-freq
eng English 996.0 B eng.tesseract_cube.nn
eng English 20.9 MiB eng.traineddata
enm English, Middle (1100-1500) 2.0 MiB enm.traineddata
epo Esperanto 6.3 MiB epo.traineddata
equ Math / equation detection module 2.1 MiB equ.traineddata
est Estonian 9.2 MiB est.traineddata
eus Basque 4.7 MiB eus.traineddata
fas Persian 4.6 MiB fas.traineddata
fin Finnish 12.7 MiB fin.traineddata
fra French 127.0 KiB fra.cube.bigrams
fra French 59.0 B fra.cube.fold
fra French 301.0 B fra.cube.lm
fra French 949.5 KiB fra.cube.nn
fra French 242.0 B fra.cube.params
fra French 18.4 MiB fra.cube.size
fra French 2.8 MiB fra.cube.word-freq
fra French 660.0 B fra.tesseract_cube.nn
fra French 13.4 MiB fra.traineddata
frk Frankish 15.7 MiB frk.traineddata
frm French, Middle (ca.1400-1600) 15.1 MiB frm.traineddata
gle Irish 3.3 MiB gle.traineddata
glg Galician 5.3 MiB glg.traineddata
grc Greek, Ancient (to 1453) 4.9 MiB grc.traineddata
guj Gujarati 10.1 MiB guj.traineddata
hat Haitian; Haitian Creole 1.3 MiB hat.traineddata
heb Hebrew 4.1 MiB heb.traineddata
hin Hindi 67.4 KiB hin.cube.bigrams
hin Hindi 1.0 B hin.cube.fold
hin Hindi 211.0 B hin.cube.lm
hin Hindi 6.9 MiB hin.cube.nn
hin Hindi 262.0 B hin.cube.params
hin Hindi 1.2 MiB hin.cube.word-freq
hin Hindi 660.0 B hin.tesseract_cube.nn
hin Hindi 13.5 MiB hin.traineddata
hrv Croatian 8.7 MiB hrv.traineddata
hun Hungarian 11.6 MiB hun.traineddata
iku Inuktitut 971.9 KiB iku.traineddata
ind Indonesian 6.2 MiB ind.traineddata
isl Icelandic 5.8 MiB isl.traineddata
ita Italian 119.8 KiB ita.cube.bigrams
ita Italian 51.0 B ita.cube.fold
ita Italian 257.0 B ita.cube.lm
ita Italian 872.1 KiB ita.cube.nn
ita Italian 314.0 B ita.cube.params
ita Italian 13.3 MiB ita.cube.size
ita Italian 3.4 MiB ita.cube.word-freq
ita Italian 660.0 B ita.tesseract_cube.nn
ita Italian 13.6 MiB ita.traineddata
ita_old Italian – Old 13.4 MiB ita_old.traineddata
jav Javanese 4.2 MiB jav.traineddata
jpn Japanese 31.5 MiB jpn.traineddata
kan Kannada 34.0 MiB kan.traineddata
kat Georgian 5.9 MiB kat.traineddata
kat_old Georgian – Old 643.9 KiB kat_old.traineddata
kaz Kazakh 4.3 MiB kaz.traineddata
khm Central Khmer 46.6 MiB khm.traineddata
kir Kirghiz; Kyrgyz 5.2 MiB kir.traineddata
kor Korean 12.7 MiB kor.traineddata
kur Kurdish 1.9 MiB kur.traineddata
lao Lao 20.1 MiB lao.traineddata
lat Latin 5.7 MiB lat.traineddata
lav Latvian 7.4 MiB lav.traineddata
lit Lithuanian 8.5 MiB lit.traineddata
mal Malayalam 8.4 MiB mal.traineddata
mar Marathi 13.6 MiB mar.traineddata
mkd Macedonian 3.7 MiB mkd.traineddata
mlt Maltese 4.9 MiB mlt.traineddata
msa Malay 6.2 MiB msa.traineddata
mya Burmese 66.5 MiB mya.traineddata
nep Nepali 15.1 MiB nep.traineddata
nld Dutch; Flemish 16.3 MiB nld.traineddata
nor Norwegian 7.9 MiB nor.traineddata
ori Oriya 7.5 MiB ori.traineddata
osd Orientation and script detection module 10.1 MiB osd.traineddata
pan Panjabi; Punjabi 9.7 MiB pan.traineddata
pol Polish 13.3 MiB pol.traineddata
por Portuguese 12.3 MiB por.traineddata
pus Pushto; Pashto 2.4 MiB pus.traineddata
ron Romanian; Moldavian; Moldovan 7.6 MiB ron.traineddata
rus Russian 139.0 B rus.cube.fold
rus Russian 278.0 B rus.cube.lm
rus Russian 891.4 KiB rus.cube.nn
rus Russian 317.0 B rus.cube.params
rus Russian 14.5 MiB rus.cube.size
rus Russian 6.7 MiB rus.cube.word-freq
rus Russian 15.4 MiB rus.traineddata
san Sanskrit 21.7 MiB san.traineddata
sin Sinhala; Sinhalese 6.5 MiB sin.traineddata
slk Slovak 8.7 MiB slk.traineddata
slk_frak Slovak – Fraktur 825.4 KiB slk_frak.traineddata
slv Slovenian 6.5 MiB slv.traineddata
spa Spanish; Castilian 128.9 KiB spa.cube.bigrams
spa Spanish; Castilian 76.0 B spa.cube.fold
spa Spanish; Castilian 248.0 B spa.cube.lm
spa Spanish; Castilian 887.5 KiB spa.cube.nn
spa Spanish; Castilian 243.0 B spa.cube.params
spa Spanish; Castilian 18.1 MiB spa.cube.size
spa Spanish; Castilian 3.1 MiB spa.cube.word-freq
spa Spanish; Castilian 15.2 MiB spa.traineddata
spa_old Spanish; Castilian – Old 16.0 MiB spa_old.traineddata
sqi Albanian 6.3 MiB sqi.traineddata
srp Serbian 4.4 MiB srp.traineddata
srp_latn Serbian – Latin 5.8 MiB srp_latn.traineddata
swa Swahili 3.7 MiB swa.traineddata
swe Swedish 9.0 MiB swe.traineddata
syr Syriac 2.6 MiB syr.traineddata
tam Tamil 4.9 MiB tam.traineddata
tel Telugu 37.5 MiB tel.traineddata
tgk Tajik 1.1 MiB tgk.traineddata
tgl Tagalog 3.9 MiB tgl.traineddata
tha Thai 12.9 MiB tha.traineddata
tir Tigrinya 1.7 MiB tir.traineddata
tur Turkish 13.4 MiB tur.traineddata
uig Uighur; Uyghur 1.9 MiB uig.traineddata
ukr Ukrainian 7.7 MiB ukr.traineddata
urd Urdu 4.6 MiB urd.traineddata
uzb Uzbek 4.1 MiB uzb.traineddata
uzb_cyrl Uzbek – Cyrillic 3.2 MiB uzb_cyrl.traineddata
vie Vietnamese 5.8 MiB vie.traineddata
yid Yiddish 4.0 MiB yid.traineddata
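
To estimate how much you would download for a chosen set of languages, the rows of the list above can be parsed mechanically. The following is only a sketch: it assumes each data row has the form "code name size unit file" exactly as printed here, and the names parse_row and total_size are made up for this example.

```python
import re

# Multipliers for the binary units used in the list.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}

# Columns: code, name (may contain spaces), size, unit, file name.
ROW = re.compile(r"^(\S+)\s+(.+?)\s+(\d+(?:\.\d+)?)\s+(B|KiB|MiB|GiB)\s+(\S+)$")


def parse_row(line):
    """Split one data row into (code, name, size_in_bytes, filename)."""
    match = ROW.match(line.strip())
    if match is None:
        # The header row and blank lines end up here.
        raise ValueError("not a data row: %r" % (line,))
    code, name, size, unit, filename = match.groups()
    return code, name, float(size) * UNITS[unit], filename


def total_size(lines, codes):
    """Sum the sizes in bytes of all files whose lang code is in codes."""
    wanted = set(codes)
    rows = (parse_row(line) for line in lines)
    return sum(size for code, _, size, _ in rows if code in wanted)
```

Passing only the data rows (not the header line) and a set such as {"fra", "deu"} gives the combined size of every file listed for those codes, cube files included.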

