Hi Ramon,

I do not have the source files for the dawg dictionaries, and I am not able to "decompile" them. Anyway, I think creating dictionaries is the easiest part of tesseract training: according to the wiki [1], the input is a simple UTF-8 file with one word per line. This file is split into several files:
* lang.punc -> words with punctuation patterns
* lang.number -> words with number patterns
* lang.freq -> frequent words
* lang.word -> the rest of the words

I believe you can get a list of words from other open-source projects (e.g. spellcheckers, dictionary projects such as apertium.org, or search for a free Catalan corpus; do not forget to check the license of the data first!), or you can create one from Wikipedia [2].

dawg files are easy to create (a big input file can make this command run for a long time!):

$ wordlist2dawg [-t] word_list_file dawg_file unicharset_file

e.g.

wordlist2dawg lang.punc lang.punc-dawg lang.unicharset

This command is valid for tesseract 3.00. wordlist2dawg in tesseract 2.04 does not use unicharset_file as input.

I hope there will be more details soon on http://www.sk-spell.sk.cx/tesseract-ocr-en.

[1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
[2] http://wiki.apertium.org/wiki/Building_dictionaries

Zdenko

On 29.04.2010 09:30, Ramon wrote:
> Hi, thanks for your quick answer, Zdenko.
>
> As you pointed out, I'm already using tif/box pairs from the Spanish
> language to train my Catalan .traineddata language (as Spanish
> characters suit Catalan characters too).
>
> But doing just this (with no words in the dictionary files), the
> dictionary is not quite good. I think the difference comes from the
> words used in those dictionaries. So I'm asking for those UTF-8 files...
>
> I don't know if you (or a developer) can provide them.
>
> Thanks.
>
> Ramon.
>
> On 28 Apr, 15:55, zdenko podobny <zde...@gmail.com> wrote:
>
>> Hello Ramon,
>>
>> for extending an existing language you need "Tif/Box pairs"; see
>> http://code.google.com/p/tesseract-ocr/wiki/FAQ and there "How do I add
>> just one character or one font to my favourite language, without having
>> to retrain from scratch?"
>>
>> Unfortunately, tif/box pairs are provided only for the eng, deu, fra,
>> ita, nld and spa languages... So you can wait until somebody someday
>> releases tif/box pairs for your language, or you can start training
>> from scratch. I chose the second option, and this is the reason why I
>> started testing the training process for tesseract 3.00.
>>
>> BR,
>>
>> Zdenko
>>
>> On Mon, Apr 26, 2010 at 11:06 AM, Ramon <rsal...@gmail.com> wrote:
>>
>>> Hi,
>>> After some tests I realized the best option for me is to put effort
>>> into extending the Catalan dictionary which is in the svn repository (v3).
>>> It would be very useful if you could do one of these:
>>>
>>> -> deliver the different files combined to create the cat.traineddata
>>> unified file (the UTF-8 files used to generate the dawg would also be
>>> amazing!).
>>> -> show how to extract these files from cat.traineddata and how to
>>> dawg2utf8 (if it is possible).
>>>
>>> THANKS!
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups
>>> "tesseract-ocr" group.
>>> To post to this group, send email to tesseract-...@googlegroups.com.
>>> To unsubscribe from this group, send email to
>>> tesseract-ocr+unsubscr...@googlegroups.com.
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en.
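
Regarding the word-list step at the top of this message: the following is only a rough sketch of how a one-word-per-line list might be built from plain text (for example text taken from Wikipedia). The function names, the regular expression used to find words, and the top_n cut-off between "frequent" and "other" words are all assumptions made for illustration; none of them come from tesseract or its wiki.

```python
import re
from collections import Counter

# Unicode letters only; an apostrophe or the Catalan middle dot may join
# two letter runs (e.g. "col·legi"). This tokenisation is an assumption.
WORD_RE = re.compile(r"[^\W\d_]+(?:['·][^\W\d_]+)*")


def split_word_lists(text, top_n=100):
    """Return (frequent_words, other_words) extracted from raw text.

    top_n is an arbitrary illustrative cut-off, not a tesseract setting.
    """
    counts = Counter(w.lower() for w in WORD_RE.findall(text))
    frequent = [w for w, _ in counts.most_common(top_n)]
    frequent_set = set(frequent)
    others = sorted(w for w in counts if w not in frequent_set)
    return frequent, others


def write_word_list(path, words):
    """Write one word per line as UTF-8."""
    with open(path, "w", encoding="utf-8") as fh:
        for word in words:
            fh.write(word + "\n")
```

The two lists could then be written out with write_word_list (for instance as lang.freq and lang.word) and each passed to wordlist2dawg as in the command quoted earlier in this message.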