Hi Ramon,

I do not have the source files for the dawg dictionaries, and I am not able
to "decompile" them. Anyway, I think creating the dictionaries is the easiest
part of tesseract training: according to the wiki[1], the input is a simple
UTF-8 file with one word per line. This file is split into several files:

    * lang.punc    -> words with punctuation patterns
    * lang.number    -> words with number patterns
    * lang.freq    -> frequent words
    * lang.word    -> rest of the words
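
A rough sketch of such a split in Python. The classification rules used here
(any digit -> number list, any Unicode punctuation -> punc list, the most
repeated plain words -> freq list) and the freq_top cut-off are assumptions
for illustration only, not rules taken from the tesseract tools; check the
wiki[1] before relying on them.

```python
import unicodedata
from collections import Counter
from pathlib import Path


def classify(word):
    """Return 'number', 'punc' or 'plain' for one word (assumed rules)."""
    if any(ch.isdigit() for ch in word):
        return "number"
    if any(unicodedata.category(ch).startswith("P") for ch in word):
        return "punc"
    return "plain"


def split_wordlist(words, freq_top=100):
    """Split an iterable of words (repeats allowed) into four lists.

    freq_top is a hypothetical cut-off: the freq_top most repeated plain
    words go to 'freq', the remaining plain words go to 'word'.
    """
    counts = Counter(w.strip() for w in words if w.strip())
    lists = {"punc": [], "number": [], "freq": [], "word": []}
    plain = []
    # most_common() orders by count; ties keep first-seen order.
    for word, _count in counts.most_common():
        kind = classify(word)
        if kind == "plain":
            plain.append(word)
        else:
            lists[kind].append(word)
    lists["freq"] = plain[:freq_top]
    lists["word"] = plain[freq_top:]
    return lists


def write_lists(lists, lang, directory="."):
    """Write lang.punc, lang.number, lang.freq, lang.word (one word per line)."""
    for suffix, words in lists.items():
        path = Path(directory) / f"{lang}.{suffix}"
        path.write_text("".join(w + "\n" for w in words), encoding="utf-8")
```

Note that Catalan words such as "col·legi" contain "·", which Unicode classes
as punctuation, so the punc rule above would probably need adjusting for cat.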

I believe you can get a list of words from other open-source projects (e.g.
spellcheckers, dictionary projects such as apertium.org, or search for a free
Catalan corpus; do not forget to check the license of the data first!), or
you can create one from Wikipedia[2].
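
If you already have plain text (for example text extracted from a Wikipedia
dump; the extraction itself is not shown), a minimal sketch for turning it
into a one-word-per-line UTF-8 list could look like this. The function names,
the letters-only word pattern and the min_count threshold are my assumptions,
not part of any tesseract tool.

```python
import re
from collections import Counter

# Runs of Unicode letters only: no digits, underscores or punctuation.
# This also splits on apostrophes, so "l'home" becomes "l" and "home".
WORD_RE = re.compile(r"[^\W\d_]+")


def word_counts(text):
    """Count lower-cased words found in a plain-text string."""
    return Counter(w.lower() for w in WORD_RE.findall(text))


def write_wordlist(counts, path, min_count=1):
    """Write words seen at least min_count times, most frequent first."""
    with open(path, "w", encoding="utf-8") as fh:
        for word, n in counts.most_common():
            if n >= min_count:
                fh.write(word + "\n")
```

Raising min_count is a cheap way to drop typos and rare foreign words that a
raw corpus always contains.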

dawg files are easy to create (a big input file can make this command run
for a long time!):

    $ wordlist2dawg [-t] word_list_file dawg_file unicharset_file


e.g. wordlist2dawg lang.punc lang.punc-dawg lang.unicharset

This command is valid for tesseract 3.00. wordlist2dawg in tesseract
2.04 does not take unicharset_file as input.
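
To run it for all four lists, a small wrapper like the following sketch may
help. The "-dawg" output names follow the lang.punc-dawg example above; the
helper names are hypothetical, and the 2.04 form is assumed to be the same
command simply without the unicharset argument.

```python
import subprocess

SUFFIXES = ("punc", "number", "freq", "word")


def dawg_commands(lang, with_unicharset=True):
    """Build one wordlist2dawg argument list per word-list file.

    with_unicharset=True gives the tesseract 3.00 form; False drops the
    unicharset argument (assumed form for tesseract 2.04).
    """
    commands = []
    for suffix in SUFFIXES:
        cmd = ["wordlist2dawg", f"{lang}.{suffix}", f"{lang}.{suffix}-dawg"]
        if with_unicharset:
            cmd.append(f"{lang}.unicharset")
        commands.append(cmd)
    return commands


def run_all(lang, with_unicharset=True):
    """Run the commands one after another; stop at the first failure."""
    for cmd in dawg_commands(lang, with_unicharset):
        subprocess.run(cmd, check=True)
```

Calling dawg_commands("cat") only builds the argument lists, so you can print
and check them before run_all("cat") actually starts the (possibly long) runs.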

I hope there will be more details soon on
http://www.sk-spell.sk.cx/tesseract-ocr-en.

[1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
[2] http://wiki.apertium.org/wiki/Building_dictionaries

Zdenko

On 29.04.2010 09:30, Ramon wrote:
> Thanks for your quick answer, Zdenko.
>
> As you pointed out, I'm already using the tif/box pairs from the Spanish
> language to train my Catalan .traineddata language (as the Spanish
> characters suit Catalan too).
>
> But doing just this (with no words in the dictionary files), the dictionary
> is not very good. I think the difference comes from the words used in
> those dictionaries. So I'm asking for those utf8 files...
>
> I don't know whether you (or a developer) can provide them.
>
> Thanks.
>
> Ramon.
>
>
>
>
> On 28 Apr, 15:55, zdenko podobny <zde...@gmail.com> wrote:
>   
>> Hello Ramon,
>>
>> To extend an existing language you need "tif/box pairs";
>> see http://code.google.com/p/tesseract-ocr/wiki/FAQ and there "How do I add
>> just one character or one font to my favourite language, without having to
>> retrain from scratch?"
>>
>> Unfortunately, tif/box pairs are provided only for the eng, deu, fra, ita,
>> nld and spa languages... So you can either wait until somebody releases
>> tif/box pairs for your language, or start training from scratch. I
>> chose the second option, and this is the reason why I started testing the
>> training process for tesseract 3.00.
>>
>> BR,
>>
>> Zdenko
>>
>>
>>
>>
>>
>> On Mon, Apr 26, 2010 at 11:06 AM, Ramon <rsal...@gmail.com> wrote:
>>     
>>> Hi,
>>> After some tests I realized the best option for me is to put effort into
>>> extending the Catalan dictionary which is in the svn repository (v3).
>>> It would be very useful if you could do one of these:
>>>
>>> -> deliver the separate files that are combined to create the unified
>>> cat.traineddata file (the utf8 files used to generate the dawgs would
>>> also be amazing!).
>>> -> show how to extract these files from cat.traineddata and how to
>>> "dawg2utf8" (if it is possible).
>>>
>>> THANKS!
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups
>>> "tesseract-ocr" group.
>>> To post to this group, send email to tesseract-...@googlegroups.com.
>>> To unsubscribe from this group, send email to
>>> tesseract-ocr+unsubscr...@googlegroups.com.
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>       
