Re: Extracting files from .tessdata

2010-05-22 Thread Zdenko Podobný
Hello Ramon, tesseract-ocr is developed by Google (see http://groups.google.com/group/tesseract-ocr/msg/7408c699e27db341). I hope that once all or some of the issues are solved, the final version of tesseract-ocr 3.00 will be released, including the tif+box files... Zd. On 20.05.2010 10:53, Ramon wrote:

Re: Extracting files from .tessdata

2010-05-21 Thread Jimmy O'Regan
On 20 May 2010, at 09:53, Ramon wrote: Hi Zdenko, After some tests, I realized I need the tiff/box pairs that the creators used to generate the Catalan tessdata file. Do you know a way to contact them? That might be difficult. As you said before, you might be able to reuse the Spanish fil

Re: Extracting files from .tessdata

2010-05-21 Thread Ramon
Hi Zdenko, After some tests, I realized I need the tiff/box pairs that the creators used to generate the Catalan tessdata file. Do you know a way to contact them? Ramon. On 29 Apr, 23:49, Zdenko Podobný wrote: > Hi Ramon, > > I do not have source files for dawg dictionaries and I am not abl

Re: Extracting files from .tessdata

2010-04-29 Thread Zdenko Podobný
Hi Ramon, I do not have the source files for the dawg dictionaries and I am not able to "decompile" them. Anyway, I think creating dictionaries is the easiest part of tesseract training: based on the wiki[1], the input is a simple UTF-8 file with one word per line. This file is split into several files: * lang.pu
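
The word-list input described above (a UTF-8 file with one word per line) could be prepared with a short script. This is only a sketch: the function names, the sample words and the output file name are made up for illustration and are not part of tesseract, and it does not cover the later split into separate files or the conversion to dawg.

```python
from pathlib import Path
import tempfile


def normalize_words(words):
    """Strip whitespace, drop empty entries and duplicates, and sort."""
    cleaned = {w.strip() for w in words if w.strip()}
    return sorted(cleaned)


def write_wordlist(path, words):
    """Write the normalized words to `path`, one word per line, as UTF-8."""
    text = "\n".join(normalize_words(words)) + "\n"
    Path(path).write_text(text, encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical sample words; a real list would come from a Catalan corpus.
    sample = ["casa", "això", " gat ", "casa", ""]
    out = Path(tempfile.gettempdir()) / "cat_words_example.txt"
    write_wordlist(out, sample)
    print(out.read_text(encoding="utf-8"))
```

Removing duplicates and blank lines up front keeps the file in the "one word per line" shape the wiki asks for; whether the training tools need the list sorted is not stated in this thread, so the sorting here is just for readability.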

Re: Extracting files from .tessdata

2010-04-29 Thread Ramon
Hi, thanks for your quick answer, Zdenko. As you pointed out, I'm already using the tif/box pairs from the Spanish language to train my Catalan .traineddata language (as Spanish characters suit Catalan characters too). But doing just this (with no words in the dictionary files), the dictionary is not very good. I t

Re: Extracting files from .tessdata

2010-04-28 Thread zdenko podobny
Hello Ramon, to extend an existing language you need "Tif/Box pairs"; see http://code.google.com/p/tesseract-ocr/wiki/FAQ, in particular "How do I add just one character or one font to my favourite language, without having to retrain from scratch?" Unfortunately, tif/box pairs are provided only for eng

Extracting files from .tessdata

2010-04-28 Thread Ramon
Hi, After some tests I realized the best option for me is to put effort into extending the Catalan dictionary, which is in the svn repository (v3). It would be very useful if you could do one of these: -> deliver the different files that were combined to create the unified cat.traineddata file (the utf8 files used to generate t