Re: Extracting files from .tessdata

2010-04-29 Thread Zdenko Podobný
Hi Ramon, I do not have the source files for the DAWG dictionaries and I am not able to "decompile" them. Anyway, I think creating dictionaries is the easiest part of Tesseract training: according to the wiki[1], the input is a simple UTF-8 file with one word per line. This file is split into several files: * lang.pu
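
As a rough illustration of the "one word per line, UTF-8" input described above, here is a minimal Python sketch. The function name write_word_list and the file name words.txt are hypothetical, chosen only for this example; the sketch covers preparing the plain word list only, not the later split into separate dictionary files or any Tesseract tooling.

```python
from pathlib import Path


def write_word_list(words, path):
    """Write unique, non-empty words to `path`, one per line, encoded as UTF-8.

    Hypothetical helper for illustration; it is not part of Tesseract.
    """
    seen = set()
    cleaned = []
    for word in words:
        word = word.strip()
        # Skip blank entries and duplicates, keeping first-seen order.
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    # One word per line, with a trailing newline after each word.
    Path(path).write_text("".join(w + "\n" for w in cleaned), encoding="utf-8")
    return cleaned


if __name__ == "__main__":
    # Example input; a real word list would come from a corpus or lexicon.
    write_word_list(["casa", " gat ", "", "casa", "això"], "words.txt")
```

Deduplicating and stripping whitespace is a design choice for this sketch, assuming a clean list is what the training tools expect.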

Re: Tesseract 3.0 without page layout analysis?

2010-04-29 Thread Zdenko Podobný
Hi Patrick, Have you seen it work in practice (i.e. does it produce different output for different "Page seg mode" values)? I tried several options but got the same output. I used a scan of a 4-column magazine page as the input file. Maybe I did something wrong, or maybe I do not understand what the result should be.

Re: not words

2010-04-29 Thread namenick
that was awesome. thanks... On Apr 15, 5:10 pm, namenick wrote: > hi all... > > is there a way to instruct tesseract to ignore anything that it is not > trained to read, like the lines around the date and time in this > image: http://quereven.com/images/moo_time.jpg > > the simple reason is that wi

Re: Extracting files from .tessdata

2010-04-29 Thread Ramon
Thanks for your quick answer, Zdenko. As you pointed out, I'm already using the tif / box pair from the Spanish language to train my Catalan .traineddata language (as the Spanish characters suit Catalan too). But doing just this (with no words in the dictionary files), the dictionary is not very good. I t