from:"Zdenko Podobný"

Re: Disable Special characters?

2010-04-21 Thread Zdenko Podobný

Hello,

this may happen if tesseract is not installed in a standard location and
the environment variable is not set (export TESSDATA_PREFIX="directory in
which your tessdata resides/"), as mentioned in
http://code.google.com/p/tesseract-ocr/wiki/ReleaseNotes.

On Linux I have tesseract 2.04 installed in /usr and tesseract 3.00 in
/usr/local/, and I do not need to specify the path to config files if I put
them in the expected place (tesseract expects config files in
/usr/share/tessdata/configs/ for 2.04 and in
/usr/local/share/tessdata/configs/ for 3.00).
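
A minimal sketch of the setup described above, assuming a non-standard install prefix. The PREFIX variable and the write_digits_config helper are names made up for this illustration; only TESSDATA_PREFIX, the tessdata/configs/ layout and the whitelist line come from this thread.

```shell
#!/bin/sh
# Sketch: point tesseract at a non-standard install and use a config by name.
# PREFIX and write_digits_config are hypothetical names for illustration.
PREFIX="${PREFIX:-/usr/local/share}"

# Create <dir>/tessdata/configs/digits with a digits-only whitelist.
write_digits_config() {
    # $1 = directory that contains tessdata/
    mkdir -p "$1/tessdata/configs" || return 1
    printf 'tessedit_char_whitelist 0123456789\n' > "$1/tessdata/configs/digits"
}

# Typical use (requires tesseract and write access to PREFIX):
#   export TESSDATA_PREFIX="$PREFIX/"      # note the trailing slash
#   write_digits_config "$PREFIX"
#   tesseract nine.tif out nobatch digits  # configs given by name, no path
```

If tesseract still cannot find the config by name, passing the full path as in the quoted message below remains the fallback.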

Zd.

On 20.04.2010 10:08, Neil Benn wrote:
> Hello,
>
> The main wiki page says that you do not need to specify the path to
> the conf files but if you scroll down to the comments then someone has added
> in that you do (thanks to that person!).  I'm running on Linux and I do need
> to specify the full path to the config files rather than assume they are in
> tessdata/config.
>
> Cheers,
>
> Neil
>
> On 19 April 2010 08:07, MARTIN Pierre  wrote:
>
>   
>>> if I correctly understood "Comment by ffournel, Mar 30, 2010" on
>>> http://code.google.com/p/tesseract-ocr/wiki/FAQ we can achieve the same
>>> behavior by creating a config file (e.g. digits in directory
>>> tessdata/configs/) with the line:
>>>
>>> tessedit_char_whitelist 0123456789
>>>
>>> and then running:
>>>
>>> C:>tesseract.exe nine.tif out tessdata/configs/nobatch tessdata/configs/digits
>>
>> Exactly, but I think you don't have to specify the tessdata/configs paths
>> each time for each conf.
>>
>>  --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To post to this group, send email to tesseract-...@googlegroups.com.
>> To unsubscribe from this group, send email to
>> tesseract-ocr+unsubscr...@googlegroups.com
>> .
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>
>> 
>
>
>   


smime.p7s
Description: S/MIME Cryptographic Signature

Re: Training for Swedish, Danish, Norwegian, old spelling, fraktur

2010-04-23 Thread Zdenko Podobný

Hello,

please read the wiki pages at http://code.google.com/p/tesseract-ocr/wiki,
especially http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract,
which describes the training process for tesseract 2.04.

In svn (http://code.google.com/p/tesseract-ocr/source/checkout) there is
already a (pre?)release of version 3.00 with language data for your
language as well (see
http://code.google.com/p/tesseract-ocr/source/browse/#svn/trunk/tessdata%3Fstate%3Dclosed).

Based on some remarks on the wiki pages, the training process for 3.00
should be different; see also the postings in this forum. There is no
information on when 3.00 will be released.

Zd.

On 23.04.2010 16:28, Lars Aronsson wrote:
> I'm the founder of Project Runeberg, the Scandinavian
> volunteer book scanning project, http://runeberg.org/
> where we have mainly been using Abbyy Finereader,
> with subsequent manual, online proofreading.
> I'm also involved in Wikisource, the book scanning
> and proofreading project of the Wikimedia Foundation.
>
> Is anybody training Tesseract to read Swedish and
> other Scandinavian languages? Is there a tutorial
> for how to train new languages in Tesseract?
>
> I'm running Ubuntu Linux 9.10. The included package
> for Tesseract 2.03 contains man pages that are next
> to useless. There seem to be some programs: mftraining,
> cntraining, unicharset_extractor, but they talk about
> "box files" and I have no clue what these are.
>
> In Project Runeberg, we already have 186,000 pages
> that are fully proofread, mostly in Swedish and
> Danish, in various fonts and from different years,
> meaning different spelling standards. Could these
> be used for training Tesseract? How do I start?
>
>



Re: TRAINING ... Font name = UnknownFont.

2010-04-24 Thread Zdenko Podobný


On 19.04.2010 09:05, MARTIN Pierre wrote:
> Hello Zdpo,
>
> As said in my mail on 13th of April, as an answer to Sriranga:
>
>   
>>> I am extremely thankful  for the attachment. I could not understand "OCRB 
>>> font" - which I don't have. It is presumed any fonts can do/be used ?
>>>   
>> Exactly. Basically, you'll have to create your custom language, which will 
>> contain a certain number of fonts. Each font can be trained with 
>> multiple pictures. That's why the file names for the boxes are decomposed 
>> this way: xxx.FFF.ppp.box (xxx=language, FFF=font, ppp=page if you have 
>> multiple training pictures per font); this way the files are better organised.
>> 
> As you can see, the names of the input files when training Tesseract 
> (especially the .tr files) determine the font names.
>
> This is visible in the source code too: if you search for 
> "CurrentFont" in the whole source code, you'll see what I mean.
>
> Pierre.
>
>   
When I ran tests on Linux, tesseract crashed... I tried to understand
the source code (plus some work with a debugger ;-) ) and I think
there is a bug (or at least the code does not handle possible inputs
correctly). My experience (and a patch for my problems) can be found at
http://www.sk-spell.sk.cx/tesseract-ocr-en-language-training-300...

Zdenko




Re: Tesseract 3.0 without page layout analysis?

2010-04-29 Thread Zdenko Podobný

Hi Patrick,

Do you have experience showing that it works (e.g. that it produces
different output for different "Page seg mode" values)?

I tried several options but always got the same output. I used a scan of
a 4-column magazine page as the input file.
Maybe I did something wrong, or maybe I do not understand what the result
should be...

I created a new config file (/usr/local/share/tessdata/tessconfigs/PSM)
with the line:

tessedit_pageseg_mode 3

and then I ran:
$ /usr/local/bin/tesseract multicolumn.tif ouput_3 PSM

tesseract accepted the config file. (If I replace "3" with
"PSM_SINGLE_LINE", tesseract complains "variable not found:
tessedit_pageseg_mode"; the value must be a number, as explained in
ccmain/tesseractclass.cpp:
"Page seg mode: 0=auto, 1=col, 2=block, 3=line, 4=word, 6=char".)

When I add another line to the config file:

tessedit_dump_pageseg_images true

it produces images identical to the input image even when I use different
"Page seg mode" values (I expected it to create different images for
different modes).
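
The test described above could be repeated for every mode with a small script like the following. This is only a sketch: CONFIG_DIR, IMAGE, the psm_N config names and both function names are assumptions for illustration, and the numeric values are the ones quoted from ccmain/tesseractclass.cpp.

```shell
#!/bin/sh
# Sketch: one config file per page segmentation mode, then compare outputs.
# CONFIG_DIR, IMAGE and the psm_N file names are hypothetical.
CONFIG_DIR="${CONFIG_DIR:-/usr/local/share/tessdata/tessconfigs}"
IMAGE="${IMAGE:-multicolumn.tif}"
MODES="0 1 2 3 4 6"   # numeric values quoted from ccmain/tesseractclass.cpp

# Write a config file that only sets tessedit_pageseg_mode.
write_psm_config() {
    # $1 = directory, $2 = numeric mode
    printf 'tessedit_pageseg_mode %s\n' "$2" > "$1/psm_$2"
}

# Run tesseract once per mode and report which outputs differ from mode 0.
# Not called automatically; needs tesseract and write access to CONFIG_DIR.
compare_modes() {
    for mode in $MODES; do
        write_psm_config "$CONFIG_DIR" "$mode" || return 1
        tesseract "$IMAGE" "output_$mode" "psm_$mode" || return 1
    done
    for mode in $MODES; do
        if cmp -s output_0.txt "output_$mode.txt"; then
            echo "mode $mode: same as mode 0"
        else
            echo "mode $mode: differs from mode 0"
        fi
    done
}
```

If every mode reports "same as mode 0", the setting is probably not reaching the layout analysis at all, which would match what I saw.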

Zd.

On 28.04.2010 17:08, patrickq wrote:
> Hi all,
>
> It stands to reason that you can achieve what you want by setting the
> segmentation mode. This is how we use that setting:
>
>     myTess->SetPageSegMode(tesseract::PSM_AUTO);
>
> We use PSM_AUTO in our iPhone app (ScanBizCards) but for small images
> perhaps using another mode will achieve what you need. Here is the
> list of options:
>
>     PSM_AUTO,           // Fully automatic page segmentation.
>     PSM_SINGLE_COLUMN,  // Assume a single column of text of variable sizes.
>     PSM_SINGLE_BLOCK,   // Assume a single uniform block of text. (Default.)
>     PSM_SINGLE_LINE,    // Treat the image as a single text line.
>     PSM_SINGLE_WORD,    // Treat the image as a single word.
>     PSM_SINGLE_CHAR,    // Treat the image as a single character.
>
> Patrick
>
> On Apr 28, 9:56 am, zdenko podobny wrote:
>> If you find out how to turn it off, please share this info ;-)
>>
>> Zd.
>>
>> On Sun, Apr 25, 2010 at 5:43 PM, Jan wrote:
>>> Thanks for the info, then I will try to change it in
>>> tesseractmain.cpp.
>>>
>>> Jan
>>>
>>> On 23 Apr., 09:38, zdenko podobny wrote:
>>>> Hello,
>>>>
>>>> http://code.google.com/p/tesseract-ocr/wiki/ReadMe, section Installation
>>>> Notes - 3.00 Prerelease:
>>>> In the executable, page layout analysis is enabled by default. You may need
>>>> to turn it off to process small images. No command-line control for this
>>>> yet. Sorry. See tesseractmain.cpp.
>>>>
>>>> Zd.
>>>>
>>>> On Wed, Apr 21, 2010 at 10:08 AM, Jan wrote:
>>>>> Hallo,
>>>>> is it possible to use tesseract 3.0 without page layout analysis, or
>>>>> in one column mode?
>>>>> Especially using the tesseract.exe?
>>>>> Thanks!!



Re: Extracting files from .tessdata

2010-04-29 Thread Zdenko Podobný

Hi Ramon,

I do not have the source files for the dawg dictionaries and I am not able
to "decompile" them. Anyway, I think creating dictionaries is the easiest
part of tesseract training: according to the wiki[1], the input is a simple
UTF-8 file with one word per line. This file is split into several files:

* lang.punc   -> words with punctuation patterns
* lang.number -> words with number patterns
* lang.freq   -> frequent words
* lang.word   -> rest of the words

I believe you can get a list of words from other open source projects (e.g.
spellcheckers, dictionary projects such as apertium.org, or search for a free
Catalan corpus; do not forget to check the license of the data first!) or
you can create it from Wikipedia[2].

dawg files are easy to create (a big input file can make this command run
for a long time!):

$ wordlist2dawg [-t] word_list_file dawg_file unicharset_file

e.g. wordlist2dawg lang.punc lang.punc-dawg lang.unicharset

This command is valid for tesseract 3.00. wordlist2dawg in tesseract
2.04 does not take unicharset_file as input.
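
As a sketch of how the four word lists above could be converted in one go with the 3.00 syntax: the build_dawgs function and the DRY_RUN switch are hypothetical names for this illustration, and with DRY_RUN=1 the commands are only printed, not run.

```shell
#!/bin/sh
# Sketch: one wordlist2dawg call per word list, tesseract 3.00 syntax.
# build_dawgs and DRY_RUN are hypothetical names for illustration.
DRY_RUN="${DRY_RUN:-1}"

build_dawgs() {
    # $1 = language code, e.g. "cat"; run in the directory with the lists
    for kind in punc number freq word; do
        list="$1.$kind"
        if [ ! -f "$list" ]; then
            echo "skipping $list (not found)"
            continue
        fi
        if [ "$DRY_RUN" = 1 ]; then
            # Only show the command that would be run.
            echo "wordlist2dawg $list $list-dawg $1.unicharset"
        else
            wordlist2dawg "$list" "$list-dawg" "$1.unicharset" || return 1
        fi
    done
}
```

Calling "build_dawgs cat" in the training directory prints (or runs) the command for each list that exists and skips the others.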

I hope there will be more details soon on
http://www.sk-spell.sk.cx/tesseract-ocr-en.

[1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
[2] http://wiki.apertium.org/wiki/Building_dictionaries

Zdenko

On 29.04.2010 09:30, Ramon wrote:
> Thanks for your quick answer, Zdenko.
>
> As you pointed out, I'm already using tif / box pair from spanish
> language to train my catalan .traineddata language. (As spanish
> characters suits catalan characters too).
>
> But doing just this (with no words in dictionary files) the dictionary
> is not quite good. I think the difference is from the words used in
> those dictionaries. So I'm asking for that utf8 files...
>
> Don't know if you (or a developer) can provide them.
>
> Thanks.
>
> Ramon.
>
>
>
>
> On 28 Abr, 15:55, zdenko podobny  wrote:
>   
>> Hello Ramon,
>>
>> for extending an existing language you need "Tif/Box pairs"; see
>> http://code.google.com/p/tesseract-ocr/wiki/FAQ and there "How do I add
>> just one character or one font to my favourite language, without having to
>> retrain from scratch?"
>>
>> Unfortunately tif/box pairs are provided only for eng, deu, fra, ita, nld
>> and spa languages... So you can wait that somebody will someday release
>> tif/box pairs for your language or you will start training from scratch. I
>> choose second option and this is reason why I started with testing of
>> training process for  tesseract 3.00.
>>
>> BR,
>>
>> Zdenko
>>
>>
>>
>>
>>
>> On Mon, Apr 26, 2010 at 11:06 AM, Ramon  wrote:
>> 
>>> Hi,
>>> After some tests I realized the best for me is to put effort to extend
>>> the Catalan Diccionari which is in svn repository (v3).
>>> It will be so useful if you can do one of these:
>>>   
>> 
>>> -> deliver the different files combined to create the cat.traineddata
>>> unified file. (the utf8 files used to generate the dawg would be also
>>> amazing!).
>>> -> show how to extract these files from the cat.traineddata and how to
>>> dawg2utf8 (if it is possible).
>>>   
>> 
>>> THANKS!
>>>   
>> 

Re: Benefit of the dictionary

2010-05-01 Thread Zdenko Podobný

Hi,

the version of the ambigs file for tesseract 3.00 is 'v1' (which means
that a later release can bring another version of the ambigs file
with a new format/features). In ccutil/ambigs.cpp I found only one test
of the version (>0): it is connected to the meaning of the last column...

If you want to play with or create your own lang.unicharambigs, it may be
useful to set these variables:

global_ambigs_debug_level 1
global_tessedit_ambigs_training true

They produce some additional information.

Concerning UTF-8, I tried a quick test but I was not able to create an
image that gave a mistake on a UTF-8 letter (tesseract made mistakes only
on ASCII letters ;-) ). But it did make one mistake:
„
was interpreted as:
,,

When I created a unicharambigs file like this (fields separated by tabs):
v1
2	, ,	1	„	1

tesseract recognized „ correctly in the output („ is UTF-8).
So maybe the issues regarding UTF-8 are solved by lang.unicharambigs.
However, you should run your own (more extensive) tests for Kannada or
other Indic languages.

Zdenko.

On 01.05.2010 03:55, Sriranga(77yrsold) wrote:
> Hi,
> In your additional comments, it is stated as "first line determine the
> version of the ambigs file." -how to
> determine the version of the ambigs file? Whether the ambigs file of
> tess.3.o is supported for utf-8 say Kannada or any of Indic lang? Previous
> version of tesseract 2.xx did not support utf-8
> With regards,
> -sriranga(77yrsold)
>
> 2010/5/1 Zdenko Podobný 
>
>   
>>  Hi,
>>
>> I made a test with tesseract 3.00: I created English traineddata without
>> dawg dictionaries (eng_nodict.traineddata)  and than I run tesseract to see
>> difference (on file phototest.tif)
>> As you can see dictionary improved result especially in case of "l" vs.
>> "1".
>>
>> I put some additional comments here:
>> http://www.sk-spell.sk.cx/tesseract-ocr-en-dictionary-creating
>>
>> So dictionaries helps to improve results...
>>
>>  Zdenko
>>
>> On 30.04.2010 19:47, M. Bashir Al-Noimi wrote:
>>
>> Hi folks,
>>
>> Could you tell me what's the benefit of the dictionary in Tesseract? Does
>> it affect on recognizing decision (the result)?
>>
>> I ask this question because I'm planning to use Tesseract for recognizing
>> singles of characters not complete words.
>>
>>
>> 
>   



Re: Training tesseract for hand written letters

2010-05-17 Thread Zdenko Podobný

Hello,

can you provide more information (which OS? how did you install Tesseract?)

Zd.

On 08.05.2010 19:53, Thilanka Kaushalya wrote:
> Hi,
>
>   I'm a doing a handwritten character recognition using Tesseract. I
> tried to train the Tesseract exe for my data set. on windows
> I have followed the guide at the wiki.
> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract . But I could
> not do that.
> These are the steps I have done.
> Downloaded Tesseract 2.04.
> Created a folder named tessdata in that folder.
> Then created the following files in the tessdata folder:
>
>   - tessdata/eng.freq-dawg
>   - tessdata/eng.word-dawg
>   - tessdata/eng.user-words
>   - tessdata/eng.inttemp
>   - tessdata/eng.normproto
>   - tessdata/eng.pffmtable
>   - tessdata/eng.unicharset
>   - tessdata/eng.DangAmbigs
>
> Then I have a tiff image which contains English letter a in
> the root folder.
> Then I have entered the following command.
>
>
> tesseract a.tif fontfile batch.nochop makebox
>
>
> But in this case it gives an error saying:
>
>   read_variables_file:Can't open ./tessdata/configs/makebox
>   Unable to load unicharset file ./tessdata/eng.unicharset
>
> please can someone help me to fix this issue. Thanks in advance.
>
> Regards,
> Thilanka.
>
>
>   



Re: Tesseract on RH EL4 TIFFCleanup undefined symbol error

2010-05-18 Thread Zdenko Podobný

Hello

I usually get this error when libraries are mixed (e.g. an old version of
a library is installed but the program expects a newer one). If you did
not compile the library yourself, you should complain to the packager of
your program about the wrong handling of dependencies...

If you compiled something from source (so there could be several versions
of the same library on your system), you should take care that the
libraries are linked correctly...

You can find the required libraries this way:

which tesseract

(in my case it prints '/usr/bin/tesseract'). Then use this command:

ldd /usr/bin/tesseract
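
The ldd output can be long, so a filter may help. The following is only a sketch; missing_libs and libtiff_path are hypothetical helper names, and they assume the usual "name => path (address)" layout of ldd output.

```shell
#!/bin/sh
# Sketch: filter `ldd` output for problems with shared libraries.
# missing_libs and libtiff_path are hypothetical helper names.

# Read `ldd` output on stdin and print libraries the loader could not find.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Read `ldd` output on stdin and print the resolved path of libtiff, if any.
libtiff_path() {
    awk '$1 ~ /^libtiff/ && $2 == "=>" && $3 != "not" { print $3 }'
}

# Typical use (requires tesseract on PATH):
#   ldd "$(which tesseract)" | missing_libs
#   ldd "$(which tesseract)" | libtiff_path
```

If libtiff_path points to an older libtiff than the one tesseract was built against, that would be consistent with an "undefined symbol" error like the one quoted below.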

BR,

Zd.

On 11.05.2010 00:49, Greg wrote:
> So, I've installed Tesseract on redhat EL4. It was quite a lot of
> effort and Google searching.
>
>   
>> tesseract page9.tif out -l eng
>> 
> Tesseract Open Source OCR Engine with Leptonica
> tesseract: symbol lookup error: tesseract: undefined symbol:
> TIFFCleanup
>
>   
>> 
> Any idea what this error message means? I can't find it with Google.
>
> Perhaps it's related to my libtiff-devel dependancy. If so, anyone
> know where to find an rpm for EL4?
>
>   



Re: Integrating Tesseract with another open source project

2010-05-22 Thread Zdenko Podobný

see http://code.google.com/p/tesseract-ocr/wiki/ReadMe:

Another important change is that you should *really* be using
TessBaseAPI if you are linking with another program. In Linux
(non-Windows) the main library is now libtesseract_api.a instead of
the old libtesseract_full.a. In windows, use the define
TESSDLL_IMPORTS before including baseapi.h in your code to get the
symbols of the TessBaseAPI class.


Zd.

On 21.05.2010 19:21, Thilanka wrote:
> Hi,
>
> I'm working with a the Sahana OCR project for my gsoc session.
> In this I'm planning to use Tesseract for the character recognition in
> the Sahana OCR project(is it an opensource project). The Sahana OCR
> code has written in Visual C++. We cannot use the Tesseract exe for
> our project. So I'm planing to join the Tesseract code with the Sahana
> OCR code. But I don't have a good understanding about the Tesseract
> Architecture and how I can integrate the two sources codes of the
> Sahana and Tesseract together. So can some one please helpm me on this
> problem.
>
> Regards,
> Thilanka.
>
>
>
> --
> http://coders-view.blogspot.com/
> http://thilankagekawuluwa.blogspot.com/
> http://twitter.com/thilanka_k
>
>   



Re: Extracting files from .tessdata

2010-05-22 Thread Zdenko Podobný

Hello Ramon,

tesseract-ocr is developed by Google (see
http://groups.google.com/group/tesseract-ocr/msg/7408c699e27db341). I
hope that, once all or some of the issues are solved, the final version of
tesseract-ocr 3.00 will be released with the tif+box files included...

Zd.

On 20.05.2010 10:53, Ramon wrote:
> Hi Zdenko,
>
> After some tests, I realized I need the tiff pair boxes that the
> creators used to generate Catalan tessdata file.
>
> Do you know a way to contact to them?
>
> Ramon.

Re: Danish fraktur support in r319

2010-05-24 Thread Zdenko Podobný

On 24.05.2010 19:46, Jimmy O'Regan wrote:
> On 24 May 2010 17:41, Lars Aronsson  wrote:
>   
>> Peter Alberti wrote:
>> 
> I've trained tesseract r319 (3.0) to support Danish texts written in
> fraktur. It is not
> perfect but good enough that I hope it may be useful to others.
>   
>> Jimmy O'Regan wrote:
>> 
>>> With the current SVN version, you can use combine_tessdata -e
>>> [trainingdata file] [files to extract] to extract the components you
>>> want, and combine_tessdata [path to files] to make a new trainingdata
>>> file.
>>>   
>> I tried to compile the current version (svn -r354 up), but failed:
>>
>> svshowim.cpp: In function ‘void sv_show_sub_image(IMAGE*, inT32, inT32,
>> inT32, inT32, ScrollView*, inT32, inT32)’:
>> svshowim.cpp:37: error: no matching function for call to
>> ‘ScrollView::Image(Pix*&, inT32&, int)’
>> ../viewer/scrollview.h:266: note: candidates are: void
>> ScrollView::Image(const char*, int, int)
>>
>> Versions 340, 351, 352, 353 also failed in the same place.
>>
>> 
> Looks like a pair of missing casts - have you opened an issue?
>   
I did: http://code.google.com/p/tesseract-ocr/issues/detail?id=303





Re: Danish fraktur support in r319

2010-05-24 Thread Zdenko Podobný

On 24.05.2010 20:26, Jimmy O'Regan wrote:
> 2010/5/24 Zdenko Podobný :
>   
>> On 24.05.2010 19:46, Jimmy O'Regan wrote:
>> 
>>> On 24 May 2010 17:41, Lars Aronsson  wrote:
>>>
>>>   
>>>> Peter Alberti wrote:
>>>>
>>>> 
>>>>>>> I've trained tesseract r319 (3.0) to support Danish texts written in
>>>>>>> fraktur. It is not
>>>>>>> perfect but good enough that I hope it may be useful to others.
>>>>>>>
>>>>>>>   
>>>> Jimmy O'Regan wrote:
>>>>
>>>> 
>>>>> With the current SVN version, you can use combine_tessdata -e
>>>>> [trainingdata file] [files to extract] to extract the components you
>>>>> want, and combine_tessdata [path to files] to make a new trainingdata
>>>>> file.
>>>>>
>>>>>   
>>>> I tried to compile the current version (svn -r354 up), but failed:
>>>>
>>>> svshowim.cpp: In function ‘void sv_show_sub_image(IMAGE*, inT32, inT32,
>>>> inT32, inT32, ScrollView*, inT32, inT32)’:
>>>> svshowim.cpp:37: error: no matching function for call to
>>>> ‘ScrollView::Image(Pix*&, inT32&, int)’
>>>> ../viewer/scrollview.h:266: note: candidates are: void
>>>> ScrollView::Image(const char*, int, int)
>>>>
>>>> Versions 340, 351, 352, 353 also failed in the same place.
>>>>
>>>>
>>>> 
>>> Looks like a pair of missing casts - have you opened an issue?
>>>
>>>   
>> I did: http://code.google.com/p/tesseract-ocr/issues/detail?id=303
>>
>> 
> Weird. It's there:
>337 theraysmith #ifdef HAVE_LIBLEPT
>149 theraysmith // Draw a Pix on (x,y).
>149 theraysmith   void Image(struct Pix* image, int x_pos, int y_pos);
>337 theraysmith #endif
>
> The only thing I can think of is that you might need to make clean
> before running make.
>
>
>   
I deleted the whole tesseract directory and downloaded a fresh svn copy.
Then I ran ./runautoconf; ./configure; make
and I got the same result...

So I deleted lines 204 and 207 (#ifdef HAVE_LIBLEPT, #endif) from
viewer/scrollview.h

Then compilation continued up to this error:

g++ -DHAVE_CONFIG_H -I. -I..  -I../ccutil -I../ccstruct -I../image
-I../viewer -I../ccops -I../dict -I../classify -I../ccmain -I../wordrec
-I../cutil -I../textord -I/usr/local/include/liblept  -g -O2 -MT
tesseractmain.o -MD -MP -MF .deps/tesseractmain.Tpo -c -o
tesseractmain.o tesseractmain.cpp
tesseractmain.cpp: In function ‘int main(int, char**)’:
tesseractmain.cpp:227: error: ‘pix’ was not declared in this scope

So I put a declaration at line 225 in api/tesseractmain.cpp:
PIX *pix;

Then compilation continued, but soon it failed with this message:

g++  -g -O2   -o tesseract tesseractmain.o libtesseract_api.a -llept
-ltiff -lpthread -ljpeg -lpng -lz  -lm
libtesseract_api.a(libtesseract_api.o): In function
`sv_show_sub_image(IMAGE*, int, int, int, int, ScrollView*, int, int)':
/usr/src/tesseract-ocr-r354/image/svshowim.cpp:37: undefined reference
to `ScrollView::Image(Pix*, int, int)'
collect2: ld returned 1 exit status
make[3]: *** [tesseract] Error 1

I am not able to solve this. It looks to me like a problem with leptonlib
(I have leptonlib-1.65): all the problems were between #ifdef HAVE_LIBLEPT
and #endif. But I was able to build r326 without problems with leptonlib-1.65.

Zd.




Re: Danish fraktur support in r319

2010-05-24 Thread Zdenko Podobný


On 24.05.2010 21:39, Lars Aronsson wrote:
> Jimmy O'Regan wrote:
> ‘ScrollView::Image(Pix*&, inT32&, int)’
> ../viewer/scrollview.h:266: note: candidates are: void
> ScrollView::Image(const char*, int, int)
>
>> Weird. It's there:
>>337 theraysmith #ifdef HAVE_LIBLEPT
>>149 theraysmith // Draw a Pix on (x,y).
>>149 theraysmith   void Image(struct Pix* image, int x_pos, int
>> y_pos);
>>337 theraysmith #endif
>>
>> The only thing I can think of is that you might need to make clean
>> before running make.
>
> I should add that I have an AMD Athlon 64 bit CPU.
> The bug report filed by Zdenko also says
> "tesseract-ocr r354
> ,
> Mandrivalinux 2010.1 64bit",
> but the compiler error message is full of "inT32"
> and the prototype above says "int".
>
>
I think I have found a workaround:
1. Add "PIX *pix;" at line 225 in api/tesseractmain.cpp
2. Then configure and compile tesseract with these commands:

./runautoconf
CPPFLAGS="-DHAVE_LIBLEPT" ./configure ; make
make install

Can you confirm whether it works for you?

Then I ran:

/usr/local/bin/tesseract eurotext.tif output2

It produces these error messages:

Tesseract Open Source OCR Engine with Leptonica
Warning in pixReadStreamTiff: tiff page 1 not found
Error in pixReadTiff: pix not read

But it also produces output (output2.txt) :-[

Zd.


  




Re: Call for testers...

2010-05-26 Thread Zdenko Podobný

Hello,

the compilation process went without problems:

./runautoconf
./configure
make

Then I installed it:

sudo make install

When I tried to run it:

/usr/local/bin/tesseract

I got this error:

/usr/local/bin/tesseract: error while loading shared libraries:
libtesseract_api.so.3: cannot open shared object file: No such file
or directory

After I ran:

sudo ldconfig

it started to work.

sudo make uninstall

finished with an error:

Making uninstall in java
make[1]: Entering directory `/usr/src/tesseract-ocr-r370/java'
make[1]: *** No rule to make target `uninstall'.  Stop.
make[1]: Leaving directory `/usr/src/tesseract-ocr-r370/java'
make: *** [uninstall-recursive] Error 1

When I run 'sudo make uninstall' and then 'sudo make install' again, I do
not need to run 'ldconfig'. So I do not know whether there is a problem or
not. The log from 'sudo make install' is attached.
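
To see whether the loader cache already knows the new library (which is what 'sudo ldconfig' refreshes), a check along these lines could be used. It is only a sketch: lib_in_cache is a hypothetical helper name, and it assumes the usual "name (arch) => path" lines printed by 'ldconfig -p'.

```shell
#!/bin/sh
# Sketch: check whether a shared library is listed in the loader cache.
# lib_in_cache is a hypothetical helper name.

# Read `ldconfig -p` output on stdin; succeed if library $1 is listed.
lib_in_cache() {
    awk -v lib="$1" '$1 == lib { found = 1 } END { exit !found }'
}

# Typical use after `sudo make install`:
#   if ldconfig -p | lib_in_cache libtesseract_api.so.3; then
#       echo "library is in the cache"
#   else
#       echo "run: sudo ldconfig"
#   fi
```

The library name is matched exactly against the first field, so other libraries with a similar prefix do not count.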

Tested on: Mandriva Linux release 2010.1 (Cooker) for x86_64

Zd.

On 26.05.2010 16:23, Jimmy O'Regan wrote:
> I've just updated the SVN version to use libtool (and shared
> libraries, that sort of thing) but it's only tested on Ubuntu Lucid.
>
> Anyone care to take it for a test run?
>
>   


make.log.gz
Description: GNU Zip compressed data


smime.p7s
Description: S/MIME Cryptographic Signature

Re: Call for testers...

2010-05-26 Thread Zdenko Podobný



On 26.05.2010 20:52, Jimmy O'Regan wrote:
> 2010/5/26 Zdenko Podobný :
>   
>> Hello,
>>
>> compilation process without problem:
>>
>> ./runautoconf
>> ./configure
>> make
>>
>> Than I installed it:
>>
>> sudo make install
>>
>> When I tried to run it:
>>
>> /usr/local/bin/tesseract
>>
>> I got error:
>>
>> /usr/local/bin/tesseract: error while loading shared libraries:
>> libtesseract_api.so.3: cannot open shared object file: No such file or
>> directory
>>
>> After I run:
>>
>> sudo ldconfig
>>
>> it start to works
>>
>> sudo make uninstall
>>
>> Finished with error:
>>
>> Making uninstall in java
>> make[1]: Entering directory `/usr/src/tesseract-ocr-r370/java'
>> make[1]: *** No rule to make target `uninstall'.  Stop.
>> make[1]: Leaving directory `/usr/src/tesseract-ocr-r370/java'
>> make: *** [uninstall-recursive] Error 1
>>
>> 
> I didn't do anything with the Java stuff - did that work before?
>
>   
I tested uninstall in r319 and it did not work there either...





Re: Call for testers...

2010-05-27 Thread Zdenko Podobný


Dňa 26.05.2010 21:28, Jimmy O'Regan  wrote / napísal(a):
> 2010/5/26 Zdenko Podobný :
>   
>> Dňa 26.05.2010 20:52, Jimmy O'Regan wrote / napísal(a):
>> 
>>>>
>>>> 
>>> I didn't do anything with the Java stuff - did that work before?
>>>
>>>
>>>   
>> I tested uninstall in r319 and it did not worked there...
>>
>> 
> Ok, glad to know I didn't break something I hadn't been aware of touching :D
>
>   
I found the reason and a solution: there is also an unfinished 'makefile' (in
the java directory) that is not handled by autotools ;-) and make prefers it
over 'Makefile'.
To solve this issue, just add these 2 lines to the end of 'makefile' (or
move the relevant parts of makefile into Makefile.in and remove makefile).
Note that the second line must start with a tab:

uninstall uninstall-am:
	make -f Makefile $@

Otherwise tesseract cannot be uninstalled with 'make uninstall'.

Zd.




Re: Generated ZERO tr..BLOBS IN r379 and 869 tr.blobs in r-319

2010-06-05 Thread Zdenko Podobný

Dn(a 05.06.2010 14:57, Jimmy O'Regan  wrote / napísal(a):
> On Saturday, June 5, 2010, zdpo  wrote:
>   
>> Dear Sriranga,
>>
>> your box file is wrong (for tesseract 3.0 and >r319). It does not match
>> the description in "Make Box Files" on 
>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract.
>>
>> BTW: I am not aware of any tool that supports this new box format (for
>> multipage tif).
>>
>> 
> it shouldn't matter. The code is supposed to accept the old style too,
> provided that the number of pages is set to zero, which is determined
> by the image reading code, which doesn't work on windows.
>
> If it fails on Linux, then I'd consider it a bug.
>
>   

/usr/local/bin/tesseract slk.arial.001.tif slk.arial.001 makebox batch.nochop

created a slk.arial.001.box file with 6 columns (the last one containing 0).
When I run:

/usr/local/bin/unicharset_extractor slk.arial.001.box

the output is OK. When I convert it to the 2.x format with
awk '{print $1" "$2" "$3" "$4" "$5}' slk.arial.002.box
and run:

/usr/local/bin/unicharset_extractor slk.arial.002.box

I got errors:

Extracting unicharset from slk.arial.002.box
Box file format error on line 1 ignored
...

Anyway, if Sriranga's tesseract 3.0 produced the old format, then something
is wrong in (his/windows) installation/compilation process. Or maybe he
simply mixed outputs from tesseract 2.x with 3.0...
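
For reference, the column-dropping conversion done with awk above can be sketched in Python as follows. This is only an illustration of the two layouts described in this message (6 columns with a trailing page number for 3.0, 5 columns for 2.x); the function name box3_to_box2 is made up, and the sketch says nothing about which layout a given unicharset_extractor build accepts.

```python
def box3_to_box2(lines):
    """Drop the trailing page column from 6-column box file lines.

    Lines that do not have exactly 6 whitespace-separated fields are
    passed through unchanged (apart from whitespace normalisation).
    """
    converted = []
    for line in lines:
        fields = line.split()
        if len(fields) == 6:
            fields = fields[:5]  # keep: symbol, left, bottom, right, top
        converted.append(" ".join(fields))
    return converted


if __name__ == "__main__":
    sample = ["a 10 20 30 40 0", "b 35 20 50 40 0"]
    for row in box3_to_box2(sample):
        print(row)
```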

Zd.


Re: Forking tesseract.

2010-06-09 Thread Zdenko Podobný

Hello,

do you intend to release also tiff/box files for (new) languages (in )

Can you provide some short example for punc-dawg and number-dawg file?

BR,

Zd.

Dn(a 25.05.2010 06:44, Ray Smith  wrote / napísal(a):
> I would be very happy for someone to take over maintenance of the autotools
> part of tesseract. Even better if a team of you can do it... I don't get
> much time to deal with that, and it doesn't get much priority, since we have
> our own build system, and windows has to have its own. With someone looking
> after the build side, I am hopeful that, after 3.00 becomes a tarball, I can
> keep the svn trunk fully up-to-date with the source code and then maybe you
> guys can decide when it is a good time to make a new tarball release.
>
> I made a big hole in the issues list last week, and will attempt to work
> through the rest this week, as there are useful patches in there that should
> be applied, and useful bug reports for bugs that can be fixed. WIth the
> issues list down to a more manageable size, it should be easier to keep up
> with it. There is too much for me to manage on my own though, and it is
> overwhelming to see that just about every wiki page has as many comments
> attached as there are open issues
>
> I saved a lot of time by putting a filter on the forum, but that meant I
> didn't look at it either, which is not satisfactory. I created the
> tesseract-dev forum for developers specifically, but it didn't take off. It
> would help to have a division between the more mundane parts of the forum
> and the other items that require my specific attention.
>
> So please, anyone who wants to help out maintain this site, rather than fork
> it, let me know, and I will add you to the list of developers. We are still
> actively developing the code at Google, and I want to be able to get the
> code out where people can use it.
>
> Ray.
>
> On Fri, May 21, 2010 at 5:17 AM, Jimmy O'Regan  wrote:
>
>   
>> On 14 May 2010, at 14:20, MARTIN Pierre  wrote:
>>
>>  I have created new autotools files so that Tesseract can be built as
>> 
> shared libraries (using libtool), which would allow other projects to
> link against it much more easily. Unfortunately, the Linux
> distributions (admittedly just Gentoo so far) are reluctant to use
> these changes without them being accepted upstream.
>
>   
 I sympathize with your position.  For over a year, I have been
 maintaining a local branch tracking the tesseract-ocr svn trunk with
 some patches applied that do pretty much the same thing you're
 describing, for some personal projects.  I've also been building my
 own .debs for Ubuntu for easy deployment in some projects I'm working
 on.

 
>>> i'm still very enthusiast with this project of forking Tesseract. But as i
>>> said before, i won't do it alone, and i had not hear about you guys. What
>>> amount of time and what skills could you be dedicating to this project?
>>>
>>>   
>> FWIW, there has been some recent activity in SVN, and several issues that
>> had patches attached have been committed. If you haven't already submitted
>> an issue+patch, perhaps now is the time to do so.
>>
>>
>>
>>
>> 
>   



Re: * glibc detected * tesseract: double free or corruption

2010-07-13 Thread Zdenko Podobný

Hi,

I have Mandriva 2010.1 64 bit (upgraded from 2010) with
tesseract-2.04-5mdv2010.1.

I can run this without problems (in xterm or konsole):

/usr/bin/tesseract resized_slide7.tif resized_slide7


which produces the output (resized_slide7.txt).

BUT:

/usr/bin/tesseract resized_slide5.tif resized_slide5
/usr/bin/tesseract resized_slide6.tif resized_slide6


fail with this error message:
check_legal_image_size:Error:Only 1,2,4,5,6,8 bpp are supported:16

I also tried tesseract 3.00 (r402 with leptonica 1.65), but it failed on
resized_slide5.tif and resized_slide6.tif (resized_slide7.tif is ok)
with these messages:
Error in pixReadFromTiffStream: spp not in set {1,3}
Error in pixReadStreamTiff: pix not read
Error in pixReadTiff: pix not read

So it looks like a problem with the tif files (alpha channel?)...
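
Both error messages point at the sample layout of the image (bits per pixel and samples per pixel), so it can help to look at those two values directly. The following is a minimal sketch, not part of tesseract or leptonica: it reads only the first image directory of a classic TIFF, assumes both tags are stored as SHORT values, and the function name tiff_sample_info is made up. Tag numbers 258 (BitsPerSample) and 277 (SamplesPerPixel) are the standard TIFF ones.

```python
import struct


def tiff_sample_info(data):
    """Return (bits_per_sample_list, samples_per_pixel) for the first IFD."""
    if data[:2] == b"II":
        endian = "<"  # little-endian file
    elif data[:2] == b"MM":
        endian = ">"  # big-endian file
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    bits, spp = [1], 1  # TIFF defaults when the tags are absent
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i
        tag, _type, count = struct.unpack(endian + "HHI", data[start:start + 8])
        value_field = data[start + 8:start + 12]
        if tag == 277:  # SamplesPerPixel
            (spp,) = struct.unpack(endian + "H", value_field[:2])
        elif tag == 258:  # BitsPerSample, one value per sample
            if count <= 2:
                # Up to two SHORTs fit directly in the 4-byte value field.
                raw = value_field[:2 * count]
            else:
                # Otherwise the field holds an offset to the values.
                (offset,) = struct.unpack(endian + "I", value_field)
                raw = data[offset:offset + 2 * count]
            bits = list(struct.unpack(endian + "%dH" % count, raw))
    return bits, spp


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as handle:
        print(tiff_sample_info(handle.read()))
```

A result with 4 samples per pixel would be consistent with the alpha-channel guess above, since the leptonica message only allows 1 or 3.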

Zd.

Dn(a 13.07.2010 02:59, msjs08  wrote / napísal(a):
> Hi
> I installed the only version available through the Mandriva control
> centre.  Version 2.04
> The imagereader thing is a gui front end to tesseract. It doesn't
> interpret the error messages.
>
> My first problem was that I wasn't feeding 8bit images to it.
> I'm getting the image recognised now but it still segfaults.
> It seems to occur AFTER it has written the text file.
>
> I'm sending 3 pics to Jimmy O'Regan
>
> Cheers
> SJ
>
>
> On 12/07/10 22:28, zdenko podobny wrote:
>> Hello,
>>
>> How did you installed Tesseract? Which version?
>> Please provide more information.
>>
>> Zd.
>>
>> On Sun, Jul 11, 2010 at 6:16 PM, msjs08 > > wrote:
>>
>>
>> I've installed Tesseract on Mandriva 2010 (64 bit) and I can't get
>> it to run.
>> It just segfaults.
>> I installed gimagereader. This is the error I got when I tried to
>> use gimagereader
>>
>> [r...@desktop test extract]# tesseract slide7.tif textfile.txt
>> Tesseract Open Source OCR Engine
>> *** glibc detected *** tesseract: double free or corruption
>> (!prev): 0x00a80de0 ***
>> === Backtrace: =
>> /lib64/libc.so.6[0x7fa9e5956bf6]
>> /lib64/libc.so.6(cfree+0x6f)[0x7fa9e595b6bf]
>> /lib64/libc.so.6(__cxa_finalize+0xa5)[0x7fa9e591a5f5]
>> /usr/lib64/libtesseract_full.so.2[0x7fa9e68b97b6]
>> === Memory map: 
>> 0040-00537000 r-xp  08:05 179583 
>>   /usr/bin/tesseract
>> 00737000-0073a000 r--p 00137000 08:05 179583 
>>   /usr/bin/tesseract
>> 0073a000-00741000 rw-p 0013a000 08:05 179583 
>>   /usr/bin/tesseract
>> 00741000-007cc000 rw-p  00:00 0
>> 00a72000-00dd9000 rw-p  00:00 0 
>>[heap]
>> 7fa9e000-7fa9e0021000 rw-p  00:00 0
>> 7fa9e0021000-7fa9e400 ---p  00:00 0
>> 7fa9e528d000-7fa9e52a2000 r-xp  08:05 131442 
>>   /lib64/libz.so.1.2.3
>> 7fa9e52a2000-7fa9e54a1000 ---p 00015000 08:05 131442 
>>   /lib64/libz.so.1.2.3
>> 7fa9e54a1000-7fa9e54a2000 rw-p 00014000 08:05 131442 
>>   /lib64/libz.so.1.2.3
>> 7fa9e54a2000-7fa9e54d8000 r-xp  08:05 131605 
>>   /usr/lib64/libjpeg.so.7.0.0
>> 7fa9e54d8000-7fa9e56d8000 ---p 00036000 08:05 131605 
>>   /usr/lib64/libjpeg.so.7.0.0
>> 7fa9e56d8000-7fa9e56d9000 r--p 00036000 08:05 131605 
>>   /usr/lib64/libjpeg.so.7.0.0
>> 7fa9e56d9000-7fa9e56da000 rw-p 00037000 08:05 131605 
>>   /usr/lib64/libjpeg.so.7.0.0
>> 7fa9e56da000-7fa9e56e3000 r-xp  08:05 133782 
>>   /usr/lib64/libjbig.so.1.0.0
>> 7fa9e56e3000-7fa9e58e2000 ---p 9000 08:05 133782 
>>   /usr/lib64/libjbig.so.1.0.0
>> 7fa9e58e2000-7fa9e58e3000 r--p 8000 08:05 133782 
>>   /usr/lib64/libjbig.so.1.0.0
>> 7fa9e58e3000-7fa9e58e6000 rw-p 9000 08:05 133782 
>>   /usr/lib64/libjbig.so.1.0.0
>> 7fa9e58e6000-7fa9e5a3a000 r-xp  08:05 130868 
>>   /lib64/libc-2.10.1.so 
>> 7fa9e5a3a000-7fa9e5c3a000 ---p 00154000 08:05 130868 
>>   /lib64/libc-2.10.1.so 
>> 7fa9e5c3a000-7fa9e5c3e000 r--p 00154000 08:05 130868 
>>   /lib64/libc-2.10.1.so 
>> 7fa9e5c3e000-7fa9e5c3f000 rw-p 00158000 08:05 130868 
>>   /lib64/libc-2.10.1.so 
>> 7fa9e5c3f000-7fa9e5c44000 rw-p  00:00 0
>> 7fa9e5c44000-7fa9e5c5a000 r-xp  08:05 131433 
>>   /lib64/libgcc_s-4.4.1.so.1
>> 7fa9e5c5a000-7fa9e5e59000 ---p 00016000 08:05 131433 
>>   /lib64/libgcc_s-4.4.1.so.1
>> 7fa9e5e59000-7fa9e5e5a000 rw-p 00015000 08:05 131433 
>>   /lib64/libgcc_s-4.4.1.so.1
>> 7fa9e5

Re: Localisation

2010-07-13 Thread Zdenko Podobný


Dn(a 13.07.2010 12:29, Jimmy O'Regan  wrote / napísal(a):
> On 12 July 2010 20:19, Jeffrey Ratcliffe  wrote:
>   
>> On 12 July 2010 17:16, Jimmy O'Regan  wrote:
>> 
>>> Is anybody interested in seeing localisation support in Tesseract?
>>> (Which begs the follow-up question: is anybody willing to contribute
>>> translations for their language(s)?)
>>>   
>> I would add the support, and then upload the .pot to rosetta on
>> launchpad. If you build it, they will come...
>>
>> 
> Ok, we'll call that 2 votes for localisation.
>
> Launchpad? I'll admit that I haven't looked at it in quite a while,
> but I remember it being poor quality software producing poor quality
> translations; I've heard that the quality of the translations has
> improved quite drastically, but I'd still prefer software designed by
> translators for translators, like Pootle.
>
> On a bit of a tangent, while there clearly is a community here, none
> of us, I feel, really 'knows' each other. I wouldn't normally consider
> this an issue, but it is here - I *want* to trust your opinion,
> because you're the Debian maintainer, but I don't 'know' you.
>
> So we'll call that one for Launchpad, one against.
>
> I don't really know Zdenko either, but I am quite familiar with some
> of his work, so I'd consider his opinion in this matter expert, and
> give him the casting vote.
>
>   
:-[ :-\ =-O

Yes, I would like to see translated tesseract.

Zdenko
>> gscan2pdf has at least partial translations for 33 languages - and I
>> have done no more than upload the .pot every release.
>>
>> 
> That brings up something else I'd been thinking about bringing to the
> list; I'll start another thread.
>
>   


Re: Can't get the user dictionary to work

2010-08-01 Thread Zdenko Podobný

Dn(a 30.07.2010 15:04, patrickq  wrote / napísal(a):
> This what I did:
>
> 1. Created a text file called eng.user-words, containing:
> Chest
> Chestnut
> Floor
> Vice
>
> 2. Placed the file in the tessdata folder (next to eng.traineddata)
>
> 3. Ran recognition on an image returning "Chesf" instead of "Chest"
> and "Fioor" instead of "Floor". Both mistaken "f" and "i" look quite
> right visually so I can only assume their confidence level would be
> low (but I didn't check).
>
> No effect whatsoever - zip. I can only assume that a variable must be
> set or a function needs to be called to turn this on (even though
> there is no mention of needing to set anything in the documentation)
> or (most likely) I just don't understand how this works and the
> dictionary kicks in only on the day or the summer solstice and when
> there is a full moon or something.
>   

I played with strace & grep and found out that the user dictionary is not
used (opened) in a standard installation (svn revision 447).

When I set the variable "global_user_words_suffix" to "user-words" (or
something else you like ;-) ) tesseract opened the user dictionary file.

global_user_words_suffix can be found in 2 files:
dict/dict.h: extern STRING_VAR_H(global_user_words_suffix, "user-words", "A list of user-provided words.");
dict/permute.cpp: STRING_VAR(global_user_words_suffix, "", "A list of user-provided words.");

I believe the problem is in dict/permute.cpp, which defines this variable as
an empty string.
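
One way to try this without recompiling might be a config file. The sketch below is an assumption, not a confirmed recipe: this message does not say how the variable was set, and it is not certain that this particular variable is honoured when read from a config file. The "name value" line format is the one used for tessedit_char_whitelist elsewhere on this list; the helper names and the config file name "userwords" are made up.

```python
def user_words_config(suffix="user-words"):
    """Return the text of a config file that sets the user-words suffix."""
    return "global_user_words_suffix %s\n" % suffix


def tesseract_command(image, output_base, lang, config_path):
    """Build a command line: tesseract image outputbase -l lang configfile."""
    return ["tesseract", image, output_base, "-l", lang, config_path]


if __name__ == "__main__":
    # Hypothetical file name; it would live in tessdata/configs/.
    with open("userwords", "w") as handle:
        handle.write(user_words_config())
    print(" ".join(tesseract_command("page.tif", "page", "eng", "userwords")))
```

Running the resulting command under strace, as described above, would show whether eng.user-words is actually opened.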

Zd.


Re: Improving accuracy on Tesseract 3.0 (also Issue 265)

2010-08-01 Thread Zdenko Podobný


Dňa 28.07.2010 17:02, Jimmy O'Regan wrote / napísal(a):
>
 I grepped the code and it seems to be looking for something called
 LANG.user-words, but that didn't seem to do anything -- I got the same
 garbled text when I ran Tesseract 3 the second time.
 
>> Turns out T3 doesn't even access $LANG.user-words. I suspect it's looking
>> for it in the traineddata file...
>>
>> 
> Hmm... probably... which is quite a stupid thing to do, really, but I
> presume nobody in Google actually uses this, so it's probably quite
> neglected.
>
> I'm toying with the idea of adding support for an actual *user* list -
> i.e., that tesseract would check $HOME/.tesseract/lang.user-words -
> because assuming a single user system that the user has full control over is 
> still a braindamaged assumption.
>   
Just an idea: maybe this should be handled by an environment variable. If I
set:
export TESSDATA_PREFIX=~/.
tesseract will try to get ALL files from "$HOME/.tessdata".

The problem is that if tesseract does not find all needed files (e.g.
eng.traineddata) in $TESSDATA_PREFIX, it stops... (i.e. it will not look in
the "standard" installation directories like /usr/share/tessdata or
/usr/local/share/tessdata).

I tried "export TESSDATA_PREFIX=~/.:/usr/local/share/tessdata" but it
did not work (tesseract tried to open the file
"/home/zdeno/.:/usr/local/share/tessdatatessdata/eng.traineddata", which is
not correct).
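
The paths reported above are consistent with plain string concatenation: the prefix, then "tessdata/", then the language file name, with no splitting on ':'. The sketch below only reproduces that observed behaviour; it is not taken from the tesseract source, and the function name traineddata_path is made up.

```python
def traineddata_path(tessdata_prefix, lang="eng"):
    """Rebuild the path tesseract appears to open for a given prefix.

    The prefix is used verbatim, so a ':'-separated list is treated as
    one (nonexistent) directory name rather than as a search path.
    """
    return tessdata_prefix + "tessdata/" + lang + ".traineddata"


if __name__ == "__main__":
    print(traineddata_path("/home/zdeno/."))
    print(traineddata_path("/home/zdeno/.:/usr/local/share/tessdata"))
```

This also explains why "~/." as a prefix ends up pointing at "$HOME/.tessdata".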


Zd.


Re: Announcement: new version of pyTesseractTrainer available

2010-08-13 Thread Zdenko Podobný


Dňa 13.08.2010 23:19, Robert Komar wrote / napísal(a):
> On Fri, 13 Aug 2010, zdenko podobny wrote:
>
>> Because IFAIK nobody react on Catalin e-mail I offered him to create
>> project
>> to collect patches and possibly to solve known issues. Because of my low
>> time resource project is looking still for owner/contributors. Warmly
>> welcomed are expect for python (multi-platform) GUI (GTK/QT/wx...)
>> because performance issues - on Windows XP (2GB memory) script crash or
>> freezes during opening file with a lot of boxes/symbols (e.g.
>> eng.arial.g4.tif), on Mandrivalinux 2010.164 bit (6GB memory) it take to
>> open&display 15 minutes!
>
> 15 minutes! You need to do some profiling on your code to see
> where it's spending all its time.
>
> http://docs.python.org/library/profile.html
>
For the moment I have not identified a problem in the "algorithm" part of the
code ;-). I see the problem in the "display" (pyGTK) part. The script creates
a gtk.Entry for each box and packs it into an hbox container. So in the case
of the eng.arial.g4.box file it creates 4968 UI elements for boxes, plus a
number of gtk.Labels for the spaces between words/groups of symbols. I am not
sure whether any UI toolkit can handle such a number of elements in reasonable
time with reasonable resources. That is why the project needs an expert in
GUI programming to suggest a more efficient approach.
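
To back up a suspicion like this with numbers, the profiling suggested above can be done with the standard library. The following is a minimal sketch: profile_call is a made-up helper, and build_widgets is only a stand-in for the real code that creates one widget per box (it does not use pyGTK at all).

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report of top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def build_widgets(count):
    """Stand-in for the per-box widget creation loop (hypothetical)."""
    return ["entry-%d" % i for i in range(count)]


if __name__ == "__main__":
    widgets, report = profile_call(build_widgets, 4968)
    print(len(widgets))
    print(report)
```

If the widget-creation function dominates the cumulative column, that would support looking at the display code rather than the box-parsing code.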

I also wonder whether there is an issue on (my ;-) ) linux. When I try to
open an A5 image scan with 1627 boxes on Windows, it is displayed within a
few seconds... But on linux it took 1 min 45 sec... But this is more for
discussion on http://groups.google.com/group/pytesseracttrainer-users ;-)

Just take this as a warning if you are an end user, or as a challenge if you
are a programmer :-)

Zd.


Re: Announcement: new version of pyTesseractTrainer available

2010-08-13 Thread Zdenko Podobný


Dňa 14.08.2010 00:17, Jimmy O'Regan wrote / napísal(a):
> On 13 August 2010 21:54, zdenko podobny  wrote:
>   
>> Hello,
>> I would like to announce new version 1.01 of pyTesseractTrainer - successor
>> of tesseractTrainer.py Version 1.00 is identical with tesseractTrainer.py.
>> Features:
>>
>> visual editor of box file
>> layout of symbol from box file reflect symbols on image
>> possibility to define bold, italic, and underline font
>> deleting, joining, splitting of symbols/boxes
>> easy and exact way of adjusting boxes
>> support for opening different image formats (tiff, png, jpeg, bmp, gif)
>> multi-platform support (tested on Linux 64 bit and Windows XP)
>>
>> Buxfixes (in 1.01):
>>
>> unicode support
>> 
> Ooh. No mean feat, 'cause Python sucks at Unicode :)
>
>   
>> opening of tesseract v3.00 box file (but save support only v2.0x box file)
>> identify/imagick is not need anymore
>> correct error that block to open file on Windows
>> solved issues regarding training symbols @ and $ (used also to identify bold
>> and italic font)
>> workaround for missing Numeric support in PyGTK
>>
>> Because IFAIK nobody react on Catalin e-mail I offered him to create project
>> to collect patches and possibly to solve known issues. Because of my low
>> time resource project is looking still for owner/contributors. Warmly
>> 
> I would recommend creating a project somewhere that offers distributed
> VCS support, that way you don't have the 'owner goes missing, no-one
> can commit problem'.
>
> As it's written in Python, Launchpad is probably the best place. The
> Ubuntu folks are big fans of Python, and it'll probably be relatively
> easy to recruit.
>
> On a related note, for anyone who likes Bazaar, there's a mirror of
> Tesseract's code on Launchpad. I'm not quite up to speed on bzr, but
> if someone sends me a link to a branch, I'll (figure out how to :)
> merge it to SVN.
>
>   
>> welcomed are expect for python (multi-platform) GUI (GTK/QT/wx...)
>>  because performance issues - on Windows XP (2GB memory) script crash or
>> freezes during opening file with a lot of boxes/symbols (e.g.
>> eng.arial.g4.tif), on Mandrivalinux 2010.164 bit (6GB memory) it take to
>> open&display 15 minutes!
>> 
> Ouch! I guess there's a lot of copying of image regions going on when
> all you really want is a reference. What's the graphics library? PIL?
>
>   
The script depends on python & pygtk only (no PIL; it does not even import
cairo :-) ).
For now I wanted to solve some issues for happy tesseractTrainer.py
users. So there are no UI changes or additional features at the moment.

Zd.


Re: Tesseract Training Problem (under Mac)

2010-09-06 Thread Zdenko Podobný


So the details for training are now split into 2 wiki pages:

http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract2
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Unfortunately the (now irrelevant) comments remain on 
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract ;-)


Zd.

Dňa 05.09.2010 12:01, zdenko podobny  wrote / napísal(a):
> Hello,
>
> Tesseract 2.04 do not use "combined" file, so there is no combine_tessdata.
> Just copy your files to tessdata directory.
>
> At the moment http://code.google.com/p/tesseract-ocr/wiki/TestingTesseract
> describe
> training for Tesseract 3.0 (with mistakes ;-) - I started to check it so
> soon there will be correct version). If you want to see description
> for Tesseract 2.04 look at svn repository
> http://code.google.com/p/tesseract-ocr/source/browse/wiki/TrainingTesseract.wiki?r=318.
> It is in wiki syntax but it is easy readable.
>
> BR,
>
> Zd.
>
> On Sat, Sep 4, 2010 at 5:15 AM, John Smith <4ever...@gmail.com> wrote:
>
>> Hi,
>>
>> Thank you so much for the reply.
>> I just have one more step to make, I am using Tesseract 2.04 now and I've
>> got all the files ready, I am trying to combine them all together but there
>> is no combine_tessdata for 2.04, I want to know how to combine them under
>> 2.04.
>>
>> Thank you so much!!
>>
>>
>> On Sun, Aug 29, 2010 at 8:30 PM, Jimmy O'Regan  wrote:
>>
>>> On 28 August 2010 07:45, OCR Newbie <4ever...@gmail.com> wrote:
>>>> Hi All,
>>>>
>>>> Currently I am trying to use Tesseract(2.04) to recognize my own data,
>>>> with Mac OS X Snow Leopard.
>>>> I find this
>>>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
>>>> and I am trying to follow this tutorial.
>>>> My questions are:
>>>> 1. I already have my train.tif ready, but I am not sure where I should
>>>> place the image file, (under 'tessdata' folder or can be anywhere?
>>> If you're running 'tesseract train.tif ...', it just needs to be in
>>> the current directory.
>>>
>>>> 2.About run the tesseract on my training image, it asks to run
>>>> 'tesseract train.tif train batch.nochop makebox' , I guess I should
>>>> use the terminal, but when I type this command into it, it keep saying
>>>> 'tesseract command not found', I tried to run the configure terminal
>>>> first and type 'make', but it is still not working.
>>> You also need to use 'make install', or provide a path to the
>>> executable - Unix-like systems (unlike DOS, etc.) do not include the
>>> current directory in the executable search path. (You can, of course,
>>> change that but it's A Bad Idea.)
>>>
>>> If tesseract is in /home/jim and $PWD (use 'echo $PWD') is /home/jim I
>>> could use:
>>> ./tesseract ...
>>> ('.' means 'this directory')
>>> /home/jim/tesseract
>>> (the full path)
>>> or even
>>> ../jim/tesseract
>>> ('..' means 'one level lower' - in this case, '/home')
>>> or even:
>>> $PWD/tesseract
>>>
>>> ($PWD is an environment variable, and will always be there... unless
>>> you remove it from another shell, but you probably don't need to worry
>>> about that).
>>>
>>> I think MacOS uses /User or something else, just substitute with
>>> actual values. Using 'make install' will be more convenient, though.
>>> --
>>>  jimregan, that's because deep inside you, you are evil.
>>>  Also not-so-deep inside you.
>>>
>>>
>>>
>>


Re: Training japanese for 3.0

2010-09-19 Thread Zdenko Podobný

 Hi Stane,

why doesn't it look healthy? ;-)
There is one easy way to find out whether it is correct: test it ;-)

BTW: when I searched for mistakes in the former wiki (the corrections are
now included in
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3) I
noticed that:
a) unicharset_extractor puts NULL as the script type (maybe I did
something wrong, maybe Google has not submitted the relevant code yet)
b) in unicharset.cpp there is code that works with these scripts: Latin,
Common, Greek, Cyrillic, Han, NULL
c) if you extract the unicharset files from some languages (e.g.
"combine_tessdata -e jpn.traineddata jpn.unicharset"; the Japanese
language file is from svn revision 309) you can also find other
scripts there: Hiragana and Katakana

I do not know whether the OCR result will be better if you replace NULL with
Latin, Common, Han etc. in the unicharset file. If you have time, please test
it and send the info to this forum.
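
For anyone who wants to run that experiment, the replacement could be scripted roughly as below. This is a sketch under assumptions: it relies only on the four-field layout visible in the quoted unicharset ("亜 0 NULL 0", i.e. symbol, properties, script, id), it leaves the count line and the leading NULL entry untouched, and the function name set_script is made up. Whether a given script name is accepted, or improves results, is exactly what would need testing.

```python
def set_script(unicharset_lines, script):
    """Replace a NULL script field with `script` on every symbol line."""
    updated = [unicharset_lines[0]]  # first line is the entry count
    for line in unicharset_lines[1:]:
        fields = line.split(" ")
        is_symbol_line = len(fields) == 4 and fields[0] != "NULL"
        if is_symbol_line and fields[2] == "NULL":
            fields[2] = script
        updated.append(" ".join(fields))
    return updated


if __name__ == "__main__":
    sample = ["10", "NULL 0 NULL 0", "亜 0 NULL 0"]
    for row in set_script(sample, "Han"):
        print(row)
```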

Zd.

Dňa 18.09.2010 13:14, Stane  wrote / napísal(a):
> Hi folks,
>
> I try to make my own jpn.traineddata for tesseract 3.0 and for the
> beginning with just 10 diffrent Characters/Kanjis which repeates
> theirself a few times and are seperates by a space to make sure they
> get boxed.
>
> With tesseract I create the box file, edit it with pytesseracttrainer
> to make everything nice and correct.
> Next i let run tesseract in training-mode to get a .tr file. So far so
> good and every things seems to be correct.
> But when i run the unicharset_extractor I get an unicharset which
> looks like this
> "10
> NULL 0 NULL 0
> 亜 0 NULL 0
> ..."
>
> Well this doesnt look soo healthy to me, I wonder if it is suposed to
> be like this and what did I wrong? Have I to create the unicharset for
> japanese manualy?
>
> Thanks for any help :-)
>


Re: Training japanese for 3.0

2010-09-19 Thread Zdenko Podobný


Dňa 19.09.2010 16:01, Jimmy O'Regan wrote / napísal(a):
> 2010/9/19 Zdenko Podobný :
>> Hi Stane,
>>
>> why it doesn't look healthy? ;-)
>> There is one easy way how to find if it correct or not: to test it ;-)
>>
>> BTW: when I searched for mistakes in former wiki (now corrections are
>> included in http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3)
>> I recognized that:
>> a) unicharset_extractor put NULL to type of script (maybe I did something
>> wrong, maybe google did not submit relevant code yet)
> Probably the latter. There are, for example, function prototypes for a
> whole other OCR engine (called 'Cube', IIRC), for which there's no
> matching code.
>
Do you have info when (if) they plan to submit new code?
>> b) in unicharset.cpp there is code that works with these scripts: Latin,
>> Common, Greek, Cyrillic, Han, NULL
> There are more than that. For one, Fraktur is considered a script of its own.
>
Thanks for info. I expected that everything related to script is in
unicharset.cpp. Other scripts are in osdetect.cpp (if somebody is
interested).
>> c) if you extract  unicharset files from some languages (e.g.
>> "combine_tessdata -e jpn.traineddata jpn.unicharset" - Japaneses language
>> file is from svn revision 309) you can find there also another scripts:
>> Hiragana and Katakana
>>
> Yes, those are mentioned in part of the code. What /seems/ to be there
> is an image-based script detection mechanism (the usual mechanism is
> to guess the script based on the types of mistakes) but I haven't seen
> it used.
>


Re: Help on training tesseract for new language

2010-09-28 Thread Zdenko Podobný

You mixed Tesseract versions: the "combine_tessdata" command is part of
Tesseract 3.00. Tesseract 2.04 does not use .traineddata files.

Zd.

Dn(a 28.09.2010 17:25, TesseractNoob  wrote / napísal(a):
> I need to train tesseract for a new language. So I was successfully
> created the 8 files without any issues. After that using the
> "combine_tessdata" command I generated the .traineddata file. However
> when I run the OCR engine with the "/Projects/Training/trainfile.tif
> dilshan -l tes" command it generated the following error.
>
>
> Tesseract Open Source OCR Engine with LibTiff
> Segmentation fault
>
>
> Let me know the reason for this and is there any way to ensure that I
> have followed the correct steeps. Is there any log file to look into?
> I am using tesseract-2.04.
>
> Thank you.
>


Re: Provide/visualize baseline info?

2011-02-05 Thread Zdenko Podobný

I am not sure whether it will help you, but did you try the debug mode 
(http://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging)?


Zd.


Dn(a 05.02.2011 01:33, daemon-s  wrote / napísal(a):

Hi!

I train Tess using separate images for every text line. Recognition is
also run over single text line images. Recognition performs pretty
well, however there are many errors that, I believe, are related to
misdetected baselines, during training or recognition - I don't know.
These include:

" (double quote) detected as n
S detected as s (and vice versa)
V detected as v (and vice versa)
etc.

Is there any (preferably high-level) way to provide Tess with baseline
info? Or at least obtain baseline info from Tess in order to visualize
it further for debugging?

Thanks,
Dmitry




Re: Trying to OCR a LCD Display

2011-02-25 Thread Zdenko Podobný


Can you post an example somewhere?

Zdenko

On 18.02.2011 20:29, GigaGuy wrote:

Can someone explain to me how I can train tesseract to recognize the
numbers on an LCD display?  They are in a terminal font.  I have a weather
station and a webcam.  I am taking a pic and trying to OCR it to put
into a db.  So I need to get a few specific parts of the image and OCR
them. But tesseract can't seem to recognize them.

Thanks.





Re: Tesseract and Windows 7 64 Bit

2011-02-27 Thread Zdenko Podobný



On 26.02.2011 23:24, andy_syme wrote:

Tesseract 3 doesn't seem to work on my win7 64 bit laptop and I think
I read somewhere that tesseract 3 doesn't work with 64 bit.

Please be more specific: what does "doesn't seem to work" mean? Where 
did you read that it does not work on 64-bit?


I run tesseract on 64-bit Linux without problems (versions 2.04 and 3.00). So 
the only problem could be that it has to be compiled on/for 64-bit. The current 
3.00 Windows files were compiled and tested on Windows XP SP3 (32-bit) in 
VC++ 2008. I created them, and AFAIK the only report is that it 
requires the Microsoft Visual C++ 2008 *SP1* Redistributable Package (x86) 
[1] (the older version is not sufficient).
Unfortunately I have no access to Windows 7 64-bit to test whether there are 
any issues regarding 64-bit on Windows.

So does tesseract work on 64 bit (if so what am I doing wrong)?  If
not are there any plans to port it onto 64 bit?



BR,

Zdenko

[1] 
http://www.microsoft.com/downloads/en/details.aspx?FamilyID=a5c84275-3b97-4ab7-a40d-3802b2af5fc2&displaylang=en 
 



Re: Tesserac 2.0 not working

2011-03-06 Thread Zdenko Podobný


Hi,

I am not familiar with your code/application, so just a hint: it sounds to me 
like tesseract is not able to find the English language data in 
"D:\Projects\AMCDF\Source\Frameworks\Device\AMCDF.Device.GUI\Resources\tessdata\"


AFAIK tessnet2 is based on tesseract 2.0x, so you need the 2.0x 
language data there [1]. If your language data files are located somewhere 
else, then you need to set TESSDATA_PREFIX [2]. The installer of 
tesseract-ocr 3.00 sets TESSDATA_PREFIX to its own location, so this could be 
the source of your problems.
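
To make the TESSDATA_PREFIX point concrete, here is a minimal POSIX-shell sketch (not from the thread). `expected_unicharset` is a hypothetical helper; it assumes the convention from the release notes, where TESSDATA_PREFIX names the directory in which "tessdata" resides, and that 2.0x data includes a `<lang>.unicharset` file.

```shell
# Hypothetical helper: print the unicharset path implied by TESSDATA_PREFIX.
# Assumption: TESSDATA_PREFIX is the directory that contains "tessdata/".
expected_unicharset() {
    lang=$1
    if [ -z "${TESSDATA_PREFIX:-}" ]; then
        echo "TESSDATA_PREFIX is not set" >&2
        return 1
    fi
    # Accept the prefix with or without a trailing slash.
    echo "${TESSDATA_PREFIX%/}/tessdata/$lang.unicharset"
}
```

Calling `expected_unicharset eng` and checking whether the printed file exists shows quickly whether the variable points at 2.0x data or at a 3.00 install. On Windows the variable itself can be inspected with `echo %TESSDATA_PREFIX%`.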


Best regards,

Zdenko

On 04.03.2011 07:15, Pieter wrote:

Hi, I added the code to my project from this site:
http://www.pixel-technology.com/freeware/tessnet2/
It worked well until I installed the Tesseract 3.0 Windows executable.
Now when I run my application it shuts down when it hits this line of
code:
  ocr.Init(@"D:\Projects\AMCDF\Source\Frameworks\Device\AMCDF.Device.GUI\Resources\tessdata\", "eng", false);

This used to work. Any help?
I uninstalled all references to 3.0 from regedit and still no luck.




Re: can't read frequent_words_list file

2011-03-06 Thread Zdenko Podobný


Hello,

the message is clear: it cannot find the file "frequent_words_list". Does 
it exist? Can you send a listing of the directory where you run 'wordlist2dawg'?
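
A small pre-flight check makes this kind of problem visible before the tool runs. The sketch below is only an illustration: `check_wordlists` is a hypothetical helper, and the file names are the ones used in this thread.

```shell
# Hypothetical helper: verify that every word list exists and is readable
# in the current directory before it is handed to wordlist2dawg.
check_wordlists() {
    status=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable: $f"
            status=1
        fi
    done
    return $status
}

# Usage (run in the directory that should hold the lists):
#   check_wordlists frequent_words_list words_list &&
#       wordlist2dawg frequent_words_list freq-dawg
```

If the check fails, `ls -l` in that directory shows what is really there, for example a file saved with an extra extension.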


Zdenko

On 06.03.2011 14:02, Sang Đặng Minh wrote:

I am using tesseract 2.0.4. I created 2 UTF-8 text files and then
used wordlist2dawg to make the DAWG files:
-wordlist2dawg frequent_words_list freq-dawg
-wordlist2dawg words_list word-dawg
The error message is: Could not open file: frequent_words_list.
Please help me!
Thank you!

On Mar 4, 8:06 am, zdenko podobny  wrote:

Please provide more information: how you tried to create the dictionary, your
platform, and the exact version of Tesseract (and maybe how you got it).

Zdenko

On Fri, Mar 4, 2011 at 2:50 PM, Sang Đặng Minh wrote:

Hi all. My name is Sang. I'm trying to train Tesseract 2.0. Everything
is OK, but I can't create the DAWG files; the error is: Could not open
file frequent_words_list.
Please help me!
Thanks a lot!

Re: tesseract-3.01 compiling issue on linux

2011-04-07 Thread Zdenko Podobný


Did you run "./runautoconf ; ./configure"  before running make?

I have no problem compiling revision 581 on Linux.

Zdenko

On 07.04.2011 01:56, zl2k wrote:

hi, all,

I just checked out tesseract-3.01(r581) from svn but got the following
compiling error on linux box

colfind.cpp:449: error: 'boxaGetCount' was not declared in this scope
colfind.cpp:451: error: 'l_int32' was not declared in this scope
colfind.cpp:451: error: expected ';' before 'x'
colfind.cpp:452: error: 'x' was not declared in this scope
colfind.cpp:452: error: 'y' was not declared in this scope
colfind.cpp:452: error: 'width' was not declared in this scope
colfind.cpp:452: error: 'height' was not declared in this scope
colfind.cpp:452: error: 'boxaGetBoxGeometry' was not declared in this scope
colfind.cpp:453: error: 'L_CLONE' was not declared in this scope
colfind.cpp:453: error: 'pixaGetPix' was not declared in this scope
colfind.cpp:456: error: 'pixGetWidth' was not declared in this scope
colfind.cpp:494: error: 'pixDestroy' was not declared in this scope

Does anyone have a compilable version, or is there any workaround?
Thanks for help.

zl2k




Re: How to improve an existing language?

2011-04-13 Thread Zdenko Podobný


Hi,

deu-frak should be a community language pack :-) prepared by the piggy 
(see [1] or [2]), so further improvements should be possible.


At the moment I cannot find fraktur.tgz (in the Google Group files - maybe 
it was removed), but there were other people interested in 
improving it (see [3]). So if somebody has a copy, they should start a project 
for further improvements ;-)


Also there are other people working on "-frak" versions ;-)

[1] 
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/d68bef484614955d

[2] http://code.google.com/p/tesseract-ocr/issues/detail?id=62
[3] 
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/75b28445f188083f/5f6ff7eab504e504?hl=en%05f6ff7eab504e504


Zdenko

On 12.04.2011 22:09, stinguin wrote:

Hi list,

I'm new to tesseract and hope that one of you can help me. I want
to OCR some German texts which are typeset in Fraktur. The results
from using the existing language "deu-frak" are good, but not good
enough. Is it possible to improve this language by training? If so,
can someone explain that step by step?
I only know how to create a new language. Do you think I can improve
the results by creating my own one? I think the deu-frak language is
more than just a few box files, isn't it?

Thanks in advance




Re: How to improve an existing language?

2011-04-13 Thread Zdenko Podobný



On 13.04.2011 19:32, Jimmy O'Regan wrote:

2011/4/13 Zdenko Podobný:

Hi,

deu-frak should be a community language pack :-) prepared by the piggy (see
[1] or [2]), so further improvements should be possible.

At the moment I cannot find fraktur.tgz (in the Google Group files - maybe it
was removed), but there were other people interested in improving it (see
[3]). So if somebody has a copy, they should start a project for further
improvements ;-)

Is this what you're looking for?
http://code.google.com/p/tesseract-ocr/downloads/detail?name=boxtiff-2.01.deu-f.tar.gz

Yes, thanks! As far as I can tell, it is part of Peter's project 
https://github.com/paalberti/tesseract-dan-fraktur/tree/master/deu-frak


Zdenko


Re: boxtiff file for arabic

2011-04-21 Thread Zdenko Podobný


There were several such requests for tesseract 3.00 but without result ;-)
In the past somebody tried to train tesseract based on the released box+tiff 
files ([1], boxtiff-2.01-*.tar.gz [2]) and he got a different result than 
the published one ;-)


So I think it does not make sense to wait for box+tiff files... I do not 
think Google will release them (fully). This is an area where the community 
should be active.


[1] 
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Tif/Box_pairs_provided!

[2] http://code.google.com/p/tesseract-ocr/downloads/list

Zdenko

On 21.04.2011 08:12, [sender name missing] wrote:

Ok, sorry for inconvenience... I read that.
However, when I run tesseract with "-l ara", it can do something and
there is a ara.traineddata file.
So there must be some box files, tiff files, which I am asking
about...

Thanks...




Re: Catalan language

2011-05-15 Thread Zdenko Podobný



On 11.05.2011 23:18, jinglada wrote:


On May 11, 9:53 pm, zdenko podobny  wrote:

On Wed, May 11, 2011 at 9:22 PM, jinglada  wrote:

In the /usr/share/tesseract-ocr/tessdata I have the following files:
cat.DangAmbigs  spa.DangAmbigs  eng.DangAmbigs  fra.DangAmbigs  por.DangAmbigs
cat.freq-dawg   spa.freq-dawg   eng.freq-dawg   fra.freq-dawg   por.freq-dawg
cat.inttemp     spa.inttemp     eng.inttemp     fra.inttemp     por.inttemp
cat.normproto   spa.normproto   eng.normproto   fra.normproto   por.normproto
cat.pffmtable   spa.pffmtable   eng.pffmtable   fra.pffmtable   por.pffmtable
cat.unicharset  spa.unicharset  eng.unicharset  fra.unicharset  por.unicharset
cat.user-words  spa.user-words  eng.user-words  fra.user-words  por.user-words
cat.word-dawg   spa.word-dawg   eng.word-dawg   fra.word-dawg   por.word-dawg
but the program only shows Portuguese, Spanish, French, English

Which program? Which version are you trying to use? Where does it show this?

My system is Ubuntu 9.10 - the Karmic Koala

In the Applications/Graphics menu the name is 'Tesseract-GUI' and the
command is 'python2.5 /home/joan/tesseract-gui-2.1/tesseract-gui.py'
and where you can select the OCR language, only these options
appear: Portuguese, Spanish, French, English

Well, you should ask the tesseract-gui author ;-) 
(http://tesseract-gui.sourceforge.net/)
Anyway, it looks like you use tesseract 2.0x. Maybe you should look for other 
frontends that also support tesseract 3.00 (vietocr, lector, PDF OCR X)...

Zdenko


What I have to do to activate Catalan (cat.) language?

tesseract does not need a language to be activated


Thanks in advance.

Re: Training procedure

2011-06-21 Thread Zdenko Podobný

As far as I know tesseract is developed (or at least tested) on Ubuntu 
:-). The Windows version is a port ;-)


BTW: this is a stupid bug/feature: you can fix it by renaming the file 
'spa.cour.g4.tr' to 'spa.cour.exp4.tr'. See the comment in the source code [1]. 
This worked for tesseract 3.01 (revision ) on Mandriva Linux 64-bit (I do 
not use 3.00 on Linux anymore).
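
For more than one file, the rename can be scripted. This is a minimal sketch, not something from the thread: `rename_tr` is a hypothetical helper that assumes the files follow the pattern `<lang>.<font>.g<N>.tr` and should become `<lang>.<font>.exp<N>.tr`.

```shell
# Hypothetical helper: rename <lang>.<font>.g<N>.tr to <lang>.<font>.exp<N>.tr.
# Files that do not match the ".g<digit>...tr" pattern are left untouched.
rename_tr() {
    for f in "$@"; do
        case "$f" in
            *.g[0-9]*.tr)
                base=${f%.tr}      # drop the ".tr" suffix
                num=${base##*.g}   # page number after the last ".g"
                stem=${base%.g*}   # everything before the last ".g"
                mv -- "$f" "$stem.exp$num.tr"
                ;;
        esac
    done
}
```

Running `rename_tr *.tr` in the training directory turns 'spa.cour.g4.tr' into 'spa.cour.exp4.tr' and leaves files that already use the "exp" form alone.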


[1] 
http://code.google.com/p/tesseract-ocr/source/browse/trunk/training/mftraining.cpp#266


Zdenko

On 21.06.2011 16:04, Esteban Bordón wrote:

2011/6/21 zdenko podobny


PS: it worked on windows XP with tesseract 3.00



It's true. I've tested on Win XP and it worked.

Was Tesseract tested on Linux-based operating systems?

regards,
Esteban.




Re: Training procedure

2011-06-21 Thread Zdenko Podobný

I know there is one bug in 3.00 (already fixed in svn for version 3.01) 
that "works" on Linux but not on Windows [1]. A patch is included in that 
issue if needed, along with an explanation of why it causes a problem on 
Linux/Mac and not on Windows.


If possible I suggest using a recent revision of the source code (589), a.k.a. 
3.01 :-)


[1] 
http://code.google.com/p/tesseract-ocr/issues/detail?id=385&q=scientific 



Zdenko

On 21.06.2011 17:54, Dmitri Silaev wrote:

Curious. It's not the first time I see platform-related discrepancies
in Tesseract's results. Nice to find out the root of it... Don't have
time to conduct a full-blown research, though. Anybody knows anything?

Warm regards,
Dmitri Silaev
www.CustomOCR.com





On Tue, Jun 21, 2011 at 10:04 AM, Esteban Bordón  wrote:

2011/6/21 zdenko podobny

PS: it worked on windows XP with tesseract 3.00

It's true. I've tested on Win XP and it worked.

Was Tesseract tested on Linux-based operating systems?

regards,
Esteban.




Re: Can't make box files

2011-06-21 Thread Zdenko Podobný


Hi

I ran:
/usr/local/bin/tesseract tesseract-ocr/eurotext.tif \
  tesseract-ocr/eurotext tesseract-ocr/tessdata/tessconfigs/batch.nochop \
  tesseract-ocr/tessdata/configs/makebox


and it created eurotext.box in tesseract-ocr/ (the same directory 
where the image is located).


If you have trouble finding out on Linux which files a program opens 
(and where it looks for them), you can use the tool 'strace'. E.g.
strace ./tesseract ~/fontname/eng.FONTNAME.exp0.tif \
  ~/fontname/eng.FONTNAME.exp0 ../tessdata/tessconfigs/batch.nochop \
  ../tessdata/configs/makebox 2>&1 | grep open
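
When the strace output is long, it can help to keep only the open calls that failed because the file does not exist. The filter below is a sketch added for illustration; `failed_opens` is a hypothetical name, and it assumes the usual strace line format in which a failed call ends with "= -1 ENOENT (...)".

```shell
# Hypothetical filter: keep only open()/openat() calls that failed with ENOENT.
# Reads strace output on stdin.
failed_opens() {
    grep -E 'open(at)?\(.*ENOENT'
}

# Usage (same command as above, different filter):
#   strace ./tesseract ... 2>&1 | failed_opens
```

The paths that remain are the places where the program looked for a file and did not find it.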


Zdenko

On 13.06.2011 07:07, Mike Reed wrote:

Ok, I managed to get past this error. It looks like I had a couple
things wrong. I didn't have libtiff in place, and my tiff wasn't
actually uncompressed, in spite of what my scanner software tells me.

I got the command line to run, but I can't find the .box file
anywhere. Where do they go?

Thanks,
Mike

On Jun 11, 2:21 pm, Mike Reed  wrote:

I am running on Ubuntu 11.04 and am getting an error when I try to
make a box file for a new font:
actual_tessdata_num_entries<= TESSDATA_NUM_ENTRIES:Error:Assert failed: in file tessdatamanager.cpp, line 55

This is the same error when I give it garbage for the .tif file name.
Here's my command line:

./tesseract ~/fontname/eng.FONTNAME.exp0.tif \
  ~/fontname/eng.FONTNAME.exp0 ../tessdata/tessconfigs/batch.nochop \
  ../tessdata/configs/makebox

This is run from the tesseract-3.00/api directory.

I have successfully run make and make install on liblept and
tesseract. The .tif file is uncompressed, and the .exp0 file is UTF-8.

Any help to make a box file is greatly appreciated. I'm wondering if
there's a make install problem somehow that is causing me to have to
path out tesseract, nochop, and makebox -- and if that might be
causing other problems.

-Mike



Re: read other languages by tesseract on c #

2011-10-07 Thread Zdenko Podobný


If the error message is:
"Unable to load unicharset file C:\Program Files\Tesseract-OCR\ita.unicharset"


then it means that your program expects the language files (ita.*) in the 
directory "C:\Program Files\Tesseract-OCR\" and not in "...\tessdata".


Zdenko

On 07.10.2011 18:20, Alessandro Latella wrote:

The error is "Unable to load unicharset file
C:\Program Files\Tesseract-OCR\ita.unicharset", but all the "ita." files
are in the directory ...\tessdata.
Yes, the image works correctly with tesseract.exe

On 5 Oct, 04:27, Quan Nguyen  wrote:

What's the error exactly? Does the image work with tesseract.exe?

On Oct 4, 5:02 am, Alessandro Latella  wrote:




Hi guys, I'm trying to run tesseract on c #.
The program works well on English language  'ocr.Init(@"C:\Program
Files\Tesseract-OCR\tessdata", "eng", false);'
If I try to change the language from "eng" to "ita", the program
generates an error and does not work.
I use the library tessnet2.dll .
Thanks,
Alessandro.



Re: Stopping Tesseract title when using command line tool

2011-12-01 Thread Zdenko Podobný


Simple: no
Complicated:  redirect standard output (stdout)

BTW: it is displayed when tesseract is started, not before the 
results are put to file (the results themselves are not displayed).
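
The redirect can be wrapped in a small function. This is a sketch for illustration only: `quiet` is a hypothetical helper, and it silences both stdout and stderr because it is an assumption here (not confirmed in the thread) that some builds print the banner on stderr rather than stdout.

```shell
# Hypothetical helper: run a command with both output streams discarded.
# The exit status of the wrapped command is passed through unchanged.
quiet() {
    "$@" >/dev/null 2>&1
}

# Usage (file names are placeholders):
#   quiet tesseract image.tif result && cat result.txt
```

Since the recognized text goes to the output file and not to the terminal, nothing useful is lost by discarding the streams, but real error messages are hidden too, so check the exit status.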


If you do not like this behavior you can modify the source and create your 
own version. Or you can use the tesseract library (no problem on Linux; for 
Windows have a look at the Developers forum [1]).


[1] http://groups.google.com/group/tesseract-dev

Zdenko

On 01.12.2011 18:48, Bigglesuk wrote:

Hello, I am using the Tesseract executable on the command line.  Is
there a way I can stop the output "Tesseract Open Source OCR Engine
v3.01 with Leptonica" being displayed before the result.

Many thanks




Re: Tesseract Dll For Visual Basic Express 2008

2011-12-30 Thread Zdenko Podobný


http://groups.google.com/group/tesseract-dev/browse_thread/thread/75be5c97eb4d1b3c

Tom expects he can finish it (add some docs + packaging) after the beginning 
of the year. But the DLL seems to work :-) At least nobody has reported 
otherwise.


Zdenko

On 28.12.2011 15:02, Lahiru Himash Madusanka wrote:

Hi,

Does anyone have a Tesseract DLL that works with MS Visual Basic Express
2008? Please give me a copy. This is an emergency.

Thank you.




Re: Version 3.02 in alpha

2012-02-04 Thread Zdenko Podobný

It seems you are not able to compile any C++ program from source on Linux. 
Teaching how to compile from source is out of tesseract's scope.

You should first read a manual on how to compile programs from source.

Zdenko

On 04.02.2012 08:31, Sriranga(78yrsold) wrote:

Zdenko,
Thanks for the valuable guidance. In fact I had followed
http://code.google.com/p/tesseract-ocr/wiki/TesseractSvnInstallation, which
led to confusion. Now I followed your guidance and
downloaded all required items as per the readme
http://code.google.com/p/tesseract-ocr/wiki/ReadMe. I tried to install on
Ubuntu 11.10 but failed; see the attached typescript or untitled document.
Kindly let me know where I made a mistake.

  I will test in WinXP now.

With warmest Regards,
-Sriranga(79yrs)

On Fri, Feb 3, 2012 at 10:40 PM, zdenko podobny  wrote:



On Fri, Feb 3, 2012 at 5:29 PM, Sriranga(78yrsold)<
withblessi...@gmail.com>  wrote:


zdenko,
Tried in ubuntu 11.10 - failed to install even after following the
guidelines in wiki.


No, you did not follow guidelines in wiki [1]. Try to read it first ;-)

[1] http://code.google.com/p/tesseract-ocr/wiki/ReadMe#Linux



In this connection attached typescript for your perusal and valuable
guidance. Where i made mistake may kindly be intimated to me.
With Warmest Regards,
-sriranga(79yrs)


On Fri, Feb 3, 2012 at 6:14 PM, Sriranga(78yrsold)<
withblessi...@gmail.com>  wrote:


Zdenko,
Thanks for the information. I don't have VS2008 in Linux but in
winXP(sp3) :-). Actually i downloaded from svn into ubuntu 11.10 and then
copied to winxp. Since there was file tesseract.sln in the folder "VS2008",
as such I tried- only 24 succeeded.  Now I shall wait for patches for
VS2008 are uploaded.
With Warmest Regards,
-sriranga(79yrs)


On Fri, Feb 3, 2012 at 6:03 PM, zdenko podobny  wrote:


Do you have VS2008 for linux ;-) (as Ray wrote
"currently Linux-only") ?

PS: I work on patches for VS2008, but there are some problems... I need
to made some additional tests...

Zdenko


On Fri, Feb 3, 2012 at 1:06 PM, Sriranga(78yrsold)<
withblessi...@gmail.com>  wrote:


I tried to generate the exe files using VS2008 but failed. Where will the exe
files be stored? In the bin, bin.dbg or training folder?

On Fri, Feb 3, 2012 at 4:54 PM, Wil Hadden wrote:


Hi Ray,

Any idea of timescales when there will be a 3.02 package on the
downloads page of googlecode?

Or are there any release notes between 3.01 and 3.02, I'm, just a bit
wary of being bleeding edge :)

Wil

On Feb 2, 6:55 pm, Ray Smith  wrote:

Tesseract 3.02 is now available in svn for preliminary testing, currently
Linux-only.

There are now 65 languages and some big improvements in layout analysis and
character accuracy.
This version will with luck make it into Ubuntu LTS Precise Pangolin, so
please test to see if your favorite issue is resolved.

Thanks and enjoy!

Ray.


Re: Tess 3.02 English training set broken?

2012-02-05 Thread Zdenko Podobný

Can you please provide more details (OS, compiler, how you run/use 
tesseract)?


Zdenko

On 05.02.2012 15:38, patrickq wrote:

I am running the latest Tess 3.02 with the new English training set
and get the following crash at init with lang:

actual_tessdata_num_entries_<= TESSDATA_NUM_ENTRIES:Error:Assert
failed:in file tessdatamanager.cpp, line 48

Has anyone seen this?

Note: I am not using the cube version, just "eng" with eng.traineddata

By the way: I noticed the new training set is 21.9MB versus 3.1MB for
Tesseract 3.01: just more fonts added or something else too?

Thanks,
Patrick




Re: Decreased accuracy after training for specific characters

2012-02-12 Thread Zdenko Podobný


Hi Chris,

I have the same experience - that leads me to the conclusion that it does not 
make sense to train "common" fonts...
I think Google uses a different process (more detailed; more/other tools?) 
compared to the information available on the wiki... IMHO the situation is 
improving with each release, so I will wait for additional information 
regarding 3.02 training.


On the other hand there is room for the community to train "non-standard" fonts 
(e.g. in my case Fraktur). I planned to write a blog post about my experience 
when I helped the Slovak version of Project Gutenberg, but there is 
always something more urgent... ;-)


Zdenko

On 11.02.2012 14:47, Chris wrote:

I also tried training with all the data. I seem to have the same
problem with accuracy being much less than what you get with the
default one.

One thing that looks a bit off is my unicharset file contains lots of
NULLS and contents doesn't seem to match the documentation on doing
training:

108
NULL 0 NULL 0
t 3 0,255,0,255 NULL 41 # t [74 ]a
h 3 0,255,0,255 NULL 81 # h [68 ]a
a 3 0,255,0,255 NULL 57 # a [61 ]a
n 3 0,255,0,255 NULL 14 # n [6e ]a
P 5 0,255,0,255 NULL 30 # P [50 ]A
o 3 0,255,0,255 NULL 25 # o [6f ]a
e 3 0,255,0,255 NULL 58 # e [65 ]a
: 10 0,255,0,255 NULL 8 # : [3a ]p
r 3 0,255,0,255 NULL 52 # r [72 ]a
etc...

Also when combining the files I get this output:

Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 3961
Offset for type 4 is 701702
Offset for type 5 is 702267
Offset for type 6 is -1
Offset for type 7 is 716918
Offset for type 8 is -1
Offset for type 9 is 717216
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1

So I obviously don't have all the necessary files. Would this affect
accuracy when recognising single characters?
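
To list the absent components without reading the output line by line, a small filter can be used. This sketch is an illustration and rests on an assumption not confirmed in the thread: that an offset of -1 marks a component that was not combined into the file. `missing_types` is a hypothetical name.

```shell
# Hypothetical filter: print the type numbers whose offset is -1.
# Assumption: -1 means that component was not found when combining.
# Reads the combine_tessdata output on stdin.
missing_types() {
    awk '/^Offset for type [0-9]+ is -1$/ { print $4 }'
}

# Usage: save the output shown above to a file (name is a placeholder):
#   missing_types < combine_output.txt
```

For the output quoted above this prints types 0, 2, 6, 8, 10, 11 and 12. Which component each number stands for has to be looked up in the tesseract source for the version in use.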


On Feb 11, 10:17 am, Chris  wrote:

Hi All,

I'm using tesseract quite successfully in my code. I have a
preprocessing step that locates the characters I need to recognise and
then I feed them into tesseract using the PSM_SINGLE_CHAR mode.

This works great with the default eng.traineddata

I'm also constraining the tessedit_char_whitelist to just have numbers
and upper case letters as that is the only thing I have in my
character set.

I want to reduce the size of my app and the traineddata is by far the
largest chunk of data at the moment.

What I've tried to do is retrain tesseract so that it only has the
characters I need in the training data. I've done this successfully,
but when I use my newly created eng.traineddata the accuracy is much
worse than if I use the default eng.traineddata.

Any ideas why this should be? I thought if anything that accuracy
would improve if I'd removed all the unnecessary characters from the
data.

I'm doing my training by taking the box files and stripping out all
the characters I don't need and then running through the training
instructions.

I'm using tesseract3.01

Any thoughts?

Cheers
Chris.


--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to
tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

Re: Error during using tesseract-ocr

2012-03-06 Thread Zdenko Podobný

This is a Leptonica error message that indicates a problem with image support.
If you have the version from today, please send the output of:
tesseract -v

Zdenko

Dňa 06.03.2012 15:42, Ivan Mushketik wrote / napísal(a):
> Hello.
>
> I tried to run tesseract-ocr v 3.02 with the following params:
> tesseract phototest.tif output
>
> but received:
> Tesseract Open Source OCR Engine v3.02 with Leptonica
> Error in findTiffCompression: function not present
> Error in pixReadStreamTiff: function not present
> Error in pixReadStream: tiff: no pix returned
> Error in pixRead: pix not read
> Unsupported image type.
>
> I also tried to use png file, but received similar output.
>
> Leptonica and libtiff are libraries installed.
> How can I fix this problem?
>
> I am using Ubuntu 11.10
>
> Thank you in advance.
>


Re: .net 3 Confidence is always 0

2012-03-16 Thread Zdenko Podobný

Dňa 14.03.2012 19:49, Curtis wrote / napísal(a):
> I am using the vs 3 .net wrapper.
> When I run the function Recognize it ocrs the image fine and I can get
> the string.
> I need the confidence level of each character, but it is always 0.
> What am I doing wrong?
>
>
>
> Dim image As New Bitmap("C:\MyImage.tif")
> Dim ocr As New TesseractProcessor
>
> ocr.Init(Nothing, "eng", False)
> Console.WriteLine(ocr.Recognize(image))
>
>
> ocr.InitForAnalysePage()
> ocr.SetVariable("tessedit_thresholding_method", "1")
> ocr.SetVariable("save_best_choices", "T")
>
>
> Dim doc As DocumentLayout = ocr.AnalyseLayout(image)
> For Each blk As OCR.TesseractWrapper.Block In doc.Blocks
> Console.WriteLine("Block Confidence: " & blk.Confidence)
>
>
> For Each para As Paragraph In blk.Paragraphs
> Console.WriteLine("para Confidence: " &
> para.Confidence)
>
> For Each ln As TextLine In para.Lines
> Console.WriteLine("ln Confidence: " &
> ln.Confidence)
>
> For Each wrd As Word In ln.Words
> Console.WriteLine("wrd Confidence: " &
> wrd.Confidence)
> Console.WriteLine("wrd Text: " & wrd.Text)
>
> For Each ch As Character In wrd.CharList
> Console.WriteLine("V:" & ch.Value)
> Console.WriteLine("C:" & ch.Confidence)
> Next
>
> Next
>
> Next
> Next
> Next
>
Hi,

I am not familiar with .net, so I cannot help you directly.

It looks like the .net wrapper has not been updated for quite a long time
(revision 590, without the 3.01 code)...
Anyway, if somebody is interested in character confidence, he can try to use
(in C++) GetComponentImages with tesseract::RIL_SYMBOL together with
SetPageSegMode and tesseract::PSM_SINGLE_CHAR. A simple test file is attached.
Tested with the 3.02 (svn) code.

Zdenko

/*
  compile:
  $ g++ test_confidence.cpp -I/usr/local/include/tesseract/ -I/usr/include/leptonica/ \
-ltesseract -llept -o test_confidence
  run:
  $ ./test_confidence
*/

#include <cstdio>
#include <cstdlib>
#include <cstring>

#include <baseapi.h>     // tesseract::TessBaseAPI (found via -I.../tesseract/)
#include <allheaders.h>  // leptonica Pix, Box, Boxa (found via -I.../leptonica/)

int main() {
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  if (api->Init(NULL, "eng")) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
  }

  Pix *image = pixRead("/usr/src/tesseract-3.02/phototest.tif");
  if (image == NULL) {
    fprintf(stderr, "Could not read the image.\n");
    exit(1);
  }
  api->SetImage(image);

  // split image to symbols
  Boxa *boxes = api->GetComponentImages(tesseract::RIL_SYMBOL, true,
                                        NULL, NULL);
  api->SetPageSegMode(tesseract::PSM_SINGLE_CHAR);

  l_int32 nsymbols = boxaGetCount(boxes);
  printf("Boxa count: %d\n", nsymbols);
  for (l_int32 i = 0; i < nsymbols; i++) {
    BOX *box = boxaGetBox(boxes, i, L_CLONE);
    // recognize only the rectangle of the current symbol
    api->SetRectangle(box->x, box->y, box->w, box->h);
    char *outText = api->GetUTF8Text();
    // remove "\n" from outText
    outText[strcspn(outText, "\n")] = '\0';
    int conf = api->MeanTextConf();
    printf("Box[%d]: x=%d, y=%d, string='%s', confidence: %d\n",
           i, box->x, box->y, outText, conf);
    delete [] outText;  // free the text of every symbol, not only the last one
    boxDestroy(&box);   // release the clone returned by boxaGetBox
  }

  boxaDestroy(&boxes);
  api->Clear();
  api->End();
  delete api;
  pixDestroy(&image);
  return 0;
}

Re: Version 3.02 in alpha

2012-03-18 Thread Zdenko Podobný

Hi,

you did not give all the details, so I need to guess some of them:

1. I guess that you ran something like this:
  $  tesseract binarized.jpg content -l deu
but you created the box file with the command
  $  tesseract binarized.jpg binarized makebox
If so, then the difference is in the language file used: the second command
has no -l option, so it uses the default language (eng) instead of deu.

2. I tried to run OCR with the eng and then with the deu language file. With
eng the URL was ok (see binarized-eng), but some German words were not
correct. It looks like the "problem" is in the German language file
(dictionary?) and not in the tesseract library. This is just a quick opinion,
so maybe I am wrong. As a workaround you can combine the English and German
files in tesseract 3.02 (see the result in binarized-eng_deu.txt):
  $  tesseract binarized.jpg binarized-eng_deu -l eng+deu

Zdenko

Dňa 17.03.2012 21:40, Renard Wellnitz  wrote / napísal(a):
> Hi all,
>
> first of all i would like to express my heartfelt thanks for this great 
> piece of software which tesseract is. :-)
>
> Right now i am currently making an OCR Android App  with tesseract and the 
> results i got so far are very good.
>
> But i encountered a strange issue with tesseract 3.01 and also 3.02.
> When running tesseract on the supplied file, tesseract fails to correctly 
> recognize some characters. Especially in line 8 it gives "wwwxegio-bahnde" 
> instead of "www.regio-bahn.de"
> I then ran the makebox command to see what was going on. To my surprise if 
> found that the boxes and characters where all 100% correct!
> I guess there is no easy fix or config value that i can experiment with?
>
> Cheers
> Renard
>
>
> Am Donnerstag, 2. Februar 2012 19:55:57 UTC+1 schrieb Ray Smith:
>> Tesseract 3.02 is now available in svn for preliminary testing, currently 
>> Linux-only.
>>
>> There are now 65 languages and some big improvements in layout analysis 
>> and character accuracy.
>> This version will with luck make it into Ubunto LTS Precise Pangolin, so 
>> please test to see if your favorite issue is resolved.
>>
>> Thanks and enjoy!
>>
>> Ray.
>>


Re: tesseract under windows and paths

2012-03-22 Thread Zdenko Podobný

Hi Simon,

I implemented "--disable-tessdata-prefix" for configure in revision 708.
That means if you build tesseract with this option, TESSDATA_PREFIX is
not set during the build process to the installation directory (usually
/usr/share or /usr/local/share on linux).

I tested it in mingw+msys on Windows XP (more tests are needed from mingw
users/developers ;-)). When I run tesseract (/usr/bin/tesseract), it
expects to find the "language data"/"tessdata directory" in the directory
where the tesseract executable is placed (in my case:
/usr/bin/tessdata/eng.traineddata).
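
The lookup order discussed in this thread (environment variable first, then
the value defined at compile time, then the directory of the executable) can
be sketched roughly as below. This is only an illustration and not the code
from mainblk.cpp; the function name ResolveTessdataPrefix and its three
parameters are made up for the example.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not Tesseract code: picks the tessdata prefix using
// the order described in this thread.
//   env_value      - value of the TESSDATA_PREFIX environment variable,
//                    "" if it is not set
//   compiled_value - value defined at build time, "" if tesseract was built
//                    with --disable-tessdata-prefix
//   exe_dir        - directory that contains the tesseract executable
std::string ResolveTessdataPrefix(const std::string& env_value,
                                  const std::string& compiled_value,
                                  const std::string& exe_dir) {
  if (!env_value.empty()) return env_value;            // 1. environment wins
  if (!compiled_value.empty()) return compiled_value;  // 2. build-time define
  return exe_dir;                                      // 3. next to the executable
}
```

With both the environment variable and the build-time value empty,
ResolveTessdataPrefix("", "", "/usr/bin/") gives "/usr/bin/", which matches
the /usr/bin/tessdata/eng.traineddata case above.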

Zdenko

Dňa 23.02.2012 13:07, simon.eigeldin...@vol.at wrote / napísal(a):
> hi zdenko,
>
> thanks. i found my problem. i had a variable from a setup program
> which used tesseract and it had the variable set wrong.
> removed it now and it works well now.
>
> now about compiling tesseract:
> when i specify a path to the tessdata dir during compiling can i tell
> it to use a relative path to the program executable for example
> --tessdataprefix=tessdata
> I guess then it might look in the subdir of the executable and it
> should work?
>
> greetings,
> simon
>
> On Thu, 23 Feb 2012 11:56:28 +0100
> zdenko podobny  wrote:
>> simon,
>>
>> you got the point - if the environment variable TESSDATA_PREFIX is set,
>> then it overrules the other rules (for the tesseract executable)! If the
>> environment variable is not set, then it checks whether TESSDATA_PREFIX was
>> defined during compilation (this should be true for platforms that use
>> autotools, e.g. cygwin). If TESSDATA_PREFIX was not defined and there is
>> no environment variable TESSDATA_PREFIX, then the path of the
>> executable/library is considered as TESSDATA_PREFIX. See [1].
>>
>> If you need a portable version (in terms of how you present it), just download
>> tesseract-ocr-3.01-win32-portable.zip
>>
>> that works exactly as you described (anyway, the TESSDATA_PREFIX environment
>> variable overrules everything). It is a static build.
>>
>> Zdenko
>>
>> [1]
>> http://code.google.com/p/tesseract-ocr/source/browse/trunk/ccutil/mainblk.cpp#56
>>
>>
>>
>> On Thu, Feb 23, 2012 at 9:03 AM,  wrote:
>>
>>> hi zdenko,
>>>
>>> here on a german windows its:
>>> C:\Programme\Tesseract-OCR\
>>>
>>> on a english windows it would be:
>>> C:\Program files\Tesseract-OCR\
>>>
>>>
>>> but i would recommend getting the path of the executable and going into
>>> the tessdata dir which makes it easier across windows systems and usb
>>> sticks and what not i guess.
>>>
>>>
>>> the program files dir is saved in the variable %programfiles% on
>>> windows
>>> which is autonmatically made available by the OS.
>>> But i wouldn't use that method cause of above reasons with USB
>>> sticks or
>>> different installations.
>>>
>>> greetings,
>>> simon
>>>
>>>
>>> On Thu, 23 Feb 2012 08:28:49 +0100
>>> zdenko podobny  wrote:
>>>
>>>> can you send the result of:
>>>> echo %TESSDATA_PREFIX%
>>>>
>>>> Zd.
>>>>
>>>> On Thu, Feb 23, 2012 at 7:59 AM,  wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> i successfully compiled tesseract svn r 679 under windows using cygwin and
>>>>> figured out that tesseract looks in the following directory for
>>>>> .traineddata files: %programfilesdir%\tesseract-ocr\tessdata.
>>>>>
>>>>> I would point that path to the working dir of the executable and then in
>>>>> the tessdata subdir. cause then it would be possible to copy tesseract for
>>>>> example on a USB stick and use it from there or copy it to a different
>>>>> directory without to change variables or other things.
>>>>>
>>>>> greetings,
>>>>> Simon
>>>
>>> -- 
>>> Simon Eigeldinger
>>> simon.eigeldin...@vol.at
>>>
>>>

Re: Enchancing "half-toned" image for tesseract processing

2012-03-31 Thread Zdenko Podobný


Dňa 31.03.2012 15:59, klo wrote / napísal(a):
>
> I have a scanned PDF material to which I want to add hidden text layer, so 
> I could index the document. I used ghostscript black and white tiff output 
> device (tiffg4) to extract pages as tiff images, and here is example of 
> what they look like:
>
> 
>
> Processing this image with tesseract, does not give good results.
> Changing ghostscript output DPI (600, 300, 150, 96) shows that image at 96 
> DPI gives best result from tesseract but it's still not satisfactory.
>
> I then used 8-bit gray tiff output from ghostscript, instead 1-bit black 
> and white, and in this case at 150 DPI I got even better result then 
> previously with 96 DPI black and white. However still not there yet.
>
> Can someone suggest which filter could enhance this image so that I get 
> better results? I could use imagemagick, but also can use general imaging 
> filter from program language, so just name it if you know how.
>
>
> TIA
>
It is difficult to suggest the best strategy if you do not provide the
input (pdf) and the exact command you use for the conversion. There are
several ways/tools to convert a pdf to an image [1],[2]...

[1]
http://virtualvoid.posterous.com/pdf-to-image-conversion-comparing-pdf-rendere
[2]
http://stackoverflow.com/questions/75500/best-way-to-convert-pdf-files-to-tiff-files#221341


Re: How to instruct tesseract not to use ligatures (i.e. don't use ﬁ, ﬂ... instead fi, fl...)

2012-03-31 Thread Zdenko Podobný


Dňa 31.03.2012 16:17, klo  wrote / napísal(a):
> In my simple testing, I find this most common problem, is there a way to 
> instruct tesseract not to use those glyphs without limiting it to ASCII?
>
> I use tesseract 3.01 BTW
>
Put them on the blacklist with the variable tessedit_char_blacklist (search
the forum if you do not know how).
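
As a rough sketch, assuming the usual config-file format of one
"variable value" pair per line, the blacklist line could be generated like
this. The helper names BlacklistConfigLine and LatinLigatures are made up for
the example, and the list of five ligatures (U+FB00 to U+FB04) is only a
guess at which glyphs you want to exclude; check what your language data
really produces.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: builds one config-file line of the form
// "tessedit_char_blacklist <characters>".
std::string BlacklistConfigLine(const std::vector<std::string>& glyphs) {
  std::string line = "tessedit_char_blacklist ";
  for (const std::string& glyph : glyphs) line += glyph;
  return line;
}

// The five Latin ligatures U+FB00..U+FB04 (ff, fi, fl, ffi, ffl),
// written as UTF-8 byte sequences.
std::vector<std::string> LatinLigatures() {
  return {"\xEF\xAC\x80", "\xEF\xAC\x81", "\xEF\xAC\x82",
          "\xEF\xAC\x83", "\xEF\xAC\x84"};
}
```

The string returned by BlacklistConfigLine(LatinLigatures()) can be written
to a config file (saved as UTF-8 without BOM) and that file passed to
tesseract on the command line.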

Zdenko


Re: Different output for almost identical images

2012-04-06 Thread Zdenko Podobný

Dňa 06.04.2012 17:35, Rufus wrote / napísal(a):
> Thanks for the reply.
>
> I've tried another image(bad2.tiff), which is still a bit different from 
> good.tiff, and is of the same order regarding the compression ratio. 
> However, tesseract still doesn't output anything for bad2.tiff.
> I then tried to feed tesseract with only the first character, and there is 
> works for bad_char.tiff (from bad.tiff) but it doesn't work for 
> bad2_char.tiff (from bad2.tiff).
>
> Commands:
> tesseract bad_char.tiff bad_char -l eng -psm 10 nobatch digits
> tesseract bad2_char.tiff bad2_char -l eng -psm 10 nobatch digits
>
>
> All the images attached are actually thresholded. I guess there is not much 
> room for improvement there. I've also tried by training tesseract with a 
> new language consisting only of digits with a particular font (font: Impact 
>  looks like the font in the images). Do you also experience these 
> problems when using tesseract?
>
I think the problem is with the size of the text, the resolution and the
missing border. I tried this:
convert -border 500 -resample 300 -density 300 -resize 50 bad2.tiff bad2.png
and then
tesseract bad2.png bad2
produced results.


Re: segfaulting again (svn 3.02)

2012-04-12 Thread Zdenko Podobný


Dňa 12.04.2012 18:09, Falke wrote / napísal(a):
> i hope this posts in the right order
>
> addendum to my reply from 30 minutes ago:  I rebuilt the bad build.
> Didn't help; same error.
>
Does it mean it segfaults? Can you provide more info (OS, platform, how
you run OCR...)?

I tried it (tesseract segfault_trigger.jpg segfault_trigger) on openSUSE
12.1 (x86_64) and Windows XP SP3 (32bit) and I have no problem.

Did you modify the source? If yes, can you please test the unmodified source?

Zdenko


Re: Tesseract Classification

2012-04-13 Thread Zdenko Podobný

2.01 is too old. So I would suggest using the tesseract executable only, or
upgrading to tesseract 3.02.
Save your data as an image (in a format that is recognized by tesseract 2.01)
and run:
   tesseract image_file output_file

Zdenko
Dňa 13.04.2012 06:20, Ankur Rana  wrote / napísal(a):
> Please see the attached files. Image file contains mixture
> of Punjabi and English language text. I have made Punjabi recognition
> system. What i need is to recognize the English text. When i run tesseract
> on bmp image it recognized English text successfully.  "*224_1.dat" *is
> binary file of the image file (224_1.bmp) that i have created. I want to
> pass 224_1.dat file as a input to tesseract for recognition of English text.
>
> On Fri, Apr 13, 2012 at 12:05 AM, Mayur Mudigonda <mayur.mudigo...@gmail.com> wrote:
>
>> Ankur, you need to be more specific. Please attach the image if possible
>> for other users on the thread to attempt to give you a solution. It is hard
>> to help without more information.
>>
>> M
>>
>>
>> On Thu, Apr 12, 2012 at 8:00 AM, Ankur Rana wrote:
>>
>>> I tired to pass only image binary data in text file to tesseract but not
>>> working. Can anybody explain how to tesseract read the image file?
>>>
>>>
>>> On Thu, Apr 12, 2012 at 12:45 PM, Mayur Mudigonda <
>>> mayur.mudigo...@gmail.com> wrote:
>>>
>>>> If this is from the command line - save it as .png image file and call
>>>> Tesseract
>>>>
>>>> It should be no different from any other image
>>>>
>>>> On Wed, Apr 11, 2012 at 11:24 PM, Ankur Rana wrote:
>>>>
>>>>> how can i pass my already binarized image data to Tesseract 2.01?
>>>>>
>>>>> On Thu, Apr 12, 2012 at 9:55 AM, Mayur Mudigonda <
>>>>> mayur.mudigo...@gmail.com> wrote:
>>>>>
>>>>>> I think if you were writing your own classification code, you'd have
>>>>>> to edit the classification (cpp files) and compile them manually. You
>>>>>> would also branch from the default Tesseract build.
>>>>>>
>>>>>> I am interested in analyzing the use of more powerful classifiers like
>>>>>> LSTM NN and Boltzmann machine based NNs. Although that is a more mammoth
>>>>>> task and I would require support from more people.
>>>>>>
>>>>>> M
>>>>>>
>>>>>> On Wed, Apr 11, 2012 at 9:04 PM, Karin  wrote:
>>>>>>
>>>>>>> Can we do another classification on tesseract?
>>>>>>> Currently I am using Tesseract 2.00  and I go through all the
>>>>>>> variables that can be set before recognize. Is there a mechanism to
>>>>>>> add KNN for classification in Tesseract 2.00?
>>>>>>
>>>>>> --
>>>>>>
>>>>>> URL:
>>>>>> www.cse.msu.edu/~mudigon1
>>>>>> www.blindsight.com/team
>>>>>> Elegance is not a dispensable luxury but a factor that decides between
>>>>>> success and failure.
>>>>>> Edsger Dijkstra
>>>>>
>>>>> --
>>>>> Regards
>>>>>
>>>>> ---
>>>>> Ankur Rana
>>>>> (ਅੰਕੁਰ ਰਾਣਾ)
>>>>
>>>> --
>>>>
>>>> URL:
>>>> www.cse.msu.edu/~mudigon1
>>>> www.blindsight.com/team
>>>> Elegance is not a dispensable luxury but a factor that decides between
>>>> success and failure.
>>>> Edsger Dijkstra
>>>
>>>
>>> --
>>> Regards
>>>
>>> ---
>>> Ankur Rana
>>> (ਅੰਕੁਰ ਰਾਣਾ)
>>>

Re: Configuration / Documentation

2012-04-13 Thread Zdenko Podobný

Dňa 13.04.2012 09:20, troplin wrote / napísal(a):
> Hello,
>
> is there any documentation about config files and configuration variables?
> I am especially interested in a list of the most important/useful variables 
> from a user point of view.
>
> Regarding config files and API, is the "api_config" file still used or ist 
> that just a relict from version 2?
>
> troplin
>
As far as I know, the only documentation for the variables is in the source
code. Tom showed in the docs for VS2008[1] how to list them easily on Windows
(linux users will use find & grep).

I am not sure which variables are important/useful - it will depend on
the circumstances (e.g. the *_debug_* variables).

The "api_config" file (tessdata/configs/api_config)[2] is a regular config
file (e.g. useful if you run tesseract from the command line) that sets the
variable tessedit_zero_rejection[3] to true.

Zdenko

[1] http://tesseract-ocr.googlecode.com/svn/trunk/vs2008/doc/tools.html#id2
[2]
http://code.google.com/p/tesseract-ocr/source/browse/trunk/tessdata/configs/api_config
[3]
http://zdenop.github.com/tesseract-doc/classtesseract_1_1_tesseract.html#a8ad03214a06d9531a0dae0a80207baaf


Re: Windows newline?

2012-04-25 Thread Zdenko Podobný

Dňa 25.04.2012 14:06, Nonmaskable Interrupt wrote / napísal(a):
> I just built 3.02 from svn using VS2008 and it seems to work fine, except
> that newline characters
> are Linux standard ('/n') instead of windows ('\r\n') standard.  This is a
> change from previous behavior;
> is it intentional?
>
It is intentional. The aim is to provide the same result regardless of
platform. The output (and the text input in training) should be:

  * a UTF-8 encoded text file without BOM[1]
  * with lines separated by the newline character "\n"

[1]http://en.wikipedia.org/wiki/Byte_order_mark
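
If a Windows program needs "\r\n", the conversion can be done after OCR in
your own code. Below is a minimal sketch; the function name ToWindowsNewlines
is made up for the example and it is not part of tesseract.

```cpp
#include <cassert>
#include <string>

// Hypothetical post-processing helper: converts "\n" line endings to "\r\n".
// A "\n" that is already preceded by "\r" is left unchanged.
std::string ToWindowsNewlines(const std::string& text) {
  std::string out;
  out.reserve(text.size());
  for (std::string::size_type i = 0; i < text.size(); ++i) {
    if (text[i] == '\n' && (i == 0 || text[i - 1] != '\r')) out += '\r';
    out += text[i];
  }
  return out;
}
```

You would call it on the string returned by GetUTF8Text() (or on the content
of the output .txt file) before passing the text on.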

Zdenko


Re: how to know the unlabelled blobs location!?

2012-05-03 Thread Zdenko Podobný


Dňa 02.05.2012 16:46, nkantan r wrote / napísal(a):
> hi all!
> find below the log on generating a tr file;
>
> 
> Page 0
> APPLY_BOXES:
>Boxes read from boxfile:3312
>Boxes failed resegmentation:   0
>Found 3312 good blobs and 3 unlabelled blobs in 0 words.
>0 remaining unlabelled words deleted.
> TRAINING ... Font name = TAMKambanNarrow
> Generated training data for 220 words
> Page 1
> APPLY_BOXES:
>Boxes read from boxfile:3312
>Boxes failed resegmentation:   0
>Found 3312 good blobs and 3 unlabelled blobs in 0 words.
>0 remaining unlabelled words deleted.
> Generated training data for 232 words
> 
>
> normally i get "0 unlabelled blobs in 0 words" and if i deliberately
> deleted any boxes i get "nn boxes in 0 words"; but in this particular
> tif and box files all orginally generated boxes are labelled (either
> individually or after merging or splitting); so no blob is left
> unlabelled; i went through the box/tif file using jTess box editor;
> but i could not locate any unlabelled blobs;
> is there a way to generate the box-coordinates in the log file so that
> i can definitely check that all boxes are covered?
>
> regards
> rnkantan
>
I am not sure if I understand you correctly. Do you need to visualize
(e.g. draw a rectangle) based on this kind of message [1]?

Or is there no such message ([1]) in your log file?
It would be good to post your files somewhere for further testing...


[1] APPLY_BOXES: Unlabelled word at :Bounding box=(239,3113)->(396,3153)
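
If the log contains messages like [1], the coordinates can be pulled out with
a few lines of code and then drawn or compared with the box file. This is
only a sketch that assumes the message format is exactly as in [1]; the names
BoxCorners and ParseBoundingBox are made up for the example.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Two corner points as printed in the log message: (x1,y1)->(x2,y2).
struct BoxCorners { int x1, y1, x2, y2; };

// Hypothetical helper: returns true and fills *box if the line contains
// "Bounding box=(x1,y1)->(x2,y2)"; returns false otherwise.
bool ParseBoundingBox(const std::string& line, BoxCorners* box) {
  const std::string marker = "Bounding box=";
  std::string::size_type pos = line.find(marker);
  if (pos == std::string::npos) return false;
  return std::sscanf(line.c_str() + pos + marker.size(), "(%d,%d)->(%d,%d)",
                     &box->x1, &box->y1, &box->x2, &box->y2) == 4;
}
```

Feeding every line of the training log through ParseBoundingBox gives the
list of unlabelled regions to check in the box editor.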

--
Zdenko


Re: Tesseract set accuracy as speed

2012-05-09 Thread Zdenko Podobný

I would expect "For English (and a few other languages)" = "that in svn"

--
Zdenko

Dňa 06.05.2012 13:40, Sriranga(78yrs) wrote / napísal(a):
> which are *few other* languages likely have cube module is trained as well
> as tesseract?Accucary is preferred to speed.
>
> On Sun, May 6, 2012 at 1:18 PM, David Eger  wrote:
>
>> We got rid of this particular function for the new release.
>>
>> For English (and a few other languages) the Cube module is trained as
>> well as Tesseract, and combining them may get you ~13% better results,
>> though it will be ~4x slower.  To run Tesseract in this mode, you can
>> pass tesseract::OEM_TESSERACT_CUBE_COMBINED  as the OcrEngineMode
>> option to TessBaseAPI::Init().
>>
>> -david
>>
>>


Re: Error In Code

2012-05-17 Thread Zdenko Podobný

What kind of problem? IMO the original reporter did not reveal what the
real problem was... ;-)

--
Zdenko
Dňa 16.05.2012 21:24, Aaron Campos wrote / napísal(a):
> I have the same problem. how resolve it?
>
> El viernes, 6 de enero de 2012 11:31:42 UTC+1, Lahiru Himash Madusanka 
> escribió:
>> It has been solved Mr: Zdenko. Thank your For your help
>>
>> On 1/5/12, zdenko podobny  wrote:
>>> On Thu, Jan 5, 2012 at 1:31 PM, Lahiru Himash Madusanka <
>>> lahiru.lahirumadusa...@gmail.com> wrote:
>>>
 Will This code work in cmd. It doesn't gives me a output

>>> What does it mean "It doesn't gives me a output"?
>>> In cmd?
>>> In your program?
>>> It does not create output file?
>>> It creates empty file?
>>> Or something else?
>>>
>>> Provide details otherwise nobody will help you. Also you did not write
>>> if eurotext.tif or phototest.tif work for you or not...
>>> Do not try to use custom language file if English is not working in your
>>> program...
>>>
>>>
 "G:\Visual Studio 2008\Akshara Sinhala OCR\Akshara Sinhala
 OCR\bin\Debug\tesseract.exe" "C:\Documents and Settings\CR-PC-01\My
 Documents\My Pictures\untitled.JPG" "TEmp" -l sinhala

 On 1/2/12, Lahiru Himash Madusanka 
 wrote:
> OK. I'll check them and give feedback to you.
>
> On 12/31/11, zdenko podobny  wrote:
>> On Sat, Dec 31, 2011 at 11:21 AM, Lahiru Himash Madusanka <
>> lahiru.lahirumadusa...@gmail.com> wrote:
>>
>>> Yes. My program was Written in Visual Basic. Also uses Tesseract.exe
>>> So my program is using Shell() function in vb to call this
>>> tesseract.exe
>>>
>>> This is reason why I asked you to show more lines of code because
>>> there
>> could be several issues:
>>
>>- you could call shell wrong way (I do not have experience with 
>> VB,
>> but
>>it take me time to call tesseract from python correctly ;-) )
>>- there could be problem with image - please always make test
>>with eurotext.tif or phototest.tif that are included in tesseract
>> source
>>- make sure that your command works in command line (as I already
>>pointed: your command can not work because of missing quote(s)
>>
>>
>>
>>>  On Dec 31, 3:15 pm, zdenko podobny  wrote:
 On Sat, Dec 31, 2011 at 11:00 AM, Lahiru Himash Madusanka <

 lahiru.lahirumadusa...@gmail.com> wrote:
> No that code is i'm using at command line.
 You wrote: "I'm using tesseract in my own written Program"... now
 you
>>> wrote
 you use tesseract in command line. Please clarify.

> It doesn't gives me a
> output. The Language is Sinhala.
 I asked about programming language because of you mentioned that 
>> you
>>> wrote
 Program. Anyway you try to use English (-l eng)







> This is my command line code
> C:\Documents and Settings\CR-PC-01\Desktop\New Folder
> (2)\tesseract.exe "C:\Documents and Settings\CR-PC-01\My
> Documents\My
> Pictures\VidBlasterWS.jpg" output.txt -l eng -psm 3"
> As I already mentioned - it can not work because of missing
> quotes...
> On 12/30/11, zdenko podobny  wrote:
>> On Fri, Dec 30, 2011 at 5:14 AM, Lahiru Himash Madusanka <
>> lahiru.lahirumadusa...@gmail.com> wrote:
>>> I'm using tesseract in my own written Program. It uses
>>> Tesseract.exe
>>> as it's engine.
>>> Here is my code
>>> {C:\Documents and Settings\CR-PC-01\Desktop\New Folder
>>> (2)\tesseract.exe "C:\Documents and Settings\CR-PC-01\My
>>> Documents\My
>>> Pictures\VidBlasterWS.jpg" output.txt -l eng -psm 3"}
>>> well it does not look like code to me ;-) Also number of 
>> quotes
>>> is
>>> odd
> and
>> I think it should be even (at least in languages I am 
>> familiar).
>>> Please
>> provide more lines of your code...
>>> But this doesn't output a text file. But I can use it by
>>> Windows
>>> Command line.
>>> Does it mean it creates output when you run that command in 
>> cmd
>>> (command
>> line)?
>> Please provide more details (e.g. what programming language 
>> you
>> use).
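
A minimal sketch of the quoting advice from this thread, assuming the calling program is written in Python rather than VB (Zdenko mentions Python above). Passing the arguments as a list avoids the odd-number-of-quotes problem, because every path with spaces is quoted as a whole. The executable path and the image path are hypothetical; eurotext.tif is the test image suggested in the thread, and -psm is the flag spelling used in the thread for 3.x.

```python
import subprocess


def build_tesseract_command(exe, image, output_base, lang="eng", psm=None):
    """Return the tesseract command as an argument list (no manual quoting)."""
    cmd = [exe, image, output_base, "-l", lang]
    if psm is not None:
        cmd += ["-psm", str(psm)]
    return cmd


def run_tesseract(cmd):
    """Run the command and return the CompletedProcess for inspection."""
    return subprocess.run(cmd, capture_output=True, text=True)


# Hypothetical locations, chosen only because they contain spaces.
cmd = build_tesseract_command(
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\My Pictures\eurotext.tif",
    "output",  # tesseract appends .txt itself, so no extension here
    psm=3,
)

# The equivalent command line for cmd.exe, with balanced quotes.
print(subprocess.list2cmdline(cmd))
```

If run_tesseract(cmd) produces no output file, its returncode and stderr are the details worth posting to the list.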

Creating searchable pdf with tesseract and pdfbeads

2012-05-27 Thread Zdenko Podobný

Maybe you could write a blog post (and then post the link to the forum ;-) )
about your workflow (the changes you needed, the time spent on each step, etc.).

This could also be useful for communities outside tesseract.

--
Zdenko


On 26.05.2012 09:01, Galt wrote:
> Here's my pdf if anyone is interested:
>
> http://folkplanet.com/seanchlo/gortoir/GortOir.pdf
>
> Made with scanTailor, jbigenc, pdfbeads and Tess3.01.
>

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to
tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

Re: Tess3.01 hocr output not working with pdfbeads

2012-05-27 Thread Zdenko Podobný

Well, thanks should go to David, who fixed the code, and to Galt, who
reported and tested it.

My problem (apart from lack of time ;-) ) is that there is no working hOCR
validation tool. hocr-tools[1] has something, but it seems to have a problem
with recent Python and PyXML[2] (I only did a quick test). I saw some attempts
to replace PyXML with lxml, but they were marked "needs to be tested"...

So it would be good if somebody fixed hocr-tools.

Then we need to create a test case and compare the hOCR output of
different tools (e.g. cuneiform[3], djvu2hocr[4], MODI2hocr[5]...) to
see what kind of information they provide...

Any help with these tasks is appreciated ;-)

[1] http://code.google.com/p/hocr-tools/
[2] http://code.google.com/p/hocr-tools/issues/detail?id=2
[3] http://openocr.org/ or https://launchpad.net/cuneiform-linux
[4] http://jwilk.net/software/ocrodjvu
[5] http://code.google.com/p/modi2hocr/

--
Zdenko

On 26.05.2012 14:11, Sven Pedersen wrote:
> Zdenko,
> Thanks for your work on that! I'm excited about using hOCR for some
> projects, so I'm really glad that we're moving towards standards
> compliance.
> --Sven
>
> On Sat, May 26, 2012 at 2:57 AM, zdenko podobny  wrote:
>> Discussion could be found in (closed and open) Issues (;-) ).
>>
>> Initial hOCR support[1] comes from issue 263[2] and was submitted
>> by amkryukov.
>> As you can see this patch implemented 'ocr_word'and 'xocr_word'. They are
>> not part of hOCR spec.
>>
>>  'xocr_word'was changed[3] to 'ocrx_word'based on issue issue 492[4] that
>> complained its non conformity with hOCR spec.
>>
>> Yesterday David Eger commit patch that should fix tesseract-ocr hOCR output
>> to follow hOCR spec.
>>
>> I think we need to split this problem to several parts:
>>
>> A. Spec conformity. As far as I understood this is fixed (no report about
>> non conformity to hOCR spec).
>> B. Usability in other tools. This is a little bit tricky because it needs
>> support of author of other tools (e.g. pdfbeads). Example: if tesserac-ocr
>> produce valid hOCR document and some tool is not able to process it, than
>> IMO that tool should be fixed... But it depends on problem. From my point of
>> view pdfbeads 1.0.9 fixed ocrx_word problem so issue 711 should be closed.
>> C. Other problems/enhancements: e.g. "empty words". This need to tested
>> (improved) but I think other tools should be able to process it.
>>
>> [1] http://code.google.com/p/tesseract-ocr/source/detail?r=333
>> [2] http://code.google.com/p/tesseract-ocr/issues/detail?id=263&can=1&q=hocr
>> [3] 
>> http://code.google.com/p/tesseract-ocr/source/diff?spec=svn585&r=585&format=side&path=/trunk/api/baseapi.cpp
>> [4] http://code.google.com/p/tesseract-ocr/issues/detail?id=492
>>
>> --
>> Zdenko
>>
>> On Wed, May 23, 2012 at 11:15 AM, Galt  wrote:
>>> Thanks, Zdenko!
>>>
>>> I found most of those same links too.
>>>
>>> FYI here is Tess3.01 output:
>>>
>>> [hOCR markup stripped by the mail archive; the surviving word text is:
>>> Dul, fé, na, Gréine, ., ., ., ., 3]
>>>
>>> In a nutshell, Tess 3.01 outputs this pattern for each word:
>>>
>>> [markup stripped by the mail archive; only the word text "Dul" survives]
>>>
>>> And judging by pdfbeads code, tess 3.00 did something like this for
>>> each word:
>>> [markup stripped by the mail archive; only the word text "Dul" survives]
>>>
>>> pdfbeads 1.0.9 added a hack just to keep it from crashing
>>> when the ratio was 0 because ocrx_word does not have bbox info.
>>>     next if bbox == [0,0,0,0]
>>> This simple change does not actually make it use the bbox info that
>>> is in ocr_word.  In fact, the net result is that only the bbox info
>>> from
>>> the entire line is used, and actual word positions are just
>>> guestimated
>>> by the pdf viewer -- which is sometimes nearly right, and other times
>>> horribly wrong.
>>>
>>> I assume that the author of pdfbeads (Alexey Kryukov) understands this
>>> change in the output of Tess3.01.  Is he refusing to use ocr_word
>>> because
>>> it is not part of the standard ?  This was implied by Carlos.
>>>
>>> Is there some useful discussion of the hocr output change in 3.01
>>> somewhere?
>>>
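
To make the bbox discussion above concrete, here is a minimal sketch (not pdfbeads code) that reads word boxes from hOCR with the Python standard library. It assumes the layout Galt describes: an outer ocr_word span carrying the bbox in its title attribute, with an inner ocrx_word span that has no bbox. The sample markup is hypothetical, since the original markup did not survive in the archive. Words whose box is 0 0 0 0 are skipped, mirroring the pdfbeads 1.0.9 check quoted above.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WORD_CLASSES = {"ocr_word", "ocrx_word"}


class WordBoxParser(HTMLParser):
    """Collect (text, bbox) pairs for words that carry a usable bbox."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._open = []  # bbox (or None) for every currently open <span>

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        bbox = None
        if attrs.get("class") in WORD_CLASSES:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                bbox = tuple(int(v) for v in match.groups())
        self._open.append(bbox)

    def handle_endtag(self, tag):
        if tag == "span" and self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Use the innermost enclosing word span that has a bbox.
        bbox = next((b for b in reversed(self._open) if b is not None), None)
        if bbox is None or bbox == (0, 0, 0, 0):
            return  # same idea as "next if bbox == [0,0,0,0]"
        self.words.append((text, bbox))


# Hypothetical sample in the nested ocr_word / ocrx_word style.
SAMPLE = """
<span class='ocr_line' id='line_1' title="bbox 10 10 300 40">
<span class='ocr_word' id='word_1' title="bbox 10 10 60 40"><span class='ocrx_word' id='xword_1'>Dul</span></span>
<span class='ocr_word' id='word_2' title="bbox 0 0 0 0"><span class='ocrx_word' id='xword_2'>fé</span></span>
</span>
"""

parser = WordBoxParser()
parser.feed(SAMPLE)
print(parser.words)
```

A tool that reads only the inner span sees no bbox at all, which matches the symptom described above: only the line box is left to position the words.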


Re: Tess3.01 hocr output not working with pdfbeads

2012-05-28 Thread Zdenko Podobný


On 26.05.2012 23:09, Galt wrote:
> Wonderful news, Zdenko!
>
>> Yesterday David Eger commit patch that should fix tesseract-ocr hOCR output
>> to follow hOCR spec.
> I wonder what he did?
See [1] and [2]. Today I also committed r729 [3]. We tested the output with
pdfbeads (1.0.9) and ExactImage's hocr2pdf. The PDF was checked in evince (a
Linux PDF viewer).

I found out that pdfbeads cannot handle ocrx_line, so David reverted the
code to ocr_line.
hocr2pdf prints a warning for XML declarations, uses the title value in a
way that seems strange to me, and expects all words on one line (e.g. I cannot
indent ocrx_words). So there is no title value and no indentation of
ocrx_words.

So the current (r729) hOCR output is, from my point of view, a compromise
that works with both pdfbeads and ExactImage's hocr2pdf. The output is a valid
XHTML 1.0 Transitional document.

[1] 
http://code.google.com/p/tesseract-ocr/source/diff?spec=svn726&r=726&format=side&path=/trunk/api/baseapi.cpp

[2]
http://code.google.com/p/tesseract-ocr/source/diff?spec=svn728&r=728&format=side&path=/trunk/api/baseapi.cpp

[3]  http://code.google.com/p/tesseract-ocr/source/detail?r=729
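
In the absence of a working hOCR validator, a rough check is still possible with the Python standard library. The sketch below only tests XML well-formedness and counts which ocr_* / ocrx_* classes a file uses (e.g. ocr_line versus ocrx_line); it does not check conformance to the hOCR spec or validity against the XHTML DTD. The sample document is hypothetical and omits the XML declaration and DOCTYPE.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def hocr_class_counts(xhtml_text):
    """Parse the text as XML and count elements per hOCR class.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed.
    """
    root = ET.fromstring(xhtml_text)
    counts = Counter()
    for element in root.iter():
        cls = element.get("class")
        if cls and cls.startswith(("ocr_", "ocrx_")):
            counts[cls] += 1
    return counts


# Hypothetical, heavily shortened hOCR page.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="ocr_page" title="bbox 0 0 100 50">'
    '<span class="ocr_line" title="bbox 5 5 95 20">'
    '<span class="ocrx_word">TILE</span> '
    '<span class="ocrx_word">WALKWAY</span>'
    '</span></div></body></html>'
)

print(hocr_class_counts(SAMPLE))
```

Comparing these counts across the output of different tools is one cheap way to see which classes a consumer such as pdfbeads would need to understand.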

>> A. Spec conformity. As far as I understood this is fixed (no report about
>> non conformity to hOCR spec).
> Good.
>
>> B. Usability in other tools. This is a little bit tricky because it needs
>> support of author of other tools (e.g. pdfbeads). Example: if tesserac-ocr
>> produce valid hOCR document and some tool is not able to process it, than
>> IMO that tool should be fixed...
>> From my point
>> of view pdfbeads 1.0.9 fixed ocrx_word problem so issue 711 should be
>> closed.
> If one uses 1.0.9, as I noted, it stops the segfault, that's true.
>
> But you end up with a pdf in which the highlighted words
> are anywhere from reduced-accuracy to unusable.
Please send an example image to my e-mail. I could not reproduce it (I tested
only one page).
>
> When I use my patch for use with Tess 3.01,
> it gets the word-start-specific highlights that
> crisply align to the beginning of each word.
>
> It also prevents the more horrible output problems
> where it sometimes went very wrong, like on my
> table of contents page.
>
> I did not make pdfbeads do anything new.
> It used to work fine with word-perfect starts etc.
> on Tess 3.00.  All I did is change the code so
> that it uses the Tess 3.01 hocr format.
The 3.00 hOCR output does not conform to the hOCR spec. I also found out that
pdfbeads does not recognize all hOCR tags from the spec (e.g. ocrx_line).
>
>> C. Other problems/enhancements: e.g. "empty words". This need to tested
>> (improved) but I think other tools should be able to process it.
>>
> I recently patched pdfbeads just a little bit more
> to tolerate and ignore empty words or lines.
> Very handy for people who have to hand-tweak
> a few mistakes in the hocr output. After deleting
> some text, a word or line may become empty.
Please send me an image that generates empty words, and your latest pdfbeads
patch (just so I can see the expected changes).

BTW: the hOCR patch for tesseract-ocr was sent by user amkryukov (see issue
263[4]). The pdfbeads author's name[5] is Alexey Kryukov. I guess it is the
same person, and IMO this is the reason why the 3.00 hOCR output worked with
pdfbeads even though it does not follow the hOCR spec...

[4] http://code.google.com/p/tesseract-ocr/issues/detail?id=263
[5] http://rubygems.org/gems/pdfbeads

--
Zdenko


Re: Tesseract remove non-text regions

2012-06-26 Thread Zdenko Podobný

On 26.06.2012 13:47, christy wrote:
> You can call the following from baseapi.cpp as :-
> Boxa* boxa = NULL;
> Pixa* pixa = NULL;
> Pix *pix = tesseract_->pix_binary();
tesseract_ is a protected member, so it is not available.
IMO it could be replaced with:
Pix *pix = api->GetThresholdedImage();
> tesseract::ImageFinder::FindImages(pix,&boxa,&pixa);
In 3.02 this function was replaced/removed...

There is the option of using FindImages(Pix* pix) from imagefind.h[1], but
unfortunately this header file is not among the installed files...
[1]
https://code.google.com/p/tesseract-ocr/source/browse/trunk/textord/imagefind.h#44

> Hope this helps..
>
>
> On Apr 30, 10:22 pm, Eslam Mohamed  wrote:
>> please, provide me with a code sample on how to use this method i have the 
>> same problem, i'm suffering from removing non text ares manually,thanks
>>
>>
>>
>>
>>
>>> 
>>> From: Pavel Mazniker 
>>> To: tesseract-ocr@googlegroups.com
>>> Sent: Monday, April 30, 2012 6:15 AM
>>> Subject: Tesseract remove non-text regions
>>> Hi,
>>>
>>> I work on text recognition in complex image that contains also not-textual 
>>> regions.
>>>
>>> There is remove_nontext_regions function in osdetect.cpp
>>>
>>> Is it called "automatically" when Recognize(...) function called or should 
>>> I call the function explicitly before performing recognition in order to 
>>> increase accuracy ?
>>>
>>> The signature of the function:
>>>
>>> voidremove_nontext_regions(tesseract::Tesseract*tess,BLOCK_LIST*blocks, 
>>> TO_BLOCK_LIST*to_blocks){
>>>
>>> 1. Does it affect image set to tess before ? if yes, then should user 
>>> expect non-textual regions removed after call to the function ?
>>>
>>> 2. What should be putted to blocks and to_blocks parameters ?
>>>
>>> Thanks.



Re: debug mode for tesseract 3.01

2012-06-27 Thread Zdenko Podobný

On 21.06.2012 08:22, eva charles wrote:
> Is the procedure to run the debug mode for tesseract 3.01 same as that
> given in the ViewerDebugging wiki (
> http://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging)?
>
> Please help!
>

Yes.

--
Zdenko


Re: Tesseract can read a pic, but not a manipulated one

2012-07-03 Thread Zdenko Podobný

On 03.07.2012 10:32, Acanis wrote:
> Hey,
>
> Ill attach some pics to show my problem.
> I get a picture "Bohne.jpg", which is fine for tesseract. But the arrow get 
> sometimes a "1" and sometimes not!
1. I did not find an arrow in the English traineddata, so it will not be
recognized correctly (as an arrow).
2. Provide a test case that reproduces the arrow sometimes being
recognized and sometimes not. Otherwise nobody can help you.
> So I use GDI+ to draw something over the arrows. But then, I cant read some 
> of the result pictures with tesseract!
Provide examples of your results that tesseract cannot read. I assume
you have checked that the format is supported by Leptonica [1].
> Do you have an idea, why?!
> "Bohne_NoArrow.jpg", "Bohne_NoArrow_new.jpg" and 
> "Bohne_NowArrow_new_test.jpg" dont get results with tesseract...
You need to specify the page segmentation mode and you will get a result.
> Iam using tesseract V3.01 with standard configurations. I dont use a 
> special language pack and as parameter I just send the image and the the 
> name of the output-file.
>
> Thanks,
> Björn
>

[1] https://code.google.com/p/tesseract-ocr/wiki/ReadMe#Other_Dependencies

--
Zdenko


Re: tesseract-ocr does not output desired results

2012-07-17 Thread Zdenko Podobný

On 17.07.2012 02:32, Wei Liu wrote:
>
> My platform: Mac OS X 10.7.4 + Xcode 4.3.2 + OpenCV 2.4.0
>
>
> I want to use tesseract-ocr to recognize a few image (see attachment), and 
> I wrote a simple function to process the image using OpenCV, which is shown 
> as following
>
>
> char* wl_ocr(const IplImage* im)
>
> {
>
> // convert image to gray
>
> IplImage* imGray = wl_rgb2gray(im);
>
> cv::Mat matGray = imGray;
>
> 
>
> // initialize tesseract-ocr
>
> tesseract::TessBaseAPI tess;
>
> tess.Init("", "eng", tesseract::OEM_DEFAULT);
>
> tess.SetVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
> );
>
> // tess.SetVariable("tessedit_char_whitelist", 
> "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");
>
> tess.SetPageSegMode(tesseract::PSM_AUTO);
>
> 
>
> // process the image
>
> // tess.TesseractRect(matGray.data, 1, matGray.step1(), 0, 0, 
> matGray.cols, matGray.rows);
>
> tess.SetImage((uchar*)matGray.data, matGray.size().width, matGray.size
> ().height, matGray.channels(), matGray.step1());
>
> tess.Recognize(0);
>
> 
>
> // get the recognized text
>
> char* text;
>
> text = tess.GetUTF8Text();
>
> 
>
> // clean up
>
> cvReleaseImage(&imGray);
>
> 
>
> return text;
>
> }
>
>
> I got the following results:
>
>
> 0.png --> CAUTION
>
> 1.png --> TILE WAL
>
> 2.png --> SLIPPERY
>
>
> The correct one should be:
>
>
> 0.png --> CAUTION
>
> 1.png --> TILE WALKWAY
>
> 2.png --> SLIPPERY WHEN WET
>
>
> The images seem to be pretty simple and clean, but my function cannot 
> output the whole words but only part of the words. I am not sure if I 
> misconfigure something in my code or if there is anything wrong with my 
> code.
>
>
> BTW. I did not train tesseract-ocr, I simply copy eng.traineddata to 
> certain folder (/usr/local/share/tessdata)
>

Which version of tesseract are you using? I do not have time to test your
code at the moment, but I just tried this (using tesseract 3.02):

$ tesseract 0.png 0 && cat 0.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
CAUTION

$ tesseract 1.png 1 && cat 1.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
TILE WALKWAY

$ tesseract 2.png 2 && cat 2.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
SLIPPERY WHEN WET

It looks like tesseract 3.02 is able to OCR your images correctly (i.e. you
should upgrade to version 3.02 or debug your code).

--
Zdenko


Re: developing new program which passes memory buffer with OCR data to be recognized to tesseract library

2012-07-19 Thread Zdenko Podobný

On 19.07.2012 03:32, newtotesseract wrote:
> Hi,
>
> Thanks for the suggestion.
> I found the thread "Include Tesseract in C++ 
> code" 
> closer to what I am looking for.
>
> But, did not get how to create static archive library (.lib) of tesseract 
> on Windows.
> I am using tesseract 3.01 code base and also the VS2008 projects for 3.01.
There is no library build on Windows for version 3.01 (or 3.00). You need
to use version 3.02 (in svn).
>
> Thanks,
>
> On Thursday, July 19, 2012 1:13:03 AM UTC+8, sventech wrote:
>> Please search the previous posts, and you'll find several discussions: 
>> https://groups.google.com/forum/?hl=en&fromgroups#!searchin/tesseract-ocr/image$20buffer
>>  
>>
>> --Sven
>>  
>>
>> On Wed, Jul 18, 2012 at 4:19 AM, newtotesseract wrote: 
>>> Hi, 
>>>
>>> I am new to tesseract and trying to integrate tesseract library in our 
>>> existing text recognition application. 
>>>
>>> I want to pass the image file data through a memory buffer to tesseract 
>>> library for character recognition. 
>>>
>>> Can this be done using the "./api/.libs/libtesseract.a" library on 
>> Linux? 
>>> How can we do this on Windows? Can we generate just libtesseract.lib on 
>>> Windows, instead of the entire exe? 
>>>
>>> Please guide. 
>>>
>>> Thanks for your time and help. 
>>>
>>> Best Regards, 
>>>
>>
>>
>> -- 
>> ``All that is gold does not glitter, 
>>   not all those who wander are lost; 
>> the old that is strong does not wither, 
>>   deep roots are not reached by the frost. 
>> From the ashes a fire shall be woken, 
>>   a light from the shadows shall spring; 
>> renewed shall be blade that was broken, 
>>   the crownless again shall be king.” 
>>



Re: broken link: http://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-3.02.tar.gz on http://tesseract-ocr.googlecode.com/svn/trunk/vs2008/doc/setup.html#initial-build-directory-setup

2012-07-26 Thread Zdenko Podobný

On 26.07.2012 05:44, newtotesseract wrote:
> Hi,
>
> Link 
> http://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-3.02.tar.gz
>  referenced 
> on Setting up 
> Tesseract-OCR
>  is 
> broken.
> I think, tesseract-3.02 is still not available and so this is failing.
>
>
As you can see, this document is part of svn, i.e. it belongs to the next
release, 3.02, which has not been published yet, and therefore the link is
not correct (yet).

--
Zdenko



Re: Feature request: ranking of dictionary word frequency

2012-08-23 Thread Zdenko Podobný

On 23.08.2012 13:08, Nick White wrote:
> A great addition to training would be if one dictionary file was
> used, combining freq-words and all-words, and a relative frequency
> probability score was given to each word. This would allow more
> fine-grained scoring based on exactly how likely the word is to
> appear, which would be a win.
>
> Obviously for many cases such word frequency scores might be hard to
> generate, but for others (such as mine) it isn't at all, if the word
> list is generated from a large corpus of existing text.
>
> Would others find such a feature useful? Also, would I be better off
> posting this to the bug tracker?
>
Please post it as an issue (feature request)[1].

[1] http://code.google.com/p/tesseract-ocr/issues/entry
--
Zdenko
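
For anyone who wants to prepare such data, here is a minimal sketch of how relative word frequencies could be derived from a corpus, as Nick describes. It is only an illustration of the counting step: tesseract's training tools did not accept frequency scores at the time of this thread (that is the feature being requested), and the tokenisation rule (runs of letters, lower-cased) is an assumption.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Map each word to its share of all word occurrences in the text."""
    # Runs of Unicode letters only; digits and underscores are excluded.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


corpus = "the cat and the hat"
freqs = relative_frequencies(corpus)
for word, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{word}\t{freq:.2f}")
```

Sorting by frequency and cutting the list at some threshold would give something like the existing freq-words / all-words split.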


Re: Extracting text from Gas Sign

2012-09-02 Thread Zdenko Podobný

Did you read the FAQ? Did you search this forum for image preprocessing?
Which suggestions have you already tried?

--
Zdenko

On 01.09.2012 22:17, Mark Stephens wrote:
> Perhaps it was a poor assumption but I would have thought it would be 
> relatively easy to extract the text from a gas sign.  I've tried several 
> different psm settings as well as different variations of the same image.   
>  Is there an easy way to improve my results?
>
> On the whole image I get this text:
>
>
> , , (W, 
> unleaded 2:52,
>
> Q75?
>
> -__,. ~;Q~
>
>  
>
> diAesel game
>
> @9991
>
>
>


Re: does the latest version of Tesseract OCR perform auto rotation on image out of the box

2012-09-02 Thread Zdenko Podobný

On 02.09.2012 13:40, js wrote:
> *What steps will reproduce the problem?*
> 1. Ran the tesseract.exe [v 3.0+] with populated engilsh tessdata 
> directory. 
>
> tesseract.exe my_image.tif myimage_out -l eng -psm 0
>
>
> *What is the expected output? What do you see instead?*
> the myimage_out.txt was gibberish, image is rotated scan document. 
> Was expecting to see an auto rotated image
>
> *What version of the product are you using? On what operating system?*
> Latest tessreact version, using the installer on Win 7
>  
> any info would be helpful
>
If you need to rotate the image, I think you have chosen the wrong
application: tesseract outputs text, not images.

If I remember correctly, there was a hint in the forum on how to use tesseract
to detect text rotation (just four directions: 0, 90, 180, 270). Try
searching the forum for "osdetect" and "page detection".

If you need text output and your problem is a rotated image, have a look
at issue 643[1]: the problem is reported there, with some suggested
solutions (and possible issues).

[1] http://code.google.com/p/tesseract-ocr/issues/detail?id=643

PS: please do not describe the version with statements like "[v 3.0+]" or
"Latest": people still have to guess your version. Somebody searching
the archive a year from now will have no clue what the latest version
available today was.

--
Zdenko


Re: where is the tesseract 3.02 ?

2012-09-09 Thread Zdenko Podobný

yes, at the moment 3.02 is in svn repository at 
http://tesseract-ocr.googlecode.com/svn/trunk/


On 08.09.2012 21:54, fulberto100 wrote:

3.02 version is this?
svn checkout *http*://tesseract-ocr.googlecode.com/svn/trunk/
tesseract-ocr-read-only


On Saturday, September 8, 2012 12:38:28 PM UTC+3, zdenop wrote:



On Sat, Sep 8, 2012 at 11:01 AM, fulberto100 

wrote:
hi all.

i used tesseract 3.01 in my iOS app. but it scans slow.


You mean you use tesseract for OCR and not for scanning, right?
  


after some googling i found that there is 3.02 version. (
http://stackoverflow.com/questions/11630640/how-can-i-make-tesseract-on-ios-faster
)
but i couldnt find the svn for 3.02.


Tesseract has only one svn repository, and tesseract 3.02 is there. How
to install tesseract from svn is described on the wiki. Unfortunately there
is no information on when 3.02 will be released.

  
is there anyone who has an iOS sample with tesseract 3.02?



--
Zdenko




Re: Can tesseract accept regions?

2012-10-22 Thread Zdenko Podobný


On 22.10.2012 10:15, satuon wrote:

Can I specify the regions where usable text is to tessaract, so it doesn't
try to OCR the entire page? The page can contain pictures and other
non-text areas.


Yes you can - have a look at SetRectangle[1] in tesseract API

[1] 
https://code.google.com/p/tesseract-ocr/source/browse/trunk/api/baseapi.h?r=760#334


--
Zdeno


Re: Working directories

2012-10-22 Thread Zdenko Podobný


On 22.10.2012 15:04, José Luis Rey wrote:

First Thanks for this fantastic project, I will try to collaborate all I
Can!!

It's possible to set working dirs on Command line?



Can you please clarify your understanding of "working dirs"?

--
Zdenko


Re: Can't make working tesseract in simple project vs2008 tess3.02.02

2012-11-04 Thread Zdenko Podobný


1. Download tesseract-ocr-3.02.02-win32-lib-include-dirs.zip
   and tesseract-ocr-API-Example-vs2008.zip
2. Unpack them to the same directory (so all relative references should work)
3. Open and compile the example VC++ 2008 solution
4. Copy the resulting executable, the tesseract dll and the leptonica dll
   (in case you did a shared build) to the same directory (or put the dlls
   in a directory that is in your %PATH%)
5. Run the executable
6. Modify the example project to your needs...
7. Share results, experience, know-how, write a how-to, etc...

--
Zdenko

On 29.10.2012 14:22, Dob wrote:

Hi! I'm realy sorry about my problem. I'm interesting in writing my own
project using tesseract. I had to read wiki, vs2008 faq, instruction.
I have succeded in building tesseract 3.02.02 from source, joined with
leptonica headers and lib's. Then i just created my own project in vs2008,
copied tesseract headers in my input directory (more headers than in
release notes), leptonica's headers and lib+dll, tesseract308(d).lib(dll)
in my lib directory. in additional include directories paste address of
directories with files, in linker paste addres of my lib folder.
added in my main.cpp

#include "allheaders.h"
#include "baseapi.h"
#include "strngs.h"

and than bad things going to happen
as like as

1>24.obj : error LNK2028: unresolved token (0A1C) "public: virtual

__clrcall tesseract::LTRResultIterator::~LTRResultIterator(void)"
(??1LTRResultIterator@tesseract@@$$FUAM@XZ) referenced in function "public:
virtual __clrcall tesseract::ResultIterator::~ResultIterator(void)"
(??1ResultIterator@tesseract@@$$FUAM@XZ)


1>24.obj : error LNK2019: unresolved external symbol "public: virtual

__clrcall tesseract::LTRResultIterator::~LTRResultIterator(void)"
(??1LTRResultIterator@tesseract@@$$FUAM@XZ) referenced in function "public:
virtual __clrcall tesseract::ResultIterator::~ResultIterator(void)"
(??1ResultIterator@tesseract@@$$FUAM@XZ)


I just need to make work simple application, but i failed. please help me.
If you don't know the reason of this errors may be if it is possible you
may send simple example vs2008 easiest application. thanks!



--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to
tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

Re: Confidence in HOCR file

2012-11-14 Thread Zdenko Podobný

Word confidence in hOCR was implemented after the release of 3.02.02 ;-) so
you need to check out the current svn and recompile.
For char confidence you need to use the tesseract API (i.e. to program it).
Search this forum (maybe also the issues) for an example.
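
Once a build with word confidence is available, the values can be read back from the hOCR output. The sketch below assumes that each word is a span with class "ocrx_word" whose title attribute carries "x_wconf <number>"; the SAMPLE fragment and the names WordConfidenceParser and word_confidences are made up for illustration, so check them against the output of your own build.

```python
import re
from html.parser import HTMLParser


class WordConfidenceParser(HTMLParser):
    """Collect (text, confidence) pairs from hOCR word spans."""

    def __init__(self):
        super().__init__()
        self.words = []    # finished (text, confidence) pairs
        self._depth = 0    # span nesting depth inside the current word
        self._conf = None  # confidence of the word being read
        self._text = []    # text pieces of the word being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            # A span nested inside a word span: only track the nesting.
            self._depth += 1
            return
        attributes = dict(attrs)
        if attributes.get("class") == "ocrx_word":
            match = re.search(r"x_wconf\s+(-?\d+)", attributes.get("title") or "")
            self._conf = int(match.group(1)) if match else None
            self._text = []
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._text).strip(), self._conf))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def word_confidences(hocr):
    """Return a list of (word, confidence) tuples found in an hOCR string."""
    parser = WordConfidenceParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Made-up hOCR fragment, for illustration only.
SAMPLE = """<span class='ocr_line' title='bbox 10 10 200 40'>
<span class='ocrx_word' id='word_1_1' title='bbox 10 10 90 40; x_wconf 87'>hello</span>
<span class='ocrx_word' id='word_1_2' title='bbox 100 10 200 40; x_wconf 62'>w<em>or</em>ld</span>
</span>"""

if __name__ == "__main__":
    for word, conf in word_confidences(SAMPLE):
        print(word, conf)
```

Words without an x_wconf entry come back with a confidence of None, so output from an older build does not crash the parser.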


--
Zdenko

On 14.11.2012 15:40, José Luis Rey wrote:

Hello friends, is it possible to get char/word confidence in an hOCR file?

I'm working hard on a FREE forms scanning/recognition project and I'm planning to use
Tesseract as the OCR engine.

Regards,
Rey




Re: Having traindata files uncombined

2012-11-15 Thread Zdenko Podobný

Can you please use version 3.02 instead of 3.01 and post the exact error
message?
You can copy text from the Windows console: select the relevant
text/lines with the left mouse button pressed, then click with the right mouse
button outside the selected text but inside the console window. The highlight will
disappear and the selected text should then be in the clipboard, so Ctrl+V
should paste it into an e-mail...


--
Zdenko

On 12.08.2012 15:57, Chathuri Gunawardhana wrote:

I'm running tesseract .01. My OS is Windows 7. I added the files as you
said. But when I run the command "tesseract input output bazaar" it says it
can't find the file eng.user-words. But the file is there.

Thanks!

On Sun, Aug 12, 2012 at 4:37 PM, zdenko podobny  wrote:


please post details (OS, tesseract version, exact error message...)

--
Zdenko

On Sun, Aug 12, 2012 at 7:32 AM, Chathuri Gunawardhana <
lanch.gunawardh...@gmail.com> wrote:


I followed
http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data
but I'm getting the error "could not open user-data". The user data file is actually in
the correct location, but it says that the file is not there. Any suggestions?

Thanks!

On Sat, Aug 11, 2012 at 6:48 PM, Chathuri Gunawardhana <
lanch.gunawardh...@gmail.com> wrote:



-- Forwarded message --
From: zdenko podobny 
Date: Sat, Aug 11, 2012 at 6:38 PM
Subject: Re: Having traindata files uncombined
To: tesseract-ocr@googlegroups.com


Yeah - it is much better ;-)
Unfortunately at the moment I do not have time for deep testing so here
are my suggestions:

- if you are using tesseract via api, try to set rectangles (instead
of whole image) with coords of city names to avoid "noise" (e.g. contours)
from map. tesseract is "noise sensitive" and noise can decrease ocr quality
- if you are using tesseract executable try to extract city names to
individual images
- after this you can start to play with dictionaries ;-)
- you can use user_words "outside" of traineddata file see [1]
- try to play with page segmentation parameter (psm)
- I am not aware how to increase (or decrease) strength of
dictionaries in tesseract 3.02 (e.g. to force tesseract to output only
words from dictionaries...)

I believe after this you can at least evaluate if tesseract is suitable
for your task...

[1]
http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data
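
A minimal sketch of the "user words outside of the traineddata file" setup, assuming the layout described on the manual page [1]: a config file under tessdata/configs/ that sets user_words_suffix, plus a word list named <lang>.user-words in tessdata/. The config name "bazaar" and the file name eng.user-words are the ones used later in this thread; the function name write_user_words and the example city names are only illustrative.

```python
import tempfile
from pathlib import Path


def write_user_words(tessdata_dir, words, lang="eng", config_name="bazaar"):
    """Write a config file and a user word list below tessdata_dir.

    Creates <tessdata_dir>/configs/<config_name> containing the single
    setting 'user_words_suffix user-words' and
    <tessdata_dir>/<lang>.user-words with one word per line.
    Returns both paths.
    """
    tessdata = Path(tessdata_dir)
    config_path = tessdata / "configs" / config_name
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text("user_words_suffix user-words\n", encoding="utf-8")

    words_path = tessdata / f"{lang}.user-words"
    words_path.write_text("".join(word + "\n" for word in words), encoding="utf-8")
    return config_path, words_path


if __name__ == "__main__":
    # Demo in a throw-away directory; point tessdata_dir at the real
    # tessdata directory only after checking where your installation keeps it.
    with tempfile.TemporaryDirectory() as tmp:
        config_path, words_path = write_user_words(tmp, ["Colombo", "Kandy"])
        print(config_path.read_text(), words_path.read_text(), sep="")
```

With the files in place, the run would be "tesseract input output bazaar", as tried later in this thread. If tesseract reports that it cannot find the word list, the first thing to verify is which tessdata directory it actually uses (see TESSDATA_PREFIX).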

--
Zdenko

On Sat, Aug 11, 2012 at 2:23 PM, Chathuri Gunawardhana <
lanch.gunawardh...@gmail.com> wrote:


Actually you can use the image at
http://www.taprobanetravels.com/images/map-of-sri-lanka.jpg. It is of
higher quality than the one above.


On Sat, Aug 11, 2012 at 4:40 PM, zdenko podobny wrote:


On Sat, Aug 11, 2012 at 12:58 PM, Chathuri Gunawardhana <
lanch.gunawardh...@gmail.com> wrote:


Image that I'm trying to identify is attached. Most words in here are
not identified correctly. I added these words to user words and combined.
But still didn't get the expected output.



your attached image has insufficient quality - I get no output for
it...

--
Zdenko




--
Chathuri Gunawardhana
Undergraduate at University of Moratuwa
Sri Lanka





Re: Can I configure Tesseract to always match a dictionary word?

2012-11-15 Thread Zdenko Podobný

Regarding "user_patterns_suffix", have a look at the tesseract manual page [1].
I am not sure whether it is possible to force tesseract to choose its OCR
output only from the dictionary (I never tried it ;-) ).
But you can increase the dictionary strength with the variables
language_model_penalty_non_freq_dict_word and
language_model_penalty_non_dict_word. See the FAQ [2].

[1]
http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data
[2]
http://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?
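
As a rough sketch, the two variables can be put into a tesseract config file, which holds one "name value" pair per line. The numbers below are arbitrary placeholders, not recommended settings, and the names config_text and PENALTIES are hypothetical helpers; suitable values have to be found by experiment.

```python
def config_text(variables):
    """Render tesseract config-file lines: one '<name> <value>' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in variables.items())


# Placeholder values chosen only for illustration.
PENALTIES = {
    "language_model_penalty_non_freq_dict_word": 0.5,
    "language_model_penalty_non_dict_word": 0.8,
}

if __name__ == "__main__":
    print(config_text(PENALTIES), end="")
```

The resulting text would be saved as a file in tessdata/configs/ and its name passed as the last argument on the tesseract command line.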

--
Zdenko

On 03.09.2012 16:02, ms wrote:

Aidano

Did you manage to solve this problem? We have the exact same question?
Would really be interested in any solutions

thanks

On Thursday, March 22, 2012 8:37:44 AM UTC+8, aidano wrote:

I'd like to configure tesseract with a small dictionary (~200 words) and
tell it to always choose the best match in the dictionary. Is that possible?

Also, when inspecting the source code I saw a variable in dict.h called
"user_patterns_suffix". Is there any documentation around this? I'd like to
see if I can use it to tell Tesseract that my images will always contain
one serial number that always has 19 characters with no spaces.


Re: Problem with ViewerDebugging with tesseract 3.02.02

2012-11-18 Thread Zdenko Podobný

Did you compile tesseract yourself? If yes, did you use the standard
compilation process (autotools)? Which options did you use? Can you
send me config.log? Or did you use the version from your distribution?

Do you have only one installation of tesseract on your system?

--
Zdenko

On 18.11.2012 01:19, Linda Li wrote:

Here are some more details

(1)
Use the original svutil.cpp (without removing ">/dev/null 2>&1"),
export SCROLLVIEW_PATH,
then run:
tesseract temp.jpg output segdemo inter

The terminal output is:

Tesseract Open Source OCR Engine v3.02.02 with Leptonica
ScrollView: Waiting for server...
ScrollView: Waiting for server...

The viewerdebugging window pops out, but with no segmentation picture in it.
see the attachment.

I use Ctrl+C to terminate the procedure.
I use quit of the icon of the ViewerDebugging window to close the window.

I run the command on terminal:
java -Xms512m -Xmx1024m -Djava.library.path=$SCROLLVIEW_PATH -cp
$SCROLLVIEW_PATH/ScrollView.jar:$SCROLLVIEW_PATH/piccolo-1.2.jar:$SCROLLVIEW_PATH/piccolox-1.2.jar
com.google.scrollview.ScrollView
The output is:

java.net.BindException: Address already in use
    at java.net.PlainSocketImpl.socketBind(Native Method)
    at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:353)
    at java.net.ServerSocket.bind(ServerSocket.java:336)
    at java.net.ServerSocket.<init>(ServerSocket.java:202)
    at java.net.ServerSocket.<init>(ServerSocket.java:114)
    at com.google.scrollview.ScrollView.main(ScrollView.java:397)

Now if I try to input the command "tesseract temp.jpg output segdemo inter"
again, the window will not pop out any more.

I log out, then log in.

(2)
Now I have logged in again.
export SCROLLVIEW_PATH=/home/tesseract-ocr-3.02.02/java

Now, following the troubleshooting instructions, I removed ">/dev/null 2>&1" from
the svutil.cpp file.
It looks like this:

const char* cmd_template = "-c \"trap 'kill %1' 0 1 2 ; java "
"-Xms1024m -Xmx2048m -Djava.library.path=%s -cp %s/ScrollView.jar:"
"%s/piccolo-1.2.jar:%s/piccolox-1.2.jar"
" com.google.scrollview.ScrollView"
" & wait\"";

The output in the terminal window is:

Tesseract Open Source OCR Engine v3.02.02 with Leptonica
ScrollView: Waiting for server...

The viewerdebugging window pops out, but with no segmentation picture in it.
Same as the one in (1) (in attachment)

(3)
Have no idea what is wrong. Thanks in advance.

On Friday, November 16, 2012 2:37:26 PM UTC-6, zdenop wrote:

On Fri, Nov 16, 2012 at 3:05 AM, Linda Li wrote:

Can you post a screenshot showing the console with its messages + the
tesseract debugging window?

How can I solve the problem?

Did you try the unmodified source?

Re: problems with grayed background

2012-11-28 Thread Zdenko Podobný


On 28.11.2012 11:10, sascha4j wrote:

i have trouble ocr-ing an image like the one in the attachment.

only the word "text" is recognized.

i tried several binarization algorithms, but without success.

does it make sense to binarize the image? or does tesseract have its own
binarization?

Yes, it has.

any hints would be nice.

see e.g. http://www.sk-spell.sk.cx/through-tesseract-ocr-eye or search the
forum for words like thresholded, binary image etc.
  
greetings

sascha4j
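
For readers who want to experiment with binarization before handing an image to tesseract, here is a small sketch of Otsu's method, a common global thresholding technique. It is not claimed to be identical to what tesseract does internally; the image is assumed to be a flat list of 8-bit gray values, and the function names otsu_threshold and binarize are made up for this example.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximizes the between-class variance.

    pixels is an iterable of integers in 0..255. Values <= t form the dark
    class, values > t the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    dark_count, dark_sum = 0, 0
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        bright_count = total - dark_count
        if dark_count == 0:
            continue  # no dark pixels yet
        if bright_count == 0:
            break     # everything is in the dark class already
        mean_dark = dark_sum / dark_count
        mean_bright = (sum_all - dark_sum) / bright_count
        # Between-class variance, up to the constant factor 1 / total**2.
        score = dark_count * bright_count * (mean_dark - mean_bright) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(pixels, threshold):
    """Map gray values to 0 (black) or 255 (white) using the threshold."""
    return [255 if p > threshold else 0 for p in pixels]


if __name__ == "__main__":
    gray = [10] * 50 + [200] * 50  # synthetic two-level "image"
    t = otsu_threshold(gray)
    print(t, binarize(gray, t)[:3], binarize(gray, t)[-3:])
```

A global threshold like this tends to fail on uneven backgrounds such as gray gradients; in that case a locally adaptive method in the preprocessing step is usually the better choice.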
  
  
  




Re: cntraining-debug.exe problem in setup tesseract3.01 with visual studio 2010

2012-11-30 Thread Zdenko Podobný


On 30.11.2012 09:02, Iris wrote:

I followed the steps described at
http://code.google.com/p/tesseract-ocr/wiki/ReadMePre3
I first ran the Windows installer, then unpacked the source code, the win_vs
libraries and the language data file.
But when I compile the tesseract project under Visual Studio 2010, a window
pops up saying that cntraining-debug.exe can't be launched and that my application is
not configured correctly.
It's weird because I can build the project successfully.
I've been stuck here for several days and have tried many ways.
Can anyone help me?

Upgrade to a recent version.
As far as I remember there were several problems in 3.01 with the
vs2008/vs2010 build. vs2010 was considered experimental. That was the
reason to replace the vs2008/vs2010 build with a new vs2008 build (created from
scratch).



Thanks in advance.




Re: problem with LED-fonts recognition ;(

2012-12-06 Thread Zdenko Podobný


Thanks for correcting me - really  I did not remember that information.

--
Zdenko


On 05.12.2012 08:06, Speedy wrote:

Just check out Ray's October 2007 paper "An Overview of the Tesseract OCR
Engine" where it says:

The first step is
a connected component analysis in which outlines of
the components are stored. This was a computationally
expensive design decision at the time, but had a
significant advantage: by inspection of the nesting of
outlines, and the number of child and grandchild
outlines, it is simple to detect inverse text and
recognize it as easily as black-on-white text. Tesseract
was probably the first OCR engine able to handle
white-on-black text so trivially.

And in fact, in our own application after image preprocessing we pass the
binarized image as a white-on-black image to tesseract and never had
problems with that. Of course, our training images are also white-on-black,
so this might also affect our findings.

Marcus


On Tuesday, December 4, 2012 2:58:26 PM UTC+1, zdenop wrote:

Where did you find "advertised features of tesseract is that it works
equally well for black-on-white and white-on-black text"? I never heard
about it.
See forum for other experience:
https://groups.google.com/d/topic/tesseract-ocr/XoX6t5Ih1IM/discussion

--
Zdenko

On Tue, Dec 4, 2012 at 2:42 PM, Speedy wrote:

Why is a black background a problem? One of the advertised features of
tesseract is that it works equally well for black-on-white and
white-on-black text.

Marcus


On Tuesday, December 4, 2012 11:11:36 AM UTC+1, zdenop wrote:


Search the forum. I remember a discussion about a similar topic.
AFAIR: tesseract has a problem with letters (symbols) that consist of
several unconnected parts (e.g. dots, lines). The solution should be to
preprocess the image (blur).

Generally: a black background is a problem. The quality of the image is too low
(JPEG, quality: 75), there is no information about DPI... Anyway, this "LED"
font is not a standard font, so maybe training will be needed.

--
Zdenko

On Tue, Dec 4, 2012 at 12:43 AM, mike oldfield wrote:



Hello

I'd like to recognize LED-like numbers/digits.
I attached an image (jpg, 680x320, brightness 65%, contrast 100%).
Are there any libraries or presets to decode these digits? For example,
the Google Documents conversion and free-ocr.com don't work.






Re: Vector image input?

2012-12-14 Thread Zdenko Podobný

tesseract-ocr uses leptonica for image I/O. The list of supported input types
also depends on the leptonica configuration, e.g. if you did not compile jpeg
support into leptonica, jpeg will not be supported in tesseract-ocr. So
creating a list of supported types would be tricky.

For the possibly supported types you can check e.g. the leptonica source code [1].

[1] http://tpgit.github.com/Leptonica/imageio_8h_source.html#l00034

--
Zdenko

On 11.12.2012 15:36, thanatos thanatica wrote:


Unfortunately, I could not find a list of supported image input types
anywhere, so I just started to play with what I can produce. I tried SVG,
EPS, PDF, PS, and ODG, but all of them report as unsupported.
So the question remains: which vector type can I use as input? Or do I have
to convert to a pixel image first?
I would think that supplying vector images would greatly increase
accuracy...




Re: Training Tesseract for single digit

2013-01-08 Thread Zdenko Podobný


On 08.01.2013 17:13, sunitha raghurajan wrote:

I am using Tesseract to read license plates. Tesseract is giving the wrong
output for the digit six. My question is: can I train tesseract for the single
digit 'six'? Any help is truly appreciated.


Can you post an example image (with the digit 6) that you are trying to recognize?


Re: Segmentation fault while making box file from .tif image

2013-01-21 Thread Zdenko Podobný

https://code.google.com/p/tesseract-ocr/wiki/FAQ#actual_tessdata_num_entries_<=_TESSDATA_NUM_ENTRIES:Error:Ass 




On 20.01.2013 15:43, Firas almannaa wrote:

Hello,
I'm trying to make a .traineddata file for tesseract to recognize only
digits in multiple fonts.
First I wrote the text and made the .tif image. Then, when I try to make the
.box file from the .tif image with the command:

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

I get this error:

cygwin warning:
  MS-DOS style path detected: C:\Program Files (x86)\Tesseract-OCR\tessdata/configs/batch.nochop
  Preferred POSIX equivalent is: /cygdrive/c/Program Files (x86)/Tesseract-OCR/tessdata/configs/batch.nochop
  CYGWIN environment variable option "nodosfilewarning" turns off this warning.
  Consult the user's guide for more details about POSIX paths:
    http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
Segmentation fault (core dumped)

and if I try again I get:

actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES:Error:Assert failed:in file tessdatamanager.cpp, line 55
Segmentation fault (core dumped)

I'm using win8 and the cygwin cmd to do the work.





Re: Detecting Font size in android using Tesseract+Leptonica

2013-01-21 Thread Zdenko Podobný

See my test for font features[1]. It 
produces output for font size.


[1] http://pastebin.com/0dV84hBa

--
Zdenko

On 21.01.2013 14:33, Karthik Kannan wrote:

I'm making an Android app to perform OCR on text using the Tesseract and
Leptonica (for binarization and Otsu thresholding) libraries. So my question
is: can I detect, or at least differentiate between (larger/smaller), the
font sizes that the app reads?




Re: How to use OCR in C++ project

2013-03-02 Thread Zdenko Podobný


On 01.03.2013 21:46, Vicky Patil wrote:

Hi,
I would like to ask whether anyone here knows how to use tesseract in C++.
I need to do some character recognition but I do not know how to integrate
tesseract into my project. Should I be using the dll that comes with the
download, or should I import the whole tesseract project?
Can someone show me code snippets of how I can use tesseract?

What OS do you use? What compiler, IDE?
Did you try the (Windows) installer?
Did you try the example VC++ project?
Did you search the forum and the issues (there are snippets)?
What did you try???

Thanks and Regards,
Viky Patil




Re: How to get desire words coordinate from characters coordinate

2013-03-04 Thread Zdenko Podobný


Depending on your skills:

a) You can analyze space between boxes to identify words (if you want to 
use box file)
b) You can parse tesseract hocr output (if you have no clue what is 
hocr, search in this forum)
c) You can use C++/C API of tesseract to create your own output - have a 
look at hocr implementation.
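
Option a) could be sketched roughly as follows. The sketch assumes box-file lines of the form "<char> <left> <bottom> <right> <top> <page>", a single text line per input, and a hypothetical heuristic: a new word starts when the horizontal gap to the previous character is larger than half the median character width. The sample coordinates are made up. Note that in the box data quoted below the "e" ends at x=211 and the "l" starts at x=208, so a gap heuristic alone would not separate "the" from "london" there.

```python
from statistics import median


def parse_box_line(line):
    """Parse one box-file line: <char> <left> <bottom> <right> <top> <page>."""
    char, left, bottom, right, top, _page = line.split()
    return char, int(left), int(bottom), int(right), int(top)


def group_words(box_lines, gap_factor=0.5):
    """Group the characters of ONE text line into words.

    A new word starts when the gap to the previous character exceeds
    gap_factor * median character width (an assumed heuristic).
    Returns a list of (text, left, bottom, right, top) tuples.
    """
    boxes = sorted(
        (parse_box_line(line) for line in box_lines if line.strip()),
        key=lambda box: box[1],
    )
    if not boxes:
        return []
    threshold = gap_factor * median(box[3] - box[1] for box in boxes)

    words = []
    for char, left, bottom, right, top in boxes:
        if words and left - words[-1][3] <= threshold:
            # Close enough to the previous character: extend the current word.
            text, w_left, w_bottom, w_right, w_top = words[-1]
            words[-1] = (text + char, w_left, min(w_bottom, bottom),
                         max(w_right, right), max(w_top, top))
        else:
            words.append((char, left, bottom, right, top))
    return words


# Made-up coordinates with a clear gap between "the" and "lon".
SAMPLE_BOXES = [
    "t 45 16 90 91 0",
    "h 94 16 151 102 0",
    "e 155 16 211 79 0",
    "l 258 16 288 102 0",
    "o 291 16 354 79 0",
    "n 358 16 416 79 0",
]

if __name__ == "__main__":
    for word in group_words(SAMPLE_BOXES):
        print(word)
```

For multi-line pages the boxes would first have to be split into text lines (e.g. by their vertical overlap), which is one reason why parsing the hocr output (option b) is often less work.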


--
Zdenko

On 04.03.2013 12:21, SUBHADIP SINHA wrote:

Please help me if anybody know the solution !!!

THANK YOU ..

On Sunday, March 3, 2013 12:32:32 PM UTC+5:30, SUBHADIP SINHA wrote:

Hi,ALL

I finally got the .box file with all character coordinates from the .png
file. Now I want to group the characters from the .box file into words
and I need the word coordinates.

I am using tesseract 3.02 on a Windows machine.
I ran the command "tesseract image.png image batch.nochop makebox" on the
image.png file and
got the result below in the box file:

t 45 16 90 91 0
h 94 16 151 102 0
e 155 16 211 79 0
l 208 16 238 102 0
o 241 16 304 79 0
n 308 16 366 79 0
d 369 16 430 102 0
o 433 16 496 79 0
n 500 16 557 79 0

From the above box file I want to find the coordinates of the two words,
which are "the" and "london".

I have not configured any other files in the tesseract folder.
Please help me with the steps I need to run after the steps I have already done.

THANK YOU.




Re: Tesseract Variable - tosp_min_sane_kn_sp

2013-05-12 Thread Zdenko Podobný


On 10.05.2013 17:51, newbie wrote:

Hello all,

Could you please explain to me how to use the variable "tosp_min_sane_kn_sp"
and what its value means?

Please advise.
Thanks.

Can you please explain to me why you want to use something when you have no
clue what it is (and you don't even know how to use it)?


---
Zdenko


Re: [tesseract-ocr] Empty page result. Bug?

2016-04-21 Thread Zdenko Podobný

Please read the wiki
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method

Zdenko

On Wed, Apr 20, 2016 at 10:37 PM, S.J. Becker wrote:

>
> I've attached two files.
>
> The first file is my original one. It returns empty page (with
> eng.traineddata).
>
> I noticed that there was no margin at the top and little at the bottom.
> So I used gimp to add about 4 pixels at the top and bottom. The result
> is the second attached file.
>
> This ocred properly.
>
> Command line:
> tesseract -c tessedit_create_tsv=1 tess_1_1b.tif tess
>
> Output:
> level  page_num  block_num  par_num  line_num  word_num  left  top  width  height  conf  text
> 1      1         0          0        0         0         0     0    336    110     -1    <>
> 2      1         1          0        0         0         28    7    270    98      -1    <>
> 3      1         1          1        0         0         28    7    270    98      -1    <>
> 4      1         1          1        1         0         28    7    270    98      -1    <>
> 5      1         1          1        1         1         28    7    270    98      91    A1.01
>
>
> A1.01  with a confidence of 91
>
> Should I file a bug? Or always pad my images with whitespace?
>
> thanks
>

Re: [tesseract-ocr] Empty page result. Bug?

2016-04-21 Thread Zdenko Podobný

On Thu, Apr 21, 2016 at 8:45 AM, ShreeDevi Kumar wrote:

> Please file an issue on GitHub repo with these files so that it can be
> looked at by the developers.
>

Why? To waste their time??? E.g. the presented command does not work ('tesseract
-c tessedit_create_tsv=1 tess_1_1b.tif tess'). If you really want to help
the user, then point him/her to the correct wiki page (using the correct psm).

> However, for your app, add the whitespace margin to your images as part of
> preprocessing, since any fix may take a while.
>
> - sent from my phone. excuse the brevity.
> On 21-Apr-2016 11:49 am, "S.J. Becker"  wrote:
>
>>
>> I've attached two files.
>>
>> The first file is my original one. It returns empty page (with
>> eng.traineddata).
>>
>> I noticed that there was no margin at the top and little at the bottom.
>> So I used gimp to add about 4 pixels at the top and bottom. The result
>> is the second attached file.
>>
>> This ocred properly.
>>
>> Command line:
>> tesseract -c tessedit_create_tsv=1 tess_1_1b.tif tess
>>
>> Output:
>> level  page_num  block_num  par_num  line_num  word_num  left  top  width  height  conf  text
>> 1      1         0          0        0         0         0     0    336    110     -1    <>
>> 2      1         1          0        0         0         28    7    270    98      -1    <>
>> 3      1         1          1        0         0         28    7    270    98      -1    <>
>> 4      1         1          1        1         0         28    7    270    98      -1    <>
>> 5      1         1          1        1         1         28    7    270    98      91    A1.01
>>
>>
>> A1.01  with a confidence of 91
>>
>> Should I file a bug? Or always pad my images with whitespace?
>>
>> thanks
>>

Re: [tesseract-ocr] Empty page result. Bug?

2016-04-23 Thread Zdenko Podobný

On Thu, Apr 21, 2016 at 11:53 PM, S.J. Becker wrote:

>
> This page only shows the same list I've seen many times before without
> any explanation:
>
> What does it mean when it says "script detection"?
>

See https://en.wikipedia.org/wiki/List_of_writing_systems and
https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/osdetect.cpp#L50


> I tried OSD and it did not automatically correct incorrect rotation (90
> degrees off)
>

detection = correction???



> I think I understand what "Automatic page segmentation" may mean but with
> / without OSD?
> Kinda need a full explanation.
>

OSD = Orientation and script detection

>
>
> "vertically aligned text"???
>
> I guess I'll try #4: "Assume a single column of text of variable sizes"
> That best describes what I have but the default seemed to work
> in limited testing of my one and two liners.
>
> The wiki also has a waybackmachine link to a bug saying that adding
> whitespace helps. (Is that a current bug?, etc.)
>
It is not a bug, it is a feature: if you use the correct psm and still
cannot get a correct result, the problem may be that the image does not
have a sufficient border.
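
To illustrate the border advice, here is a minimal sketch in plain Python
that pads a grayscale image, held as a list of pixel rows, with white pixels
on every side. The function name add_border, the 10-pixel default and the
list-of-rows representation are assumptions made for illustration only; in
practice an image tool would apply the same padding to the real file before
it is passed to tesseract.

```python
def add_border(rows, width=10, fill=255):
    """Return a copy of a grayscale image padded with `fill` on all sides.

    `rows` is a list of equal-length lists of 0..255 values. This in-memory
    representation is hypothetical and chosen to keep the sketch free of
    third-party dependencies.
    """
    if not rows:
        return []
    new_width = len(rows[0]) + 2 * width
    # Top margin: `width` rows consisting only of the fill value.
    padded = [[fill] * new_width for _ in range(width)]
    # Original rows, each with a left and right margin.
    for row in rows:
        padded.append([fill] * width + list(row) + [fill] * width)
    # Bottom margin.
    padded.extend([fill] * new_width for _ in range(width))
    return padded
```

How wide the border needs to be is something to find by experiment; the
thread does not give a number.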


> thanks
> scott
>
>
> On Thursday, April 21, 2016 at 4:21:47 AM UTC-7, zdenop wrote:
>>
>> Please read the wiki
>> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method
>>
>> Zdenko
>>
>>

Re: [tesseract-ocr] Empty page result. Bug?

2016-04-23 Thread Zdenko Podobný

On Fri, Apr 22, 2016 at 12:27 AM, S.J. Becker 
wrote:

>
> I just did more testing.
>
> My one word or single character image works with
> -psm 7
> -psm 8
>
> my two or three lines of text image works with the default of
> -psm 3
> as well as
> -psm 4
>
> They both seem to work with
> -psm 6
>
> I may have to go with 6 even though my three line test with different
> font sizes should be done with 4 based on its description.
>
> I feel it's a bug that 3 and 4 can't reliably handle simpler content.
> To get the most out of Tesseract, I must analyze the segmentation?!
>

Why analyze? Don't you know in advance whether you are asking to OCR a page
or just a paragraph, a line or a word?
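
Since the caller usually knows what kind of content it is sending, it can
pick the page segmentation mode up front. A minimal sketch, assuming the
mode numbers discussed in this thread (3, 4, 6, 7, 8) and the "-psm"
spelling of the 3.0x command line (newer releases spell it "--psm"). The
PSM_FOR table and the build_command helper are hypothetical names, not part
of tesseract.

```python
# Page segmentation modes keyed by what the caller already knows
# about the input (labels on the left are made up for this sketch).
PSM_FOR = {
    "page": 3,    # fully automatic page segmentation (the default)
    "column": 4,  # single column of text of variable sizes
    "block": 6,   # single uniform block of text
    "line": 7,    # single text line
    "word": 8,    # single word
}


def build_command(image, out_base, kind, flag="-psm"):
    """Build a tesseract argument list for a known content kind."""
    if kind not in PSM_FOR:
        raise ValueError("unknown content kind: %r" % (kind,))
    return ["tesseract", image, out_base, flag, str(PSM_FOR[kind])]
```

The resulting list can be handed to subprocess.call, for example
subprocess.call(build_command("word.png", "out", "word")).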

>
> That is why I had to go through the trouble of compiling leptonica;
> so that tesseract is smart enough that I don't have to re-invent the wheel.
>

Tesseract uses leptonica as a dependency so that it does not need to
re-invent the wheel.

>
> It seems that it's failing at the segmentation stage. If it finds nothing
> it could try again automatically with a more primitive setting. That is
> way more efficient than my process spawning tesseract twice as often.
>
> thanks
> scott
>
> On Thursday, April 21, 2016 at 4:21:47 AM UTC-7, zdenop wrote:
>>
>> Please read the wiki
>> https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method
>>
>> Zdenko
>>
>>

Re: [tesseract-ocr] Error when running tesseract in arabic

2016-05-22 Thread Zdenko Podobný

This is not error output. It is a leptonica warning that it cannot process
the given image format in memory and has to use a temp file instead
(AFAIK this is platform dependent).
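
A script that captures tesseract's stderr can keep such warnings apart from
real errors instead of treating any output as a failure. A minimal sketch,
assuming the "Warning in <function>:" prefix seen in this thread and an
analogous "Error in <function>:" prefix for errors (the latter is an
assumption here); split_messages is a hypothetical helper name.

```python
def split_messages(stderr_text):
    """Separate leptonica-style warnings from errors in captured stderr."""
    warnings, errors = [], []
    for line in stderr_text.splitlines():
        line = line.strip()
        if line.startswith("Warning in "):
            warnings.append(line)
        elif line.startswith("Error in "):
            errors.append(line)
    return warnings, errors
```

Lines with neither prefix are ignored by this sketch; checking the exit
status of the process as well is the safer test for failure.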

Zdenko

On Sun, May 22, 2016 at 8:33 AM,  wrote:

> Hey guys,
> I'm running this command:
> tesseract photo.jpeg out -l ara
>
> And the error output is:
> Warning in pixReadMemJpeg: work-around: writing to a temp file
>
> My os is osx 10.11.
>
> Any ideas?
>
> Thanks
>

Re: [tesseract-ocr] Error when running tesseract in arabic

2016-05-22 Thread Zdenko Podobný

Without the input image nobody can help you.

Zdenko

On Mon, May 23, 2016 at 8:09 AM,  wrote:

> So... Do you have any idea why the output file is empty then?
>
> On Sunday, May 22, 2016 at 3:17:10 PM UTC+3, zdenop wrote:
>>
>> This is not error output. It is a leptonica warning that it cannot
>> process the given image format in memory and has to use a temp file
>> instead (AFAIK this is platform dependent).
>>
>> Zdenko
>>
>> On Sun, May 22, 2016 at 8:33 AM,  wrote:
>>
>>> Hey guys,
>>> I'm running this command:
>>> tesseract photo.jpeg out -l ara
>>>
>>> And the error output is:
>>> Warning in pixReadMemJpeg: work-around: writing to a temp file
>>>
>>> My os is osx 10.11.
>>>
>>> Any ideas?
>>>
>>> Thanks
>>>

Re: [tesseract-ocr] configure: error: leptonica library missing -- FAQ is not working

2016-05-24 Thread Zdenko Podobný

Have a look at config.log for the reason (search for "pixCreate").

Zdenko

On Tue, May 24, 2016 at 10:51 AM, Dennis Park  wrote:

> faq says:
>   CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib"
> ./configure
> would work, but i still got the problem:
> ..
> checking for mbstate_t... yes
> checking for leptonica... yes
> checking for pixCreate in -llept... no
> configure: error: leptonica library missing
>
> my leptonica header files are present in /usr/local/include/leptonica and
> .so files (like liblept.so.5.0.0) are present in /usr/local/lib.
> I'm using tesseract-3.04.
>
> Does anyone have any clue?
> Thanks in advance.
>
> Dennis.
>

Re: [tesseract-ocr] configure: error: leptonica library missing -- FAQ is not working

2016-05-24 Thread Zdenko Podobný

Those lines are just the summary of the test. The details are in the lines
above them in config.log:

configure:16988: checking for leptonica
configure:17007: result: yes
configure:17009: checking for pixCreate in -llept
configure:17034: g++ -o conftest -g -O2 -I/home/work/.jumbo/include
-I/home/work/.jumbo/include/leptonica -L/home/work/.jumbo/lib conftest.cpp
-llept  -lpthread  >&5
/home/work/.jumbo/lib/liblept.so: undefined reference to `TIFFCleanup'
collect2: ld returned 1 exit status


I.e. your leptonica build is broken: liblept.so refers to TIFFCleanup, which
the linker cannot resolve. You are not able to use that leptonica library
for anything.

Also, you mentioned that you are using CPPFLAGS="-I/usr/local/include"
LDFLAGS="-L/usr/local/lib", but g++ shows something else:

configure:2709: g++  -I/home/work/.jumbo/include -L/home/work/.jumbo/lib
conftest.cpp
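
The config.log lookup described above can also be scripted. A minimal
sketch, assuming config.log has the layout of the excerpt in this message
(a "checking for <symbol>" line followed by the compiler command and its
messages); lines_after_check and linker_errors are hypothetical helper
names, and the 6-line context window is an arbitrary choice.

```python
def lines_after_check(log_text, symbol, context=6):
    """Return up to `context` lines following the 'checking for <symbol>' entry."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if "checking for " + symbol in line:
            return lines[i + 1 : i + 1 + context]
    return []  # the check was not found in the log


def linker_errors(lines):
    """Keep only the lines that report an unresolved symbol."""
    return [line for line in lines if "undefined reference" in line]
```

Usage would be along the lines of
linker_errors(lines_after_check(open("config.log").read(), "pixCreate")).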



Zdenko

On Tue, May 24, 2016 at 1:40 PM, Dennis Park  wrote:

> 1689:configure:17009: checking for pixCreate in -llept
> 1742:| char pixCreate ();
> 1746:| return pixCreate ();
> 1812:ac_cv_lib_lept_pixCreate=no
>
>
> any clue?
>
> Thanks
> Dennis
>
>
>
> On Tuesday, May 24, 2016 at 7:13:32 PM UTC+8, zdenop wrote:
>>
>> have a look at config.log for reason (search for "pixCreate")
>>
>> Zdenko
>>
>> On Tue, May 24, 2016 at 10:51 AM, Dennis Park  wrote:
>>
>>> faq says:
>>>   CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib"
>>> ./configure
>>> would work, but i still got the problem:
>>> ..
>>> checking for mbstate_t... yes
>>> checking for leptonica... yes
>>> checking for pixCreate in -llept... no
>>> configure: error: leptonica library missing
>>>
>>> my leptonica header files are present in /usr/local/include/leptonica
>>> and .so files (like liblept.so.5.0.0) are present in /usr/local/lib.
>>> I'm using tesseract-3.04.
>>>
>>> Does anyone have any clue?
>>> Thanks in advance.
>>>
>>> Dennis.
>>>

Re: [tesseract-ocr] Getting a blank tessinput.tif file

2016-06-06 Thread Zdenko Podobný

Your leptonica build supports only a limited number of image formats (the
tesseract -v output below lists libjpeg, libpng and zlib, but no libtiff).
What image format are you trying to process?
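
One way to check which image libraries a build reports is to parse the
"tesseract -v" output. A minimal sketch, assuming the library line uses
" : " as separator as in the version output quoted in this thread;
image_libraries is a hypothetical helper name. A missing libtiff entry
would be consistent with an empty tessinput.tif, but that reading is an
inference, not something the tool states.

```python
def image_libraries(version_text):
    """Extract library names from the ' : '-separated line of `tesseract -v`.

    Limitation of this sketch: a build linked against a single library has
    no ' : ' separator on that line and is not detected.
    """
    names = set()
    for line in version_text.splitlines():
        if " : " in line:
            for part in line.split(" : "):
                tokens = part.split()
                if tokens:
                    names.add(tokens[0])  # first word is the library name
    return names
```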

Zdenko

On Mon, Jun 6, 2016 at 1:08 PM, Ashish Goel  wrote:

> Hello All,
>
> I am trying to do OCR on a bunch of images. Getting some failures, and I
> want to analyse them.
> So, to do that, I am trying to get the tessinput.tif file so that I can
> find out what input actually goes to tesseract.
>
> I am passing "-c tessedit_write_images 1" along with my tesseract to
> generate the tessinput.tif file.
> Tesseract does generate the tessinput file, but the file is blank (0
> bytes)
>
> Did I do anything wrong?
> I downloaded tesseract 3.14 and leptonica 1.73 and compiled both.
>
> Version as reported by tesseract -v are:
>
> tesseract 3.04.00
>  leptonica-1.73
>   libjpeg 8b (libjpeg-turbo 1.2.0) : libpng 1.2.46 : zlib 1.2.3.4
>
>
> Any help will be greatly appreciated...
>
> Regards,
> Ashish
>
