Re: Format of blacklist string

2011-03-30 Thread Dmitri Silaev
Actually, there's an issue already on this point: http://code.google.com/p/tesseract-ocr/issues/detail?id=455&sort=-id I don't see any progress on it, though. Warm regards, Dmitri Silaev On Thu, Mar 31, 2011 at 7:55 AM, patrickq wrote: > Upon further experimentation I think I found out that t

Re: Re: Re: tesseract improve the reject rate ?

2011-03-30 Thread Dmitri Silaev
If you're going to elaborate on this issue, it would be great if you share your findings with the community. This topic might be of interest not only for newbies but for experienced users too. Dmitri -- You received this message because you are subscribed to the Google Groups "tesseract-ocr"

Re: tips for improving Tesseract accuracy and speed...

2011-03-30 Thread Dmitri Silaev
Could you give us a link to where the text of this article can be downloaded from? Can't find it anywhere, only the title and authors. On Thu, Mar 31, 2011 at 6:09 AM, Cong Nguyen wrote: > Please refer to "OPTIMIZING SPEED FOR ADAPTIVE LOCAL THRESHOLDING ALGORITHM > USING DYNAMIC PROGRAMMING". >

Re: Re: Re: tesseract improve the reject rate ?

2011-03-30 Thread Dmitri Silaev
Liuguanqiang, Well, now I guess I understand what you want. You have a text consisting of arbitrary characters: digits and Chinese letters. Your goal is to find in this text a particular fragment, knowing it can consist of the digits 5678 only. Can you confirm? If so, the first thing to do is to set the

Re: Format of blacklist string

2011-03-30 Thread patrickq
Upon further experimentation I think I found out that the whole whitelist is rendered irrelevant whenever a character in the blacklist is NOT in the training set ... this is crazy of course but it appears to be the case, as if the code handling this list decides to stop processing the list if one of

Re: Re: Re: tesseract improve the reject rate ?

2011-03-30 Thread Saurabh Gandhi
That's simple: use "0123456789" as the whitelist and then write code on top of it to convert the unwanted numbers to null. Your code can handle this instead of tesseract. -- Regards, Saurabh Gandhi 2011/3/31 liuguanqiang > For example, I use the eng.traineddata(setwhitelist to "0123456
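The post-filtering step suggested above can be sketched in a few lines of standard C++. This is only an illustration: `keepAllowed` is a hypothetical helper name, not part of Tesseract, and it assumes "convert the unwanted numbers to null" means dropping them from the recognized string.

```cpp
#include <string>

// Hypothetical helper (not a Tesseract API): keep only the characters of
// `recognized` that also appear in `allowed`, preserving their order.
std::string keepAllowed(const std::string& recognized,
                        const std::string& allowed) {
    std::string out;
    for (char c : recognized) {
        if (allowed.find(c) != std::string::npos) {
            out.push_back(c);  // character is permitted, keep it
        }
    }
    return out;
}
```

With this approach the engine keeps the full digit whitelist, so it is not forced to map every glyph onto "5678"; the restriction is applied afterwards, for example `keepAllowed(ocrText, "5678")`.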

Re: Re: Re: tesseract improve the reject rate ?

2011-03-30 Thread liuguanqiang
For example, I use the eng.traineddata (setwhitelist to "0123456789") to recognize the digits in the following picture. Tesseract outputs the correct result: "24013091". Now, I know there are only "5678" in the input image, so I set the whitelist to "5678". On the above image, the tesseract

Format of blacklist string

2011-03-30 Thread patrickq
I am trying to provide a blacklist with UTF8 characters specified using their byte codes, as follows: // U+FB00 ff ef ac 80 LATIN SMALL LIGATURE FF // U+FB01 fi ef ac 81 LATIN SMALL LIGATURE FI myTess->SetVariable("tessedit_char_blackli
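For reference, the byte codes quoted above (ef ac 80 and ef ac 81) follow directly from the UTF-8 encoding rules. The sketch below derives them in plain C++; `encodeUtf8` and `ligatureBytes` are hypothetical helper names, and nothing here says how Tesseract itself interprets the resulting string, which is the open question in this thread.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes a valid scalar value (no check for surrogates or values > U+10FFFF).
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// The two ligatures from the message, concatenated into one byte string.
std::string ligatureBytes() {
    return encodeUtf8(0xFB00) + encodeUtf8(0xFB01);
}
```

Building the string from code points rather than typing escape sequences by hand makes it easier to rule out a byte-level mistake when a list appears to be ignored.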

RE: tips for improving Tesseract accuracy and speed...

2011-03-30 Thread Cong Nguyen
Please refer to "OPTIMIZING SPEED FOR ADAPTIVE LOCAL THRESHOLDING ALGORITHM USING DYNAMIC PROGRAMMING". Complexity is: O(n), n is number of pixels. -Original Message- From: tesseract-ocr@googlegroups.com [mailto:tesseract-ocr@googlegroups.com] On Behalf Of Max Cantor Sent: Thursday, March
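The text of the cited article is not available in this thread, so the following is not that article's algorithm. It is a minimal sketch of one well-known way to get local-mean thresholding in time linear in the number of pixels: precompute a summed-area table (integral image) so that each window sum costs four lookups. The function name `adaptiveThreshold` and the `radius` and `offset` parameters are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Binarize a row-major grayscale image. A pixel becomes 255 if it is at
// least as bright as (mean of its window - offset), otherwise 0.
// The window is (2*radius+1) x (2*radius+1), clipped at the image borders.
std::vector<std::uint8_t> adaptiveThreshold(const std::vector<std::uint8_t>& gray,
                                            int width, int height,
                                            int radius, int offset) {
    const int stride = width + 1;
    // integral[y*stride + x] = sum of all pixels in rows < y and columns < x.
    std::vector<std::int64_t> integral(
        static_cast<std::size_t>(stride) * (height + 1), 0);
    for (int y = 0; y < height; ++y) {
        std::int64_t rowSum = 0;
        for (int x = 0; x < width; ++x) {
            rowSum += gray[static_cast<std::size_t>(y) * width + x];
            integral[static_cast<std::size_t>(y + 1) * stride + (x + 1)] =
                integral[static_cast<std::size_t>(y) * stride + (x + 1)] + rowSum;
        }
    }

    std::vector<std::uint8_t> out(gray.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const int x0 = std::max(0, x - radius);
            const int y0 = std::max(0, y - radius);
            const int x1 = std::min(width, x + radius + 1);
            const int y1 = std::min(height, y + radius + 1);
            const std::int64_t area =
                static_cast<std::int64_t>(x1 - x0) * (y1 - y0);
            const std::int64_t sum =
                integral[static_cast<std::size_t>(y1) * stride + x1] -
                integral[static_cast<std::size_t>(y0) * stride + x1] -
                integral[static_cast<std::size_t>(y1) * stride + x0] +
                integral[static_cast<std::size_t>(y0) * stride + x0];
            // Compare pixel >= sum/area - offset without dividing.
            const std::int64_t pixel = gray[static_cast<std::size_t>(y) * width + x];
            out[static_cast<std::size_t>(y) * width + x] =
                (pixel * area >= sum - offset * area) ? 255 : 0;
        }
    }
    return out;
}
```

Both passes touch each pixel a constant number of times, so the cost does not grow with the window size. Sauvola-style thresholds additionally need a local variance, which can be obtained the same way from a second table of squared values.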

Re: tips for improving Tesseract accuracy and speed...

2011-03-30 Thread Max Cantor
Yes. I've had great experience with sauvola binarize from leptonica. Gamer works too but is much much slower. On Mar 31, 2011, at 0:02, cong nguyenba wrote: > I have another approach for you here: try to apply binarization using > adaptive threshold! Delving into engine by following adaptive >

Re: disable newline in table layout recognition

2011-03-30 Thread cong nguyenba
Page layout analysis in the Tesseract engine may not be robust enough! You can find more approaches from the ICDAR conference! On Wednesday, March 30, 2011, Dmitri Silaev wrote: > The -psm command line arg does work. In rev580. > But still an issue in rev549. > > So the easiest way for you, Patrick, is to checko

Re: tips for improving Tesseract accuracy and speed...

2011-03-30 Thread cong nguyenba
I have another approach for you here: try applying binarization with an adaptive threshold! For a speedup, delve into the engine by following the adaptive classification in the source code. I think that should be enough for what you expect! On Wednesday, March 30, 2011, Dmitri Silaev wrote: > P.S.: If you're still sur

Re: disable newline in table layout recognition

2011-03-30 Thread Dmitri Silaev
The -psm command line arg does work. In rev580. But still an issue in rev549. So the easiest way for you, Patrick, is to checkout the latest revision... Regards, Dmitri

Re: tips for improving Tesseract accuracy and speed...

2011-03-30 Thread Dmitri Silaev
P.S.: If you're still sure that reasonable downscaling of your images sacrifices the accuracy, please share one or two of your *unprocessed* images to investigate further. And I'd suggest to keep up with the latest revisions of Tesseract. The API changes significantly, but Tess is definitely being

Re: tips for improving Tesseract accuracy and speed...

2011-03-30 Thread Dmitri Silaev
Depending on the quality of your source images, I think it'd be reasonable to scale them down so that letters have a height of 40 pixels or so. That way Tesseract will just have to do a bit less work - scan fewer pixels and construct shorter glyph outlines. The accuracy may suffer ev
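The downscaling suggestion comes down to simple arithmetic, sketched below. The names `scaleForLetterHeight`, `scaledSize` and `Size` are hypothetical; the measured letter height is assumed to come from your own inspection of the image, and the 40-pixel default is taken from the message above.

```cpp
#include <cmath>

struct Size {
    int width;
    int height;
};

// Scale factor that brings letters of the measured height down to the target
// height. Returns 1.0 when the letters are already at or below the target,
// so images are never upscaled by this helper.
double scaleForLetterHeight(double measuredLetterHeightPx,
                            double targetLetterHeightPx = 40.0) {
    if (measuredLetterHeightPx <= targetLetterHeightPx) {
        return 1.0;
    }
    return targetLetterHeightPx / measuredLetterHeightPx;
}

// Image dimensions after applying the scale factor, rounded to whole pixels.
Size scaledSize(Size original, double scale) {
    return Size{static_cast<int>(std::lround(original.width * scale)),
                static_cast<int>(std::lround(original.height * scale))};
}
```

For example, letters measured at 80 pixels give a factor of 0.5, so a 2000 x 1000 scan would be resized to 1000 x 500 before recognition. The resize itself is left to whatever imaging library is in use.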

Re: disable newline in table layout recognition

2011-03-30 Thread zdenko podobny
On Wed, Mar 30, 2011 at 8:55 AM, Max Cantor wrote: > I had a similar issue. I couldn't get the config to work but basically > added this line to my code and it worked: > >api.SetPageSegMode(tesseract::PSM_SINGLE_COLUMN); > > For some reason, the tesseract binary doesn't pick up the config, b

Re: Problem with Tesseract 3.00

2011-03-30 Thread zdenko podobny
Hi, unfortunately some fixes regarding the Windows build were committed after releasing the 3.00 version (=revision 498). I thought about a 3.00.1 release (=revision 525) and as a "temporary solution" I created a 3.00.1 tesseract.exe (somebody asked for it). Then I changed my mind because it looks like developers

Re: disable newline in table layout recognition

2011-03-30 Thread Max Cantor
I had a similar issue. I couldn't get the config to work but basically added this line to my code and it worked: api.SetPageSegMode(tesseract::PSM_SINGLE_COLUMN); For some reason, the tesseract binary doesn't pick up the config, but I copied the binary source and added that. Max On Mar

Problem with Tesseract 3.00

2011-03-30 Thread mohamed amine
Hello, I have some problems and many questions, and I hope you will have answers: 1) when loading the whole project, "combine_tessadata" did not load with the 17 projects: is this a problem that causes a problem when generating tesseract.exe? 2) Should I execute tesseract-3.00.1.exe to have the rig