[tesseract-ocr] How to detect page orientation correctly

2019-07-04 Thread hrishikesh kaulwar
I have scanned documents in which a few pages were scanned in the wrong orientation (90, 180, or 270 degrees). But the --psm 0 flag on tesseract to give orientation, OpenCV Hough lines, and OpenCV bounding boxes are not working. Could anyone please suggest a method to detect orientation correctly and rotate to make
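For reference, a minimal Python sketch of how the report printed by `tesseract <image> stdout --psm 0` could be turned into a rotation decision. It assumes the Tesseract 4 OSD report contains `Rotate:` and `Orientation confidence:` lines. The sample text, the helper names, and the `min_confidence` threshold are illustrative assumptions, not values from this thread.

```python
import re

# Sample OSD report text (made-up numbers, layout as assumed for Tesseract 4).
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 5.43
Script: Latin
Script confidence: 1.11
"""


def parse_osd(osd_text):
    """Return (rotate_degrees, orientation_confidence) from an OSD report."""
    rotate = re.search(r"^Rotate:\s*(\d+)", osd_text, re.MULTILINE)
    conf = re.search(r"^Orientation confidence:\s*([\d.]+)", osd_text, re.MULTILINE)
    if rotate is None or conf is None:
        raise ValueError("text does not look like an OSD report")
    return int(rotate.group(1)), float(conf.group(1))


def rotation_to_apply(osd_text, min_confidence=2.0):
    """Return the rotation in degrees, or None if the confidence is too low.

    min_confidence is a hypothetical threshold; tune it on your own scans.
    """
    angle, confidence = parse_osd(osd_text)
    if confidence < min_confidence:
        return None  # do not trust the estimate; leave the page unrotated
    return angle % 360
```

Pages where `rotation_to_apply` returns None can be set aside for manual review or a second method instead of being rotated on a weak estimate.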

Re: [tesseract-ocr] Re: setting user-words in api?

2019-07-04 Thread Shree Devi Kumar
I have made a wiki page for using user_patterns with the API. Please see https://github.com/tesseract-ocr/tesseract/wiki/APIExample-user_patterns You can try something similar for user_words. On Thu, Jul 4, 2019 at 4:40 PM Jochen Naumann wrote: > user_words_file also does not work, the file is not loaded

[tesseract-ocr] Joined and BROKEN symbols explanation in xxx.unicharset file

2019-07-04 Thread Abstract
Can anyone explain the Joined and BROKEN symbols in the autogenerated xxx.unicharset files? When training starts, the Joined symbol appears in the output log for a short time, then disappears. But after training has finished, it (Joined) sometimes appears even in the recognized outp

Re: [tesseract-ocr] Re: Recognized characters got multiplicated

2019-07-04 Thread Abstract
Also, the results change somewhat depending on the recognition mode. Everything said so far was for PSM_SINGLE_CHAR mode. libtesseract-4.dll has a bug in this mode; at least, it produces some debug info that should not appear. After I changed to PSM_SINGLE_LINE, the coordinates returned are much better.

Re: [tesseract-ocr] Re: Recognized characters got multiplicated

2019-07-04 Thread Abstract
Thanks, but as far as I can see the problem has been open since 2017, and there is no clear solution. Now I tried to get the recognition result via the iterator API, and that's a really strange thing. All the characters are listed, and those that are "duplicates" share the same coordinates as the correct ones, but h

[tesseract-ocr] src snippet question

2019-07-04 Thread JB Data31
Hi, while building *tesseract* for Android, I have a question about a src snippet. In the file *src/ccutil/fileio.cpp* there's a method *DeleteMatchingFiles()*. Scanning the src, I only find this method in *src/training/pango_font_info.cpp*, in *HardInitFontConfig()*. Is there a way to execute a bin, as *

Re: [tesseract-ocr] Re: Recognized characters got multiplicated

2019-07-04 Thread Shree Devi Kumar
This is an open issue - see https://github.com/tesseract-ocr/tesseract/issues/1060 and other related issues On Thu, Jul 4, 2019 at 5:33 PM Abstract wrote: > Some more information on my trained data: > real data:12345678903542331100244117021234567 > recognized: 1234567890354233141110024411702

[tesseract-ocr] Re: Recognized characters got multiplicated

2019-07-04 Thread Abstract
Some more information on my trained data: real data: 12345678903542331100244117021234567 recognized: 12345678903542331411100244117021234567 (note that instead of 11, several characters, 14111, were reported; in this case it does not like the character "4") another pair real/recognized: 2345678905423342392200712

Re: [tesseract-ocr] tesseract bug in windows

2019-07-04 Thread _ Flaviu
Thank you, I have filed an issue there ... On Wednesday, July 3, 2019 at 6:27:06 PM UTC+3, shree wrote: > > Bugs are to be reported on GitHub under issues. If it is specific to Windows > and uses prebuilt binaries, please report it in the repo of the source. > > On Wed, 3 Jul 2019, 20:26 _ Flaviu, > >

Re: [tesseract-ocr] Re: setting user-words in api?

2019-07-04 Thread Jochen Naumann
user_words_file also does not work; the file is not loaded (checked with a file monitor). On Wed, Jul 3, 2019 at 8:31 PM, Zdenko Podobny wrote: > If the command line works for you, then the easiest way is to follow the tesseract > executable code[1]: > IMO you need to use the variable user_words_file; AFA
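For comparison with the API route discussed here, a minimal Python sketch of the command-line route. It assumes the Tesseract 4 `--user-words` and `--user-patterns` options. The helper name and the file paths are hypothetical, and the sketch only builds the argument list; it does not show whether the word list is actually honored by a given engine mode.

```python
def build_tesseract_cmd(image, output_base, user_words=None,
                        user_patterns=None, lang="eng"):
    """Build an argv list for the tesseract CLI (hypothetical helper)."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    if user_words:
        # Path to a plain-text file with one word per line.
        cmd += ["--user-words", user_words]
    if user_patterns:
        # Path to a plain-text file with one pattern per line.
        cmd += ["--user-patterns", user_patterns]
    return cmd


# Example: the list can be passed to subprocess.run(cmd, check=True).
cmd = build_tesseract_cmd("page.png", "page_out", user_words="my.user-words")
```

Once the command line behaves as expected, the same options can be traced through the tesseract executable code to find the equivalent API calls.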

Re: [tesseract-ocr] Choice Iterator only shows one choice for each character

2019-07-04 Thread shree
See related discussion at https://github.com/tesseract-ocr/tesseract/issues/2532 On Monday, July 1, 2019 at 3:51:15 PM UTC+5:30, Jochen Naumann wrote: > > Thanks, this seems to be what I need. But how do I set this > lstm_choice_mode with the api? > > Am Montag, 1. Juli 2019 11:55:02 UTC+2 schri