[tesseract-ocr] Integrate Tesseract 4 Source into Xcode Project (C++)

2019-02-01 Thread Aaron
Hi, I'm trying to integrate the Tesseract source into a C++ project in Xcode 10 (on macOS High Sierra). After copying the source directory into the project, it is not obvious how the necessary API headers should be included to avoid errors. Any advice would be much appreciated. -Aaron

Re: [tesseract-ocr] Single character OCR not working

2019-02-01 Thread Zdenko Podobny
I am not sure which version of tesseract you have installed, but the current version (built from source) produces "K" as output for several psm values (6, 7, 8, 10), e.g. tesseract unnamed.png - --psm 6 Warning: Invalid resolution 0 dpi. Using 70 instead. K So first check your tesseract version. Zdenko On Mon, 28 Jan

[tesseract-ocr] Re: convert a .tiff file to text file

2019-02-01 Thread George Varghese
Thanks for the help; it is working as expected. I really appreciate it. The uzn range is honored by tesseract. I need to fine-tune the range a little more, but it is working completely as desired. The server CPU no longer spikes to 90% as was the case before - but now in the

[tesseract-ocr] Re: convert a .tiff file to text file

2019-02-01 Thread George Varghese
My command is tesseract doc.tif doc --psm 4 On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote: > > I am using tesseract v4 to convert .tiff file to text, only the first > page. The script - run from command line on Windows 2012 takes almost 8 > seconds to convert on

Re: [tesseract-ocr] Re: convert a .tiff file to text file

2019-02-01 Thread Zdenko Podobny
In this case the command should be: tesseract.exe SCREENCAPTURE.JPG output --psm 4 and the attached SCREENCAPTURE.uzn file must be in the same location as SCREENCAPTURE.JPG Zdenko On Fri, 1 Feb 2019 at 18:53, George Varghese wrote: > > > On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Vargh
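A minimal sketch of the setup described above: write a .uzn file next to the image and build the matching command line. The zone values (20 40 400 200 text) come from this thread; reading the four numbers as left, top, width, height is an assumption, and the helper names format_uzn, write_uzn and build_command are hypothetical, not part of tesseract.

```python
from pathlib import Path


def format_uzn(zones):
    """Render zones as .uzn text, one 'n n n n label' line per zone.

    Assumption: the four numbers are left, top, width, height in pixels.
    """
    return "".join(
        f"{left} {top} {width} {height} {label}\n"
        for left, top, width, height, label in zones
    )


def write_uzn(image_path, zones):
    """Write <image stem>.uzn in the same folder as the image and return its path."""
    uzn_path = Path(image_path).with_suffix(".uzn")
    uzn_path.write_text(format_uzn(zones), encoding="ascii")
    return uzn_path


def build_command(image_path, output_base, psm=4):
    """Build the argument list; the .uzn file is NOT passed on the command line."""
    return ["tesseract", str(image_path), output_base, "--psm", str(psm)]


if __name__ == "__main__":
    # Example values taken from the thread.
    print(format_uzn([(20, 40, 400, 200, "text")]), end="")
    print(" ".join(build_command("doc.tif", "doc")))
```

The point of the sketch is that the zone file is found by name and location rather than given as an argument, which is why the corrected command in this thread has no doc.uzn in it.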

Re: [tesseract-ocr] Re: convert a .tiff file to text file

2019-02-01 Thread Zdenko Podobny
It works, but your command is wrong... Did you read the link I posted? It should be: tesseract doc.tif doc --psm 4 Zdenko On Fri, 1 Feb 2019 at 18:52, George Varghese wrote: > The UZN did not work. Attached the screen shot .tif file - some > confidential info removed. > > My command was tesserac

[tesseract-ocr] Re: convert a .tiff file to text file

2019-02-01 Thread George Varghese
On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote: > > I am using tesseract v4 to convert .tiff file to text, only the first > page. The script - run from command line on Windows 2012 takes almost 8 > seconds to convert only the first page. using the configuration. The

[tesseract-ocr] Re: convert a .tiff file to text file

2019-02-01 Thread George Varghese
The UZN did not work. Attached is the screenshot .tif file, with some confidential info removed. My command was tesseract doc.tif doc.uzn output -l eng --oem 1 --psm 4 -c tessedit_page_number=1 The doc.uzn was in the same folder as the .tif file: 20 40 400 200 text On Wednesday, January 30, 2019 at 11

Re: [tesseract-ocr] Re: normalisation failed for string error

2019-02-01 Thread Prabhakar Tayenjam
I ran tesstrain using langdata-lstm and still get the "normalisation failed" error. I have not done the substitutions, though. I would like to know how this error affects the accuracy of the newly trained model.

Re: [tesseract-ocr] Re: normalisation failed for string error

2019-02-01 Thread Shree Devi Kumar
Please run a substitution script to clean up your training text. eg. for Hindi I use the following sed script. s/ / /g s/्‌ं/ं/g s/‌्‌ृ/‌ृ/g s/ा्/ा/g s/ि्/ि/g s/ी्/ी/g s/ु्/ु/g s/े्/े/g s/ै्/ै/g s/ो्/ो/g s/ौ्/ौ/g s/ॊ्/ॊ/g s/ॆ्/ॆ/g s/ॉ्/ॉ/g s/ृ्/ृ/g s/°//g s/²//g s/³//g s/¹//g s//ः/g s//॑/g s//॒
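For readers who prefer Python to sed, a rough sketch of the same kind of clean-up follows. It covers only the rules that are legible in the excerpt above (drop a virama that directly follows a vowel sign, and delete the characters ° ² ³ ¹); the remaining rules are truncated or garbled in the archive and are not reproduced. The function name clean_training_text is hypothetical.

```python
import re

VIRAMA = "\u094d"  # Devanagari sign virama

# Vowel signs named in the legible sed rules above:
# aa, i, ii, u, vocalic r, short e, e, ai, candra o, short o, o, au
VOWEL_SIGNS = (
    "\u093e\u093f\u0940\u0941\u0943\u0946"
    "\u0947\u0948\u0949\u094a\u094b\u094c"
)

# A vowel sign followed by one virama -> keep only the vowel sign.
_SIGN_THEN_VIRAMA = re.compile(f"([{VOWEL_SIGNS}]){VIRAMA}")

# Degree sign and superscript digits, which the sed script deletes.
_STRAY_SYMBOLS = re.compile("[\u00b0\u00b2\u00b3\u00b9]")


def clean_training_text(text):
    """Apply the subset of substitutions described in the lead-in."""
    text = _SIGN_THEN_VIRAMA.sub(r"\1", text)
    return _STRAY_SYMBOLS.sub("", text)
```

Like the sed rules, each pattern makes a single pass over the text, so a vowel sign followed by two viramas would still keep one of them.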

[tesseract-ocr] Re: normalisation failed for string error

2019-02-01 Thread Prabhakar Tayenjam
I have looked at it again closely and I think I have found something. Please take a look and clarify. The strings giving this error are those that contain ' ৌ', 'া', 'ী', ' ো' etc. Normalization failed for string 'ো' Normalization failed for string 'ৌ' Normalization failed for string 'ী' And this characte

Re: [tesseract-ocr] Re: normalisation failed for string error

2019-02-01 Thread Shree Devi Kumar
Use the training_text from langdata_lstm, which has the larger training text used for LSTM training (for tessdata_best and tessdata_fast). On Fri, Feb 1, 2019 at 7:14 PM Prabhakar Tayenjam wrote: > This happens every time I use tesstrain.sh. I use a training text combining > the default provided in the la

Re: [tesseract-ocr] Re: Tesseract not giving the desired output

2019-02-01 Thread Lorenzo Bolzani
Yes, old OCR solutions use binarized content, but I see this as a legacy limitation. It was probably done to speed up processing and also, I suppose, because the algorithms used would not benefit from the extra gray detail anyway. Old OCR tech was also print oriented so the text was already nea

[tesseract-ocr] Re: normalisation failed for string error

2019-02-01 Thread Prabhakar Tayenjam
This happens every time I use tesstrain.sh. I use a training text combining the default provided in langdata (https://github.com/tesseract-ocr/langdata) and some other text collected manually. I tried using only the default training text provided in langdata and got the same result. I a

Re: [tesseract-ocr] Tesseract Crashes for Spanish Language

2019-02-01 Thread PA
Thanks! Will try this. On Fri, 1 Feb 2019, 10:06, Shree Devi Kumar wrote: > https://github.com/tesseract-ocr/tessdata_best/blob/master/spa.traineddata > https://github.com/tesseract-ocr/tessdata_fast/blob/master/spa.traineddata > > Alternately, look up the file size of spa.traineddata

Re: [tesseract-ocr] Tesseract Crashes for Spanish Language

2019-02-01 Thread PA
I was actually thinking the same thing. However, plain tesseract (without options) works, so I don't know what to think. I will look in the forum for similar issues. On Fri, 1 Feb 2019, 10:04, Zdenko Podobny wrote: > IMO if any program can cause crash of computer/reboot of system you h

Re: [tesseract-ocr] normalisation failed for string error

2019-02-01 Thread Shree Devi Kumar
Looks like two maatraas together, or a maatraa followed by a Vedic accent, which does not meet the Indic normalization rules. What training text are you using? On Fri, Feb 1, 2019 at 5:58 PM Prabhakar Tayenjam wrote: > What is causing this error and what are the possible fixes? > > Normalization failed for

Re: [tesseract-ocr] Tesseract Crashes for Spanish Language

2019-02-01 Thread Shree Devi Kumar
https://github.com/tesseract-ocr/tessdata_best/blob/master/spa.traineddata https://github.com/tesseract-ocr/tessdata_fast/blob/master/spa.traineddata Alternately, look up the file size of spa.traineddata on your desktop and laptop. You can try copying the one from laptop (working version) to deskt

Re: [tesseract-ocr] Tesseract Crashes for Spanish Language

2019-02-01 Thread Zdenko Podobny
IMO if any program can crash the computer or reboot the system, you have a big problem (not related to tesseract). Please try searching the forum - I think somebody already had a similar issue. Zdenko On Fri, 1 Feb 2019 at 13:45, PA wrote: > Are those test data for Spanish language? > > Al

Re: [tesseract-ocr] Tesseract Crashes for Spanish Language

2019-02-01 Thread PA
Are those test data for the Spanish language? Also, I cannot give the error message because tesseract crashes and makes the desktop reboot. Do you know a way to save it to a text file? On Fri, 1 Feb 2019, 09:39, Shree Devi Kumar wrote: > >This was installed from github, and tessdata comes from > https:

Re: [tesseract-ocr] Tesseract Crashes for Spanish Language

2019-02-01 Thread Shree Devi Kumar
>This was installed from github, and tessdata comes from https://github.com/tesseract-ocr/tessdata/blob/master/spa.traineddata Please try with the traineddata files from tessdata_best and tessdata_fast. Also give the exact error message/console output. On Fri, Feb 1, 2019 at 5:43 PM PA wrote: > On m

[tesseract-ocr] normalisation failed for string error

2019-02-01 Thread Prabhakar Tayenjam
What is causing this error and what are the possible fixes? Normalization failed for string 'া' Word started with a combiner:0x982 Normalization failed for string 'ং' Word started with a combiner:0x9c1 Normalization failed for string 'ু' Word started with a combiner:0x9c0 Normalization failed fo
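One way to find the offending entries before training is to scan the training text for words whose first character is a combining mark, which is what the "Word started with a combiner" message points at. The sketch below uses only the Python standard library and approximates the check with Unicode general categories (Mn, Mc, Me). Tesseract's own validator applies more rules than this, so treat it as a rough pre-filter; the function name is hypothetical.

```python
import unicodedata


def words_starting_with_combiner(text):
    """Return (word, code point) pairs for words that begin with a combining mark.

    Words are split on whitespace; a character counts as a combiner when its
    Unicode general category starts with 'M' (Mn, Mc or Me).
    """
    flagged = []
    for word in text.split():
        first = word[0]
        if unicodedata.category(first).startswith("M"):
            flagged.append((word, hex(ord(first))))
    return flagged


if __name__ == "__main__":
    # A lone Bengali vowel sign (U+09BE) next to a well-formed syllable.
    sample = "\u09be \u0995\u09be"
    for word, code_point in words_starting_with_combiner(sample):
        print(f"starts with combiner {code_point}: {word!r}")
```

Entries flagged this way are usually stray vowel signs separated from their base consonant by a space, which matches the strings quoted in this thread.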

Re: [tesseract-ocr] Tesseract Crashes for Spanish Language

2019-02-01 Thread PA
On my laptop: tesseract 4.0.0-beta.1 leptonica-1.75.3 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0 Found AVX2 Found AVX Found SSE This was installed from Kubuntu packages, so the tessdata comes from there.

[tesseract-ocr] run_shape_clustering argument not recognised in tesseract 4

2019-02-01 Thread Prabhakar Tayenjam
I am doing LSTM training for Bengali in tesseract 4.0.0-255-gfc55. When running tesstrain.sh with the run_shape_clustering argument, I get the error: ERROR: Unrecognized argument --run_shape_clustering. This argument is required for training Indic languages. Any solutions?

[tesseract-ocr] Re: Text Association to particular bounding points

2019-02-01 Thread Raghavender Rohilla
Please, guys, any help would be a huge favor! I have been stuck on this for days now. Thanks! On Friday, February 1, 2019 at 10:18:23 AM UTC+5:30, Raghav Rohilla wrote: > > Hi, I am working on a project in which i am trying to achieve text > detection and then associating it to the particular

Re: [tesseract-ocr] Re: convert a .tiff file to text file

2019-02-01 Thread Zdenko Podobny
What does not work? uzn? It works with tesseract 4 - I just tested it. If you are really interested in help or a reply, please be specific and detailed about what you did and what you use, and provide examples for reproducing the problem. Zdenko On Fri, 1 Feb 2019 at 2:35, George Varghese wrote: > Does not work in

Re: [tesseract-ocr] Question about "Failed loading language"

2019-02-01 Thread Shree Devi Kumar
try with --tessdata-dir /usr/local/share/tessdata/ On Fri, Feb 1, 2019 at 12:29 PM nampyo hong wrote: > [image: tesseract.PNG] > When I was running tesseract 3.0.4, there was no problem. > > I tried to install tesseract 4.0.0 in ubuntu 16.04 by building it from > source, but there was an issue.

Re: [tesseract-ocr] Re: Tesseract for invoices

2019-02-01 Thread Zdenko Podobny
What I heard 😀: because of the complexity and variability of input images, companies doing invoice digitization (with tesseract) use a custom solution for image/page analysis (e.g. finding the position of the invoice number) and use tesseract only for the OCR process. Zdenko On Fri, 1 Feb 2019 at 9:02, Kristóf Horváth wrote

[tesseract-ocr] Re: Tesseract for invoices

2019-02-01 Thread Kristóf Horváth
Well, first I would get a bunch of examples with different positions and run them through OCR with auto segmentation and the best traineddata. Then, based on the results, I would try different segmentation modes to see the difference. There is a chance you won't have to train it. When extraction is accurate enough you ju

[tesseract-ocr] Re: Tesseract for invoices

2019-02-01 Thread Kristóf Horváth
On Thursday, January 31, 2019 at 23:49:43 UTC+1, Shailesh Barve wrote: > > Hey all, > I have a requirement to process invoices and extract few data elements > from it (e.g. invoice number, date, customer name, total amount). > Incoming invoices are of different formats with rel