Re: [tesseract-ocr] Tesseract with Thai language

2019-01-30 Thread Shree Devi Kumar
> I am able to extract the Thai characters perfectly in a Windows environment, whereas when I extract the same on Ubuntu I find spaces between the characters in the extracted text. What are the exact versions of tesseract in both environments? `tesseract -v` Also, which traineddata file are you usi

[tesseract-ocr] Re: I have a question about making a traineddata (tesseract 4.0 LSTM)

2019-01-30 Thread Kristóf Horváth
On Thursday, March 1, 2018 at 5:02:00 UTC+1, 이경준 wrote: > > Hi > > I have a question about making a traineddata (tesseract 4.0 LSTM) > > Tutorial Guide to lstmtraining > Crea

[tesseract-ocr] How to create lstm-unicharset and similar files for tesseract training?

2019-01-30 Thread Kristóf Horváth
I'm not the only one with this problem: Friend in trouble. The wiki for tesseract is missing almost everything a wiki should have. The problem is that training tesseract requires certain files from the get-go, but i

[tesseract-ocr] Re: How to create lstm-unicharset and similar files for tesseract training?

2019-01-30 Thread Kristóf Horváth
I also posted it on https://superuser.com/questions/1399989/how-to-create-lstm-unicharset-and-similar-files-for-tesseract-training On Wednesday, January 30, 2019 at 9:44:12 UTC+1, Kristóf Horváth wrote: > > I'm not the only one with this problem: Friend in trouble >

[tesseract-ocr] My confusion about "Fine Tuning for ± a few characters"

2019-01-30 Thread 易鑫
Hello, everyone: I am confused about "Fine Tuning for ± a few characters". In the wiki (https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters), it says "Modify langdata/eng/eng.training_text to include some samples of ±."

Re: [tesseract-ocr] How to optimize tesseract to maximum speed for single number (several digits) recognition

2019-01-30 Thread Lorenzo Bolzani
Did you check this? https://www.raspberrypi.org/forums/viewtopic.php?f=63&t=147781&start=50#p972790 On Wed, Jan 30, 2019 at 08:09, Jan Pohanka wrote: > I have already done that but haven't found anything interesting. > I tried to ask here if there are e.g. any part of algorithms t

Re: [tesseract-ocr] Re: Tesseract not giving the desired output

2019-01-30 Thread Lorenzo Bolzani
Zdenko, are you 100% sure that the image is binarized before being fed to the neural network? It looks like a big waste of information to me. On Wed, Jan 30, 2019 at 07:56, Zdenko Podobny wrote: > That is not true: you do not need to transform the image to grayscale. Any > image is a

Re: [tesseract-ocr] How to optimize tesseract to maximum speed for single number (several digits) recognition

2019-01-30 Thread Jan Pohanka
You were right, I just found that my RPi is throttling. That explains the slowdown. Now I'm checking whether a heatsink could help. So I expect that there is nothing to tune in my loop. I will check whether I can try a smaller model. Best regards, Jan On Wednesday, January 30, 2019 at 11:15:19 UTC+1, Lorenz

[tesseract-ocr] Re: Tesseract OCR not performing well even after data cleaning and transformations on black background data

2019-01-30 Thread settysme
@farhad khalafi, thank you for the reply. I tried, but I am getting almost the same result as my code's output. On Wednesday, January 30, 2019 at 12:05:47 PM UTC+5:30, sett...@gmail.com wrote: > > I have written some code for an image data to be extracted using > tesseract, in Python, i.e Pyt

Re: [tesseract-ocr] My confusion about "Fine Tuning for ± a few characters"

2019-01-30 Thread Shree Devi Kumar
> it says "Modify langdata/eng/eng.training_text to include some samples of ±." That is part of a training tutorial, where the goal is to add a new character ± to the eng.traineddata so that it can be recognized by the finetuned traineddata. It is only an example. You have to modify it b

Re: [tesseract-ocr] Box file layout for training tesseract4

2019-01-30 Thread Jul ius
Still interested in an example of box files for tesseract 4... Doesn't anyone have an example for us? It would be great to see how we have to handle spaces in textlines. On Monday, January 28, 2019 at 15:01:49 UTC+1, Jul ius wrote: > > Hi, > > that would also be my next question. Don't we need anythi

Re: [tesseract-ocr] Box file layout for training tesseract4

2019-01-30 Thread Shree Devi Kumar
AFAIK the textline option for box files (WordStr) has NOT been implemented. The workaround has been to use the bounding box for the whole line for every character on a line. Ref: ocrd-train project Example: च 0 0 1965 128 0 त् 0 0 1965 128 0 व 0 0 1965 128 0 ा 0 0 1965 128 0 र 0 0 1965 128 0 ि 0
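
The workaround described above can be sketched in a few lines of Python. This is only an illustration, not the ocrd-train code: the function name `line_box_entries` is hypothetical, the line width and height are assumed to be known already, and attaching combining marks (such as the virama in त्) to the preceding character is an assumption inferred from the example entries. Spaces and any end-of-line marker are not handled.

```python
import unicodedata


def line_box_entries(text, width, height, page=0):
    """Build one box-file entry per character, each using the whole-line box.

    Hypothetical helper: characters with a non-zero Unicode combining class
    are attached to the previous character (an assumption based on the
    example above, where the virama stays with its consonant).
    """
    glyphs = []
    for ch in text:
        if glyphs and unicodedata.combining(ch):
            glyphs[-1] += ch  # keep the combining mark with its base character
        else:
            glyphs.append(ch)
    # Every entry gets the same box: left=0, bottom=0, right=width, top=height.
    return [f"{glyph} 0 0 {width} {height} {page}" for glyph in glyphs]


if __name__ == "__main__":
    for entry in line_box_entries("चत्वारि", 1965, 128):
        print(entry)
```

For a 1965 x 128 px line image this produces entries of the same shape as the example, one per line of the box file.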

Re: [tesseract-ocr] Box file layout for training tesseract4

2019-01-30 Thread Shree Devi Kumar
Also see https://github.com/tesseract-ocr/tesseract/blob/cfa787d976007f5866ce25fbd8e2a0223fc40fda/src/ccstruct/boxread.cpp#L165 https://github.com/tesseract-ocr/tesseract/blob/3ac33d59aeb93fc9dab13874a64ab0b73690d5eb/src/ccmain/applybox.cpp#L36 On Wed, Jan 30, 2019 at 5:15 PM Shree Devi Kumar w

[tesseract-ocr] Re: AttributeError: module 'pytesseract' has no attribute 'pytesseract'

2019-01-30 Thread settysme
Also add the Tesseract and pytesseract paths to the environment variables. On Monday, May 28, 2018 at 9:32:13 AM UTC+5:30, bryan lee wrote: > > Hi All, > > Help needed, I know this is very basic, as I am not able to continue from > here. > I was trying to use pytesseract. > Below is what I have done:

[tesseract-ocr] pytesseract: errors with recognized digits

2019-01-30 Thread Aaron Spell
Hi! I've started using tesseract with Python and have some questions. This example shows how I am trying to recognize an image: import pytesseract from PIL import Image pytesseract.pytesseract.tesseract_cmd = r'c:\Program Files (x86)\Tesseract-OCR\tesseract.exe' x = Image.open("err1.png") text =

Re: [tesseract-ocr] Re: Tesseract not giving the desired output

2019-01-30 Thread Zdenko Podobny
Try: tesseract image - get.image which calls GetThresholdedImage() Zdenko On Wed, Jan 30, 2019 at 11:17, Lorenzo Bolzani wrote: > > Zdenko, are you 100% sure that the image is bin

Re: [tesseract-ocr] Re: Tesseract not giving the desired output

2019-01-30 Thread Lorenzo Bolzani
I suppose this means that the image is always binarized, is this correct? Is there any way to avoid it? Does this binarization happen by default during training too? I fine-tuned a few models using grayscale images. Do you think the neural network received binary black/white pixels or the gray

Re: [tesseract-ocr] pytesseract: errors with recognized digits

2019-01-30 Thread Lorenzo Bolzani
Try psm 6. Try a few small upscales so that the text is between 30 and 40 px and see if it helps, like 31, 33, 35, 37, 39 (on a large test set). Try to crop all the white border (imagemagick, gimp) and see if it helps. Otherwise you need to fine-tune the model: https://github.com/tesseract-ocr/tesse
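
A rough sketch of the first two suggestions, assuming the 30-40 px figure refers to text height and that the current text height in the image has already been measured by hand. The helper names `candidate_sizes` and `ocr_at_sizes` are hypothetical. `ocr_at_sizes` needs Pillow and pytesseract installed, and cropping the white border is not covered here.

```python
def candidate_sizes(size, text_height, targets=(31, 33, 35, 37, 39)):
    """Return one (width, height) per target text height, keeping aspect ratio.

    size is the current image size; text_height is the measured height of
    the text in pixels (an assumption: it has to be supplied by the caller).
    """
    width, height = size
    sizes = []
    for target in targets:
        scale = target / text_height
        sizes.append((round(width * scale), round(height * scale)))
    return sizes


def ocr_at_sizes(path, text_height):
    """Run OCR with --psm 6 at each candidate size and collect the results."""
    # Third-party imports stay local so candidate_sizes works without them.
    import pytesseract
    from PIL import Image

    image = Image.open(path)
    results = {}
    for size in candidate_sizes(image.size, text_height):
        resized = image.resize(size, Image.LANCZOS)
        results[size] = pytesseract.image_to_string(resized, config="--psm 6")
    return results
```

Comparing the returned strings against known ground truth over a large test set, as suggested above, would show which size works best for a given kind of image.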

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-30 Thread Daniel Ferenc
So, I have figured out what I was doing wrong: - I am using tesseract packages I got from apt on Ubuntu 18.04 LTS and they were obviously missing some langdata which I downloaded from the repository - There was also a need to get the Latin.unicharset file - And finally I didn't notice an error i

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-30 Thread Lorenzo Bolzani
If you have images of the cards with the corresponding text you could train it on the cropped/cleaned text directly. On Wed, Jan 30, 2019 at 15:41, Daniel Ferenc wrote: > So, I have figured out what I was doing wrong: > > - I am using tesseract packages I got from apt on Ubuntu 18

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-30 Thread Daniel Ferenc
I'm not sure how exactly I would set that up (regarding tesseract training), BUT there are about 44000 (English) cards at this time and a high-resolution image of each is about 2 megs (at least from the resource I can get them from). Also, not every card has the same format, so a generic crop functi

[tesseract-ocr] Re: Tesseract OCR not performing well even after data cleaning and transformations on black background data

2019-01-30 Thread farhad khalafi
A few questions: Is the image you have posted the original, or the result after processing? What is the image resolution? What does the extracted text look like? Any possibility of sharing the original image without redactions? On Wednesday, January 30, 2019 at 3:36:23 AM UTC-7, sett...@gmail.com

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-30 Thread Daniel Ferenc
Oh, and one more thing - the same card with the same name can appear in different editions of Magic, so pure recognition by name is not enough, I'm also training my software to recognize the edition of the card by using different means so all that in combination should be quite enough. On Wedne

Re: [tesseract-ocr] Box file layout for training tesseract4

2019-01-30 Thread tcs49
Here's a Google Drive link to a few examples of mine: https://drive.google.com/file/d/1Bhl8nv6rRx2xu5tQx_T1Ru9dvbCyAu6H/view?usp=sharing Each textline in the image has a line in the box file for each character in the textline. The box dimensions following a single character are not for a single ch

[tesseract-ocr] convert a .tiff file to text file

2019-01-30 Thread George Varghese
I am using tesseract v4 to convert a .tiff file to text, only the first page. The script, run from the command line on Windows 2012 using the configuration -c tessedit_page_number=1, takes almost 8 seconds to convert only the first page. The CPU usage also shoots up to 80% during that time. In re

[tesseract-ocr] Re: Announcement: introducing TesseractStudio.Net, a free Windows GUI for Tesseract 4.0

2019-01-30 Thread farhad khalafi
We have released version 1.3 of Tesseract Studio with the following enhancements: - Improved memory management to support large multi-page files. - Streaming interface to Leptonica. - Eliminate unnecessary cache of images. - Unload processed pages early. - Tested with a

[tesseract-ocr] Re: Tesseract OCR not performing well even after data cleaning and transformations on black background data

2019-01-30 Thread settysme
I have processed the image: grayscaled, resized (300 dpi), and denoised using fastNlMeansDenoising, all using OpenCV 4.0.0. Suppose the text on the image reads "26 Electrical 8.34 7.47 171,637 "; my OCR reads it as "16,, 5mm -, _. - m. 16w: 111.9311" On Wednesday, January 30, 2019 at 8:39:56 PM UTC+5:30

[tesseract-ocr] How to do lstm training using box/tiff files?

2019-01-30 Thread 易鑫
Hello, everyone: I used the tesseract 3.05 engine before and have lots of tiff and box files. Now I want to use the tesseract 4.0.0 engine for LSTM training. How can I train with the tiff/box files in the new engine? Thanks in advance.

[tesseract-ocr] Tesseract Output

2019-01-30 Thread Raghav Rohilla
Hi, I wanted to ask how I can interpret the output of Tesseract. By this I mean that we are getting the same columns, page numbers, etc.; I wanted to know whether we can tweak the output to suit us or not. And if there is any way through which we can write a Python script and you know creati