[tesseract-ocr] Question about "Failed loading language"

2019-01-31 Thread nampyo hong
[image: tesseract.PNG] When I was running tesseract 3.0.4, there was no problem. I tried to install tesseract 4.0.0 on Ubuntu 16.04 by building it from source, but there was an issue. I referenced https://bingrao.github.io/blog/post/2017/07/16/Install-Tesseract-4.0-in-ubuntun-16.04.html this

[tesseract-ocr] Re: convert a .tiff file to text file

2019-01-31 Thread George Varghese
Does not work in Tesseract 4. On Wednesday, January 30, 2019 at 11:34:42 AM UTC-8, George Varghese wrote: > I am using tesseract v4 to convert a .tiff file to text, only the first page. The script, run from the command line on Windows 2012, takes almost 8 seconds to convert only the first pa

[tesseract-ocr] Tesseract for invoices

2019-01-31 Thread Shailesh Barve
Hey all, I have a requirement to process invoices and extract a few data elements from them (e.g. invoice number, date, customer name, total amount). Incoming invoices come in different formats, with the data elements in varying positions. E.g. the invoice number may be on the right or on the left, etc. How would

Re: [tesseract-ocr] Re: convert a .tiff file to text file

2019-01-31 Thread Zdenko Podobny
https://groups.google.com/forum/#!topic/tesseract-ocr/e3lqpY0pMpw https://groups.google.com/forum/#!topic/tesseract-ocr/UidqCx6OE0Q https://github.com/OpenGreekAndLatin/greek-dev/wiki/uzn-format https://github.com/jsoma/tesseract-uzn ... PS: I hope it works with tesseract 4 too ;-) I did not teste

[tesseract-ocr] Re: convert a .tiff file to text file

2019-01-31 Thread George Varghese
I am using tesseract v4.0.0.20181030 with leptonica-1.76.0. In short, I am using the command line to convert a .tiff file to a .txt file, with no loop or custom solution. Yes, the first 30 lines have the same location, and I am specifying to OCR only my first page. You mentioned about usage of unz f

Re: [tesseract-ocr] convert a .tiff file to text file

2019-01-31 Thread Zdenko Podobny
It is not clear to me what you want to achieve; to me it looks like a case for a custom solution using the tesseract API (C, C++, Python, maybe others). If you can use only the tesseract executable and your "30 lines" have the same location (or you know their location in advance), you can have

Re: [tesseract-ocr] Re: Tesseract not giving the desired output

2019-01-31 Thread Zdenko Podobny
See inline comments. On Wed, 30 Jan 2019 at 15:17, Lorenzo Bolzani wrote: > > I suppose this means that the image is always binarized, is this correct? > Yes > > Is there any way to avoid it? > Why? IMO OCR engines run on binarized images; see e.g. https://www.abbyy.com/en-eu/ocr-sdk/key-

Re: [tesseract-ocr] Should i use lstm training or TIFF/BOX file training?

2019-01-31 Thread Kristóf Horváth
Yes, and as far as I know that requires different training than LSTM, because in its current state tesseract doesn't support that. On Thursday, January 31, 2019 at 15:16:18 UTC+1, Timothy Snyder wrote: > > When you refer to TIFF/BOX file training, do you mean manually creating > your o

Re: [tesseract-ocr] pytesseract: errors with recognized digits

2019-01-31 Thread Lorenzo Bolzani
Check the API: https://pypi.org/project/pytesseract/ There is an example under: Support for OpenCV image/NumPy array objects. You may also try different languages (I had different results just on numbers). On Thu, 31 Jan 2019 at 15:18, Aaron Spell <8383...@gmail.com> wrote: >

Re: [tesseract-ocr] pytesseract: errors with recognized digits

2019-01-31 Thread Aaron Spell
Lorenzo Blz, thanks for your reply. PSM 13 results are better than PSM 6. Cropping the white border did not give any results. I will try to train tesseract. *How can I send a byte array to Tesseract, to avoid saving the picture to the hard disk and opening it again?* On Wednesday, January 30, 2019 at 17:25:26 UTC+3, user
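One way to approach the byte-array question above without touching the disk is to pipe the encoded image into the tesseract executable, which accepts `stdin` as the input name and `stdout` as the output base. This is only a minimal sketch that bypasses pytesseract rather than using it; the function names, the default page segmentation mode (13, taken from the message above) and the default language are assumptions for illustration. The pytesseract page linked in the reply describes the other route, passing an in-memory image object instead of a file name.

```python
import subprocess


def build_tesseract_command(psm=13, lang="eng"):
    """Build a tesseract CLI call that reads an image from stdin and prints text to stdout.

    psm and lang are illustrative defaults, not values confirmed by the thread.
    """
    return ["tesseract", "stdin", "stdout", "--psm", str(psm), "-l", lang]


def ocr_image_bytes(image_bytes, psm=13, lang="eng"):
    """OCR an encoded image (for example PNG or TIFF bytes) that is held in memory.

    Requires the tesseract executable on PATH; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        build_tesseract_command(psm, lang),
        input=image_bytes,      # the bytes are sent to tesseract's stdin
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

Calling `ocr_image_bytes(data)` with bytes received from a network socket or a database would then return the recognized text as a string, with no temporary file involved.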

Re: [tesseract-ocr] Should i use lstm training or TIFF/BOX file training?

2019-01-31 Thread Timothy Snyder
When you refer to TIFF/BOX file training, do you mean manually creating your own boxfiles from your own set of images? Note that by default, lstmtraining does generate TIFF/BOX files from the fonts that you specify it to train on. With a little bit of wrangling, you can actually configure lstmtrai

[tesseract-ocr] How should i open langdata files? (for example: desired characters, eng.numbers, eng.unicharambigs)

2019-01-31 Thread Kristóf Horváth
What is the recommended format for opening and editing these kind of files?

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-31 Thread Lorenzo Bolzani
You can have a look at ocrd-train https://github.com/OCR-D/ocrd-train You just have to prepare cropped tiff and txt files with the same name containing a single line of text. At the same time, if you already set up everything for the font based training, I'd give it a try (time permitting): you
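Before starting a training run on data prepared as described above, it may help to verify that every line image has a matching transcription containing exactly one line of text. The sketch below is a hypothetical helper, not part of ocrd-train; the `.tif` and `.gt.txt` suffixes are assumptions and should be adjusted to whatever the ocrd-train README specifies.

```python
from pathlib import Path


def check_line_pairs(folder, image_suffix=".tif", text_suffix=".gt.txt"):
    """Pair each line image with its transcription and report problems.

    Returns (ok, problems): ok holds (image name, text) tuples, problems holds
    (image name, reason) tuples. Both suffixes are assumed defaults.
    """
    folder = Path(folder)
    ok, problems = [], []
    for image in sorted(folder.glob(f"*{image_suffix}")):
        # Same base name as the image, with the image suffix swapped for the text suffix.
        text_file = image.with_name(image.name[: -len(image_suffix)] + text_suffix)
        if not text_file.exists():
            problems.append((image.name, "missing transcription"))
            continue
        lines = [
            line
            for line in text_file.read_text(encoding="utf-8").splitlines()
            if line.strip()
        ]
        if len(lines) != 1:
            problems.append((image.name, f"expected 1 line of text, found {len(lines)}"))
        else:
            ok.append((image.name, lines[0]))
    return ok, problems
```

Running `check_line_pairs("data/ground-truth")` (a hypothetical folder name) and fixing everything listed in `problems` first avoids discovering broken pairs halfway through training.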

[tesseract-ocr] Should i use lstm training or TIFF/BOX file training?

2019-01-31 Thread Kristóf Horváth
I'm planning on training tesseract to recognise sensitive information (3 letters followed by numbers; the point is to find the 3 letters so that in post-processing we can lock that document because it has sensitive information). While sensitive information is high priority, accuracy is key too and som
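The post-processing step described above (find three letters followed by numbers in the OCR output, then lock the document) could be sketched with a regular expression. The exact code format is not given in the message, so the pattern below is an assumption: three uppercase letters, an optional space or hyphen, then one or more digits. The function names are hypothetical.

```python
import re

# Assumed format: exactly three uppercase letters, an optional space or hyphen,
# then one or more digits. Adjust to the real identifier format.
SENSITIVE_CODE = re.compile(r"\b([A-Z]{3})[ -]?(\d+)\b")


def find_sensitive_codes(ocr_text):
    """Return (letters, digits) tuples for every code-like token in the OCR text."""
    return [(m.group(1), m.group(2)) for m in SENSITIVE_CODE.finditer(ocr_text)]


def should_lock(ocr_text):
    """True if the document contains at least one code-like token."""
    return bool(find_sensitive_codes(ocr_text))
```

Because a missed code is worse than a false alarm in this use case, a looser pattern (for example, one that tolerates common OCR confusions such as O/0) may be preferable to a strict one.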

Re: [tesseract-ocr] Evaluating Tesseract with new domain-specific documents

2019-01-31 Thread Matthew Hodgskiss
Thanks very much for the advice. The ocr-evaluation tools look particularly useful. On Friday, 25 January 2019 12:04:13 UTC, shree wrote: > > also see > > https://github.com/impactcentre/ocrevalUAtion > > https://github.com/Shreeshrii/ocr-evaluation-tools > > https://github.com/tesseract-ocr/test/

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-31 Thread Daniel Ferenc
Is there a guide somewhere on how to set up training like this? How to pair the images and text, etc.? And thank you for the insight, it really is helpful. On Thursday, January 31, 2019 at 11:18:35 AM UTC+1, Lorenzo Blz wrote: > > Yes, generating text is faster and easier. > > But the real extracted

Re: [tesseract-ocr] Training for a specific wordlist and font

2019-01-31 Thread Lorenzo Bolzani
Yes, generating text is faster and easier. But the real extracted and cleaned text you are going to eventually recognize is going to be different from this, more or less depending on a lot of factors: - how similar your training font actually is - how good your cleanup will be (test this in advanc

Re: [tesseract-ocr] Re: How to do lstm training using box/tiff files?

2019-01-31 Thread Kristóf Horváth
Well, you just repeated yourself and did not provide any new information. Like I said, I'm using the latest version, so what am I doing wrong? Also, I'm not working in Ubuntu but in Cygwin (not the same). On Thursday, January 31, 2019 at 10:57:45 UTC+1, 易鑫 wrote: > > @Kristóf Horváth > Oh i see, bu

Re: [tesseract-ocr] Re: How to do lstm training using box/tiff files?

2019-01-31 Thread 易鑫
@Shree Devi Kumar: Thanks for your reply. lstm training using box/tiff files is NOT supported. Use tesstrain.sh with a UTF8 training_text and fonts. Maybe you are right. But I think using training_text will also generate tiff/box files in the /tmp folder, so I think using box/tiff files and training_

Re: [tesseract-ocr] Re: How to do lstm training using box/tiff files?

2019-01-31 Thread 易鑫
@Kristóf Horváth Oh I see, but I don't know what you mean by this: you can use the master branch, latest code. I compiled the latest version on my Cygwin setup so I don't know what you are referring to. Sorry, I did not say it clearly. It means use the master branch. I have successfully trained lstm model in

[tesseract-ocr] How to write .unicharambigs file?

2019-01-31 Thread 易鑫
Hello, everyone: I have trained a new lstm model in my project, but the result is not as good as I expected. I notice that some characters are often mistaken in my results. I learned that adding some rules to .unicharambigs can reduce the mistakes? I extracted the eng.traineddata and got the

Re: [tesseract-ocr] Re: How to do lstm training using box/tiff files?

2019-01-31 Thread Shree Devi Kumar
lstm training using box/tiff files is NOT supported. Use tesstrain.sh with a UTF8 training_text and fonts. On Thu, Jan 31, 2019 at 3:04 PM Kristóf Horváth wrote: > Oh I see, but I don't know what you mean by this: you can use the master > branch, latest code. I compiled the latest version on my c

Re: [tesseract-ocr] Re: How to do lstm training using box/tiff files?

2019-01-31 Thread Kristóf Horváth
Oh I see, but I don't know what you mean by this: you can use the master branch, latest code. I compiled the latest version on my Cygwin setup so I don't know what you are referring to. On Thursday, January 31, 2019 at 10:27:17 UTC+1, 易鑫 wrote: > > Thanks for your reply. I have alrea

[tesseract-ocr] Re: How to do lstm training using box/tiff files?

2019-01-31 Thread Kristóf Horváth
EDIT: Environment - Tesseract Version: 4.0.0 - Platform: Win10 64 (cygwin). Current Behavior: Confusing af (pls fix wiki; as soon as I can make my demo work I will have to document it, so I'm going to send it so you guys will be able to have a wiki). Expected Behavior: run as intended copied f

Re: [tesseract-ocr] Re: How to do lstm training using box/tiff files?

2019-01-31 Thread 易鑫
Thanks for your reply. I have already done lstm training on Ubuntu successfully, but the result is not as good as I expected and I did not use my tiff/box files, so I want to add more samples; that's why I ask how to do lstm training using box/tiff files. As you mentioned: " > tesstrain.sh --f

[tesseract-ocr] Training tesseract tesstrain.sh exits with a warning

2019-01-31 Thread Kristóf Horváth
Currently I am trying to make sense of tesseract training, and after days of digging I finally managed to gain access to the tesstrain.sh and lstmtraining commands in my Cygwin. I was so happy, because the wiki is no help in setting up training for tesseract, but as soon as I wanted to start doi

[tesseract-ocr] Re: How to do lstm training using box/tiff files?

2019-01-31 Thread Kristóf Horváth
> > I feel you. I'm currently trying to understand lstm training but the wiki is >> weak as hell, so I'm doing trial and error blindly. So far I managed to set up >> tesseract training on Cygwin so I have access to the tesstrain and lstmtraining >> commands. Achieving this should be your first step, then i sug