[tesseract-ocr] Training the LSTM language model explicitly in an unsupervised manner

2018-10-15 Thread Rahul Tyagi
Hi, I am trying to run tesseract-ocr on invoices to detect user IDs, invoice numbers, tax codes etc. I think tesseract has not been trained on this kind of data, so I need to fine-tune the network on my data. Now it will be a bit difficult for me to get labelled data to fine-tune tesseract as s

Re: [tesseract-ocr] Training the LSTM language model explicitly in an unsupervised manner

2018-10-15 Thread Soumik Ranjan Dasgupta
No, tesseract cannot be trained in an unsupervised manner; it needs ground truth labels to train from scratch or fine-tune. Please provide a sample image to test if possible. On Mon, Oct 15, 2018 at 12:38 PM Rahul Tyagi wrote: > Hi, > > I am trying to run tesseract-ocr on invoices to detect user

Re: [tesseract-ocr] Re: Heads up: release of tesseract 4.0

2018-10-15 Thread Soumik Ranjan Dasgupta
Is there any way tesseract could be installed using pip for Ubuntu 16.04 systems and above? On Sun, Oct 14, 2018 at 11:46 PM Zdenko Podobny wrote: > it will depend on the number of (significant) commits and findings ;-) > E.g. just yesterday we got fixes for Mac and it is still not clear if >

Re: [tesseract-ocr] Re: Heads up: release of tesseract 4.0

2018-10-15 Thread Zdenko Podobny
Are you familiar with the tools you are trying to use? pip is for distributing Python modules, and tesseract is a C++ project, which is distributed with other tools (depending on the Linux distribution); on Ubuntu it should be apt. Zdenko On Mon, 15 Oct 2018 at 10:09, Soumik Ranjan Dasgupta wrote: > Is there any wa

Re: [tesseract-ocr] Re: Heads up: release of tesseract 4.0

2018-10-15 Thread Soumik Ranjan Dasgupta
Didn't know that, sorry. Thank you for the information. In that case, would it be possible to find a way to install tesseract via apt on Ubuntu 16.04 systems? On Mon, Oct 15, 2018, 2:00 PM Zdenko Podobny wrote: > Are you familiar with the tools you are trying to use? > pip is for distributing Python modules an

Re: [tesseract-ocr] Re: Heads up: release of tesseract 4.0

2018-10-15 Thread Zdenko Podobny
Read the forum and the wiki ;-) It is already there. Zdenko On Mon, 15 Oct 2018 at 10:32, Soumik Ranjan Dasgupta wrote: > Didn't know that, sorry. Thank you for the information. > In that case, would it be possible to find a way to install tesseract via > apt on Ubuntu 16.04 systems? > > On Mon, O

Re: [tesseract-ocr] Training the LSTM language model explicitly in an unsupervised manner

2018-10-15 Thread Rahul Tyagi
[image: 1_7wBhusJmIwkiwV-J3LJ7lw.png] I am not trying to train the whole model in an unsupervised way. I just want to train the language model, which acts as the final layer of tesseract to generate variable-length sequences; this will act like a *pre-training* step. Just like other language mode

Re: [tesseract-ocr] Empty page!!

2018-10-15 Thread flaviumarc
Thank you, now it is working (tesseract c:\Flaviu\imagine.png C:\Flaviu\output --psm 13) On Friday, October 12, 2018 at 4:18:34 PM UTC+3, zdenop wrote: > > You got it because you forgot to read the manual/documentation for the tool you are trying > to use :-). > You can start with tesseract --help, --help-extra et

Re: [tesseract-ocr] Convert image to text shows arrow instead of empty string

2018-10-15 Thread Magdalena Orzechowska
Actually, when you open the out.txt file in Notepad, it's not empty. There is an arrow there. The same arrow appears in the PyCharm output. Previously it was empty. On Sun, 14 Oct 2018 at 12:30, Soumik Ranjan Dasgupta wrote: > The image you provided does not have any text to perform OCR in the > first

Re: [tesseract-ocr] Not getting results with numbers and currency symbols in tables

2018-10-15 Thread Lorenzo Bolzani
Just a small note (in case someone lands on this thread): I recently found out that PSM 7 and others work better than 13. See: https://github.com/tesseract-ocr/tesseract/issues/1778#issuecomment-429527692 On Tue, 31 Jul 2018 at 11:30, Lorenzo Bolzani < l.bolz...@gmail.com> wrote

Re: [tesseract-ocr] Convert image to text shows arrow instead of empty string

2018-10-15 Thread Soumik Ranjan Dasgupta
I don't see any arrows when opening it with gedit, just a symbol. I tried opening the file with Python and reading the contents. Pasting the results below: >>> f = open("out.txt",'r') >>> s = f.readline() >>> s '\x0c' Let me know if this helps. Can anyone else confirm this? On Mon, Oct 15, 2018 a

Re: [tesseract-ocr] Convert image to text shows arrow instead of empty string

2018-10-15 Thread Zdenko Podobny
It is the page separator, i.e. a form feed character. See https://en.wikipedia.org/wiki/Page_break#Form_feed Zdenko On Mon, 15 Oct 2018 at 13:15, Soumik Ranjan Dasgupta wrote: > I don't see any arrows opening it with gedit, just a symbol. > I tried opening the file with python and reading the contents. Past
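
A minimal sketch of how a caller might treat that form feed when checking whether the OCR output is "empty". It assumes the text file ends each page with `\x0c`, as the `readline()` result in this thread shows; the function names are hypothetical and not part of tesseract or any wrapper.

```python
FORM_FEED = "\x0c"  # ASCII form feed, the page separator seen in out.txt


def strip_page_separators(text: str) -> str:
    """Remove form feed characters, then trim surrounding whitespace."""
    return text.replace(FORM_FEED, "").strip()


def is_effectively_empty(text: str) -> bool:
    """True when the OCR output holds nothing but page separators/whitespace."""
    return strip_page_separators(text) == ""
```

With this check, an output of `'\x0c'` counts as empty, which matches what an image with no text would be expected to produce.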

[tesseract-ocr] Multiple Languages

2018-10-15 Thread Mariam Hijazi
Does tesseract support recognizing multiple languages in one document? And how would I do that? Regards. -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract

Re: [tesseract-ocr] Multiple Languages

2018-10-15 Thread Adrian Owen
Just list the language codes using the + delimiter. Original Message Subject: [tesseract-ocr] Multiple Languages From: Mariam Hijazi To: tesseract-ocr > Does tesseract support recognize multiple language in one document ? and how would do that ? Regards.
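
As a rough illustration of the `+` delimiter, the sketch below builds a tesseract command line with several languages joined for the `-l` option. The helper name and the `eng`/`ara` codes are assumptions chosen for the example; the matching traineddata files must be installed for whichever codes are used.

```python
def build_tesseract_command(image_path, output_base, languages):
    """Return an argument list such as: tesseract page.png out -l eng+ara."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract accepts several languages joined with '+' in a single -l value.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]
```

The returned list can be passed to `subprocess.run` on a machine where tesseract is installed.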

RE: [tesseract-ocr] Multiple Languages

2018-10-15 Thread MariamHi
I did this, but I get bad recognition for English words. What is the accuracy for multiple languages, and how can I improve it? From: Adrian Owen Sent: Monday, October 15, 2018 3:35 PM To: tesseract-ocr Subject: Re: [tesseract-ocr] Multiple Languages Just list locales using + delimiter. Sent from

RE: [tesseract-ocr] Multiple Languages

2018-10-15 Thread Adrian Owen
GIMP is your friend: https://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy If you're programming, use the KalikoImage library to replicate the manual GIMP steps; that's easy. I found greyscale didn't help. YES: Long line removal (may not apply to you) (OpenCV) YES

[tesseract-ocr] Why do I get such poor results from Tesseract for simple single character recognizing?

2018-10-15 Thread 'Yuliana Zigangirova' via tesseract-ocr
Hi everyone, I am trying to use Tesseract for single-character recognition and the results are awful. "h" is recognized as "n", "4" as "/i", "O" as "()"; [image: 1testtiff.png] [image: 6testtiff.png] [image: 2testtiff.png] Single-character mode does not seem to have any effect, as many characters are

[tesseract-ocr] New JPN_VERT traineddata (for 4.0)

2018-10-15 Thread Seokbong Choi
Hello all, Over the past 2 weeks, I trained JPN_VERT a little bit further. I included heart symbols, which are commonly used in Japanese comic books. Whenever I tried to OCR them, the entire sentence came out garbled. So, I got around the issue by training those symbols. I also trained more on casual conversations. The

Re: [tesseract-ocr] Why do I get such poor results from Tesseract for simple single character recognizing?

2018-10-15 Thread Lorenzo Bolzani
Try to use psm 7 or 13 (SINGLE_LINE and RAW_LINE). In my case 7 works best. I'm not 100% sure but it may be easier to recognize full words rather than single characters. But I do not know if this is just a test or if this is what you need to do. The default oem mode (lstm) should be the best, but
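
To compare page segmentation modes on the same image, something like the following sketch could generate one command per mode. The helper name is hypothetical; the `--psm` flag is the one used elsewhere in this thread index for tesseract 4, and `stdout` as the output base makes tesseract print the text instead of writing a file.

```python
PSM_SINGLE_LINE = 7   # treat the image as a single text line
PSM_SINGLE_CHAR = 10  # treat the image as a single character
PSM_RAW_LINE = 13     # raw line, bypassing layout-specific processing


def psm_commands(image_path, modes=(PSM_SINGLE_LINE, PSM_SINGLE_CHAR, PSM_RAW_LINE)):
    """Map each page segmentation mode to the command that runs it."""
    return {
        mode: ["tesseract", image_path, "stdout", "--psm", str(mode)]
        for mode in modes
    }
```

Running each command and comparing the outputs side by side is a quick way to see which mode suits a given kind of image.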

Re: [tesseract-ocr] New JPN_VERT traineddata (for 4.0)

2018-10-15 Thread Shree Devi Kumar
Thank you for sharing. It will be helpful if you add this info to the readme file in your GitHub repo also. Please share the training options that you used, number of fonts, iterations etc. It will be useful as a reference. On Mon, 15 Oct 2018, 17:27 Seokbong Choi, wrote: > Hello all, > > Durin

[tesseract-ocr] GUI for Tesseract

2018-10-15 Thread Mugunthan
Hi, How can I develop a GUI application with my traineddata files? I've trained in LSTM and 3.05 and need to embed them in a desktop application. How can I do that?

[tesseract-ocr] how disable diacritics recognition in tesseract 4.0

2018-10-15 Thread Fahad Al-Saidi
Hi, how can I disable diacritics recognition in tesseract 4? Is there any option for it? Thanks, Fahad

Re: [tesseract-ocr] Why do I get such poor results from Tesseract for simple single character recognizing?

2018-10-15 Thread Zdenko Podobny
1. If you have a quality problem, it is good to play with the tesseract executable instead of the API ;-) 2. It is known that passing text (in your case just one letter) is not the best idea; please try to add a small white border, e.g. 10 px 3. Please set the dpi for the image after SetImage See attachment f
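
The white border from point 2 can be sketched without any imaging library by treating the image as rows of grayscale values (0 = black, 255 = white). This is only an illustration under that assumption; the function name is hypothetical, and a real pipeline would do the padding with whatever image library it already uses.

```python
WHITE = 255  # grayscale value used for the padding


def add_white_border(pixels, border=10, fill=WHITE):
    """Return a copy of a 2D grayscale image padded on all sides."""
    if border < 0:
        raise ValueError("border must be non-negative")
    width = len(pixels[0]) if pixels else 0
    new_width = width + 2 * border
    # Blank rows above the original content.
    padded = [[fill] * new_width for _ in range(border)]
    # Original rows with padding on the left and right.
    for row in pixels:
        padded.append([fill] * border + list(row) + [fill] * border)
    # Blank rows below the original content.
    padded.extend([fill] * new_width for _ in range(border))
    return padded
```

A tightly cropped 20x30 character image becomes 40x50 with the default 10 px border, giving the recognizer some margin around the glyph.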

Re: [tesseract-ocr] Making custom traineddata

2018-10-15 Thread Vinod Gattani
Hi All, I have started a project to do OCR on Identity Cards. I am learning to train tesseract models with custom fonts. Please help me with this. Steps till now: 1. git pull https://github.com/tesseract-ocr/tesseract 2. Then I followed the instructions on training package till command "sudo make t

Re: [tesseract-ocr] Making custom traineddata

2018-10-15 Thread Robert Kamiński
Hi, "Why the version is 4.0." What do you mean by that? The log states that it is v3.04: "Tesseract Open Source OCR Engine v3.04.01 with Leptonica". The problem might be that version 4 uses lstm files, whereas you have version 3.04, which uses box files instead. Try to check the version

Re: [tesseract-ocr] Making custom traineddata

2018-10-15 Thread Vinod Gattani
Hi, Typo: "Why the version is not 4.0?" I installed using "git pull https://github.com/tesseract-ocr/tesseract" and then followed the instructions on the training page. Regards On Tue, Oct 16, 2018 at 11:53 AM Robert Kamiński < kaminski.robert...@gmail.com> wrote: > Hi, > " Why the version is 4.0

Re: [tesseract-ocr] Making custom traineddata

2018-10-15 Thread Zdenko Podobny
Robert is pointing you in the right direction. Did you read the log you posted here? "Tesseract Open Source OCR Engine v3.04.01 with Leptonica" You are mixing tesseract versions, so the problems are no surprise. Zdenko On Tue, 16 Oct 2018 at 8:26, Vinod Gattani wrote: > Hi, > Typo: " Why the version is no
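
One way to catch this kind of version mix-up early is to read the version out of the banner that tesseract prints. The sketch below parses a banner string such as the one quoted in this thread; the function names are hypothetical, and the "major version 4 or later" rule reflects that the LSTM engine arrived with the 4.0 series.

```python
import re

# Matches "v3.04.01" as well as a bare "4.0.0" after the program name.
_VERSION_RE = re.compile(r"\bv?(\d+)\.(\d+)")


def parse_version(banner):
    """Return (major, minor) from a tesseract banner, or None if absent."""
    match = _VERSION_RE.search(banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def supports_lstm(banner):
    """True when the banner reports a 4.x or later engine."""
    version = parse_version(banner)
    return version is not None and version[0] >= 4
```

Feeding it the first line printed by the installed binary shows whether the training tools and the engine on the PATH belong to the same major version.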

Re: [tesseract-ocr] Making custom traineddata

2018-10-15 Thread Vinod Gattani
Robert/Zdenko, yes, in the log I see version "v3.04". To install v4, I used the link "https://github.com/tesseract-ocr/tesseract". I thought it has tesseract v4, as the Readme file says "Source code for the new LSTM based 4.0 version is available from the master branch on GitHub." So, I did a git