[tesseract-ocr] Re: Different behaviours with the same (part of) image.

2016-06-06 Thread Ashish Goel
I generally resize the image to correct errors like this. For example, for your test1.bmp, I ran: *convert test1.bmp -resize 400% testnew.bmp* (using ImageMagick to resize the image). After this, Tesseract identified ':' correctly. Though sometimes, image resizing introduces some other
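
The resize step quoted above can be scripted for a batch of images. What follows is only a minimal sketch, assuming ImageMagick's `convert` is on the PATH; the function names `build_resize_command` and `resize_image` are hypothetical, and the 400% default simply mirrors the factor mentioned in the message, not a recommended value.

```python
import subprocess


def build_resize_command(src, dst, percent=400):
    """Return the ImageMagick argument list that scales `src` by `percent` into `dst`."""
    # Mirrors the command from the message: convert test1.bmp -resize 400% testnew.bmp
    return ["convert", src, "-resize", f"{percent}%", dst]


def resize_image(src, dst, percent=400):
    """Run the resize; assumes ImageMagick is installed (hypothetical helper)."""
    # check=True raises CalledProcessError if convert exits with a non-zero status.
    subprocess.run(build_resize_command(src, dst, percent), check=True)
```

Calling `resize_image("test1.bmp", "testnew.bmp")` would run the same command as in the message; as the poster notes, upscaling can introduce other recognition problems, so the factor is worth checking per image set.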

Re: [tesseract-ocr] Different behaviours with the same (part of) image.

2016-06-06 Thread Allistair
I do not have a technical reason for you, but I can confirm that Tesseract is sensitive to the padding around the words you are trying to detect (perhaps something to do with its page segmentation). In my experience, it is best to make sure the text has enough white space around it. On 6 June 2016 at 18:23, 'Carlo' via tesser
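
One way to add white space around a cropped word is ImageMagick's `-bordercolor` and `-border` options. The sketch below assumes ImageMagick's `convert` is on the PATH and that the image background is white; the helper names and the 20-pixel margin are illustrative assumptions, not values taken from the thread.

```python
import subprocess


def build_pad_command(src, dst, margin=20):
    """Return the ImageMagick argument list that adds a white border of `margin` pixels."""
    # -bordercolor sets the fill colour; -border WxH adds W pixels left/right, H top/bottom.
    return ["convert", src, "-bordercolor", "white", "-border", f"{margin}x{margin}", dst]


def pad_image(src, dst, margin=20):
    """Run the padding step; assumes ImageMagick is installed (hypothetical helper)."""
    subprocess.run(build_pad_command(src, dst, margin), check=True)
```

How much margin Tesseract needs is not stated in the thread, so the value is best chosen by experiment.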

Re: [tesseract-ocr] Re: Why do I get such bad results?

2016-06-06 Thread ashish goel
I had the same problem with Swedish, and a temporary workaround helped me: I zoomed (rescaled) the image to 400% and it recognized the letter (though it added other problems). Not sure, but it could improve results for you. Ashish On Mon, Jun 6, 2016 at 8:53 PM, Tom Morris wrote: > On Monday,

Re: [tesseract-ocr] Getting a blank tessinput.tif file

2016-06-06 Thread ashish goel
I am trying to process a PNG image. Will it work if I convert my PNG to TIFF before OCRing? On Mon, Jun 6, 2016 at 5:28 PM, Zdenko Podobný wrote: > Your leptonica build support only limited number of image formats. What > image you try to process? > > Zdenko > > On Mon, Jun 6, 2016 at 1:08 PM,

[tesseract-ocr] Re: Why do I get such bad results?

2016-06-06 Thread Tom Morris
On Monday, June 6, 2016 at 3:17:29 AM UTC-4, Doron Saar wrote: > > > I'm trying to train Tesseract to work with a large library of Hebrew > language documents. > Why? Did you get unacceptable results with the standard Hebrew language data? https://github.com/tesseract-ocr/tessdata/blob/master/h

Re: [tesseract-ocr] Getting a blank tessinput.tif file

2016-06-06 Thread Zdenko Podobný
Your Leptonica build supports only a limited number of image formats. What image are you trying to process? Zdenko On Mon, Jun 6, 2016 at 1:08 PM, Ashish Goel wrote: > Hello All, > > I am trying to do OCR on a bunch of images. Getting some failures, and I > want to analyse them. > So, to do that, I am tr

[tesseract-ocr] Getting a blank tessinput.tif file

2016-06-06 Thread Ashish Goel
Hello All, I am trying to do OCR on a bunch of images. Getting some failures, and I want to analyse them. So, to do that, I am trying to get the tessinput.tif file so that I can find out what input actually goes to tesseract. I am passing "-c tessedit_write_images 1" along with my tesseract to
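
For reference, Tesseract's command-line help gives the form of the `-c` option as `-c VAR=VALUE`, with the variable and its value joined by an equals sign. The sketch below builds such a command for `tessedit_write_images`; whether the option syntax has anything to do with the blank tessinput.tif is not established in this thread, and the function name and file names are hypothetical.

```python
import subprocess


def build_debug_command(image, output_base):
    """Return a tesseract argument list with tessedit_write_images enabled."""
    # -c takes a single VAR=VALUE token, per tesseract's usage text.
    return ["tesseract", image, output_base, "-c", "tessedit_write_images=1"]


def run_debug_ocr(image, output_base):
    """Run tesseract with the debug image option; assumes tesseract is on the PATH."""
    subprocess.run(build_debug_command(image, output_base), check=True)
```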

[tesseract-ocr] Re: Why do I get such bad results?

2016-06-06 Thread Doron Saar
I just get the same mistakes all the time. The letter ו is often read as ט. The letter נ is often read as ), and so on. When I add more training data files, I get worse results instead of better ones. On Monday, June 6, 2016 at 1:51:45 PM UTC+3, Ashish Goel wrote: > > If you can elaborat

[tesseract-ocr] Re: Why do I get such bad results?

2016-06-06 Thread Ashish Goel
If you can elaborate on what kind of failures you are experiencing, people might be able to help. On Monday, June 6, 2016 at 12:47:29 PM UTC+5:30, Doron Saar wrote: > > Hi, > > I'm trying to train Tesseract to work with a large library of Hebrew > language documents. > They are all in good qual

[tesseract-ocr] Why do I get such bad results?

2016-06-06 Thread Doron Saar
Hi, I'm trying to train Tesseract to work with a large library of Hebrew language documents. They are all good-quality scans, black and white, and most of them have the same font and character size. The Hebrew alphabet should be relatively simple for OCR: 27 characters, no Upper/Low