[tesseract-ocr] Word confidence versus symbol confidence

2018-10-11 Thread farhad khalafi
I am totally puzzled with how the confidence reported at Word level relates to the confidences assigned to the characters of the same word. I used the attached TIFF image to recognize a simple MICR line of a check. The recognized text had two words: 495096 70b01b205xX0eL7010717 The c

Re: [tesseract-ocr] Word confidence versus symbol confidence

2018-10-12 Thread farhad khalafi
Dasgupta wrote:
> Could you tell how did you get the confidence percentiles? I would like to know that. :)
>
> On Fri, Oct 12, 2018 at 10:10 AM farhad khalafi wrote:
>> I am totally puzzled with how the confidence reported at Word level relates to the confi

Re: [tesseract-ocr] Word confidence versus symbol confidence

2018-10-12 Thread farhad khalafi
    /// <summary>
    /// Gets the confidence percentile for the current element at the specified level.
    /// </summary>
    public float GetConfidence(PageIteratorLevel level)
    {
        return TessApi.TessResultIteratorConfidence(Handle, level);
    }

I use a custom .NET layer s

[tesseract-ocr] Announcement: introducing TesseractStudio.Net, a free Windows GUI for Tesseract 4.0

2018-11-08 Thread farhad khalafi
ge PDF components, such as visible text. - Integrated spell checker. Download: https://github.com/OpaitSoftware/TesseractStudio.Net Thank you, Farhad Khalafi Opait Software -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsub

[tesseract-ocr] OCR of FAX Images

2018-12-04 Thread farhad khalafi
Hello, Some older fax machines used different DPI in horizontal and vertical directions. This often resulted in images with jagged lines as in the file that I have attached to this post. I am trying to find a way to smooth these out to improve OCR accuracy. The images are already in 1-bit mono

[tesseract-ocr] Re: OCR of FAX Images

2018-12-06 Thread farhad khalafi
Thanks for your suggestion. Do you know the specific ImageMagick commands that will achieve this shifting of lines? One idea I had was to fill in small holes that were bounded on three sides and also remove speckles that were attached on only one side. But this will probably distort the charac
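The idea of filling holes bounded on three sides and removing speckles attached on only one side can be sketched as a single pass over a 1-bit image. This is a minimal illustration, not an ImageMagick or Leptonica recipe: the image is assumed to be a list of rows with 1 for ink and 0 for background, only the four direct neighbours are considered, and the names `neighbours` and `smooth_once` are hypothetical.

```python
def neighbours(img, r, c):
    """Return the up, down, left and right neighbour values.

    Positions outside the image are treated as background (0).
    """
    h, w = len(img), len(img[0])
    values = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        values.append(img[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0)
    return values


def smooth_once(img):
    """Apply one smoothing pass and return a new image.

    - A background pixel with ink on at least three sides is filled.
    - An ink pixel with ink on at most one side is removed
      (this also removes fully isolated pixels).
    Decisions are based on the original image, not on partial results.
    """
    result = [row[:] for row in img]
    for r in range(len(img)):
        for c in range(len(img[0])):
            ink = sum(neighbours(img, r, c))
            if img[r][c] == 0 and ink >= 3:
                result[r][c] = 1
            elif img[r][c] == 1 and ink <= 1:
                result[r][c] = 0
    return result


# A thick bar with a one-pixel bump on top and a one-pixel notch below.
bar = [
    [0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1],
]
smoothed = smooth_once(bar)
```

Note that the removal rule also trims the end pixel of any one-pixel-wide stroke, which is one way such a filter can distort thin characters.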

[tesseract-ocr] Re: Announcement: introducing TesseractStudio.Net, a free Windows GUI for Tesseract 4.0

2019-01-09 Thread farhad khalafi
expression based rules engine. - UI improvements and several bug fixes. Download: https://github.com/OpaitSoftware/TesseractStudio.Net Thank you, Farhad Khalafi Opait Software

Re: [tesseract-ocr] Tesseract's binarization

2019-01-26 Thread farhad khalafi
I believe Tesseract uses the Otsu algorithm to find a single threshold value. The threshold is used to set gray pixels to binary values depending on whether a gray level is below or above the threshold. I use Leptonica for my own pre-OCR image processing and the performance seems to be fine. On Sat,
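For readers unfamiliar with the technique, here is a minimal sketch of textbook Otsu thresholding on a gray-level histogram. It is not Tesseract's or Leptonica's actual code; the function names and the choice that pixels above the threshold map to 1 are assumptions made for the illustration.

```python
def otsu_threshold(histogram):
    """Return the gray level t that maximises the between-class variance.

    histogram[i] is the number of pixels with gray level i. Pixels with
    level <= t form one class, pixels with level > t the other.
    """
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    best_t, best_var = 0, -1.0
    w_low = 0      # number of pixels at or below the candidate threshold
    sum_low = 0    # sum of gray levels of those pixels
    for t, count in enumerate(histogram):
        w_low += count
        sum_low += t * count
        if w_low == 0:
            continue          # no pixels in the lower class yet
        w_high = total - w_low
        if w_high == 0:
            break             # every pixel is already in the lower class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        between_var = w_low * w_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map each gray pixel to 1 if it is above the threshold, else 0."""
    return [[1 if p > threshold else 0 for p in row] for row in pixels]


# Tiny example: a 2x4 image with dark (10) and light (200) pixels.
image = [[10, 200, 10, 200], [200, 10, 200, 10]]
hist = [0] * 256
for row in image:
    for p in row:
        hist[p] += 1
t = otsu_threshold(hist)
binary = binarize(image, t)
```

Because a single global threshold is used, this approach can struggle on pages with uneven illumination, where a locally adaptive method may work better.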

[tesseract-ocr] Re: Tesseract OCR not performing well even after data cleaning and transformations on black background data

2019-01-30 Thread farhad khalafi
wrote:
> On Wednesday, January 30, 2019 at 2:49:49 PM UTC+5:30, farhad khalafi wrote:
>> @Smriti: In the latest version (1.3.0) of our free Tesseract Studio <https://github.com/OpaitSoftware/TesseractStudio.Net>, we have an experim

[tesseract-ocr] Re: Announcement: introducing TesseractStudio.Net, a free Windows GUI for Tesseract 4.0

2019-01-30 Thread farhad khalafi
the PDF file (with some overhead). - Format OCR and other text to approximate the original layout (without graphics). - Some bug fixes. Download: https://github.com/OpaitSoftware/TesseractStudio.Net Thank you, Farhad Khalafi Opait Software

Re: [tesseract-ocr] Spacing extracted text horizontally

2019-03-05 Thread farhad khalafi
Please provide a sample. If you are only extracting text from the image, the text itself will have the words separated by single spaces. If, however, you are extracting block elements (e.g. individual words), they come with attributes such as Text, Font size, Bounding rectangle, etc. Using the bo
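As a rough illustration of how word bounding rectangles could be used to approximate horizontal spacing, the sketch below places each word at a text column derived from its left edge. The input format (a list of `(text, left_x)` tuples for one line), the fixed `char_width`, and the function name `layout_line` are all assumptions made for this example, not part of any Tesseract API.

```python
def layout_line(words, char_width):
    """Lay out one text line so that word positions mirror the image.

    words: list of (text, left_x) tuples, left_x in pixels.
    char_width: assumed average character width in pixels.
    """
    line = ""
    for text, left in sorted(words, key=lambda w: w[1]):
        col = int(left // char_width)   # target column for this word
        # Keep at least one space between consecutive words.
        if line and col <= len(line):
            col = len(line) + 1
        line = line.ljust(col) + text
    return line


# Two words far apart on the page keep their gap in the text output.
row = layout_line([("Total", 0), ("42.00", 200)], char_width=10)
```

With a proportional font the column estimate is only approximate, so the result is best viewed in a fixed-pitch font.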

[tesseract-ocr] Re: Announcement: introducing TesseractStudio.Net, a free Windows GUI for Tesseract 4.0

2019-03-26 Thread farhad khalafi
Studio.Net Thank you, Farhad Khalafi Opait Software

[tesseract-ocr] Re: Announcement: introducing TesseractStudio.Net, a free Windows GUI for Tesseract 4.0

2019-05-01 Thread farhad khalafi
. This version avoids the temporaries for better performance. Download: https://github.com/OpaitSoftware/TesseractStudio.Net Thank you, Farhad Khalafi Opait Software

Re: [tesseract-ocr] Multiple jpg files into 1 editable pdf

2019-05-08 Thread farhad khalafi
One option is to first convert the 200 images into a single PDF file and then run the PDF through TesseractStudio.Net, which will OCR all pages and generate a searchable PDF file with various options. Make sure that the scanning is in at least 200 (preferably 300) D

[tesseract-ocr] DISABLED_LEGACY_CODE questions

2019-07-26 Thread farhad khalafi
I also have a few questions:
1. Is there a function similar to DetectOrientationScript in the new engine? That function is excluded when DISABLED_LEGACY_CODE is set.
2. I know that you can get detected orientation, writing direction, etc. when you use segmentation mode 1 (PSM_AUTO_OSD) and d

Re: [tesseract-ocr] DISABLED_LEGACY_CODE questions

2019-07-27 Thread farhad khalafi
> Leptonica also has functions for deskew (look for pixDeskew) and dewarp (dewarpSinglePage) of images...
>
> Zdenko
>
> On Sat, 27 Jul 2019 at 0:21, farhad khalafi wrote:
>> I also have a few questions:
>>
>> 1. Is there a function similar to Det

[tesseract-ocr] Re: What does word level confidence of zero mean?

2019-08-18 Thread farhad khalafi
I tried the gImageReader utility with similar results. A screenshot is attached.

Re: [tesseract-ocr] Re: How to process PDF files line by line with tesseract

2019-11-08 Thread farhad khalafi
I also experienced a similar problem with images especially if they used fixed-pitch fonts (older scanned documents often did). Tesseract groups characters vertically assuming rotated text. I used PSM 6 instead of 3 with some improvement, but it did miss significant portions of text in return. I wa

[tesseract-ocr] Page Orientation Question

2020-03-30 Thread farhad khalafi
The Tesseract API has the following two functions: TessBaseAPIDetectOrientationScript and TessPageIteratorOrientation. I need to figure out the overall page orientation for an image and am a bit confused about these two functions. When is each more appropriate to use? I am accessing these with P/Invoke from

Re: [tesseract-ocr] Tesseract speed

2020-05-07 Thread farhad khalafi
3-4 seconds for a single page is probably not that slow depending on the page content and layout. We have a huge OCR project with approximately 16 million images to process. Our configuration has 4 virtual machines each with 8 cores and 16GB memory. They work against a single input queue and p
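To put these numbers in perspective, here is a back-of-the-envelope estimate of total processing time. It assumes one OCR worker per core, perfect scaling across machines, and no queueing or I/O overhead; none of these assumptions comes from the post, and the function name `estimated_days` is hypothetical.

```python
def estimated_days(num_images, seconds_per_image, machines, cores_per_machine):
    """Estimate wall-clock days, assuming one OCR worker per core
    and perfectly even distribution of work."""
    workers = machines * cores_per_machine
    total_seconds = num_images * seconds_per_image / workers
    return total_seconds / 86400  # seconds in a day


# 16 million images, 4 machines with 8 cores each.
low = estimated_days(16_000_000, 3.0, 4, 8)
high = estimated_days(16_000_000, 4.0, 4, 8)
```

Under these idealized assumptions, 3 to 4 seconds per page works out to roughly 17 to 23 days for the whole batch; real throughput would depend on page content and overhead.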