As I wrote, try searching for "text detection" (or "document analysis"). You will see that it is quite difficult and that there is almost no free/open-source solution. Something is implemented in tesseract, but (from my experience) it fails for complex pages like the one you provided. That is why the documentation suggests removing "noise" (non-text elements). You can try this by cropping your image to just the right (white) part; you will get significantly better results with default settings:

scanfor information and pairing Suggestions PRODUCED & BOTTLED BY SPRINGGATE® FARMS AND VINEYARD HARRISBURG, PA 17112 Www springgatevineyard.com 0812433! l GOVERNMENT WARNING: 1) ACCORDING 70 THE SURGEON GENERAL, WOMEN SHOULD NOT RINK ALCOHOLIC BEVERAGES DURNG PREGNANCY BECAUSE OF THE RISK OF BIRTH DEFECTS. (2) CONSUMPTION OF ALCOHOUC BEVERAGES INPARS YOUR ABLITY TODRNE ACAROR OPERATE MACHINERY, AND MAY CAUSE HEALTH PROBLEMS. CONTAINS SULFTES

There are still some problems (e.g. "I"), but IMO they are related to the quality of the image, so you cannot solve them with preprocessing (post-processing with a spellchecker may be a solution if you cannot get better input).

Zdenko

On Fri, Oct 22, 2021 at 15:44, Schuyler Reinken <xarly...@gmail.com> wrote:

> We already use python opencv2 to convert the image to remove color and do
> binarisation. I also tried to use erosion, but it showed no marked
> improvement. Now for this particular image it would be easy to remove the
> left side, but it is merely a sample and the text can occur in any part of
> the image in the actual application we are building. When you say OCR only
> text areas, does that mean you can run tesseract once in a different page
> segmentation mode to just create a bounding box, then run it again to
> actually get the text accurately?
>
> On Friday, October 22, 2021 at 12:56:51 AM UTC-4 zdenop wrote:
>
>> Generally: read and follow
>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
>>
>> Basically: pre-process the image: remove non-text elements, or OCR only
>> text areas (search the internet for "text detection")
>>
>> Zdenko
>>
>> On Thu, Oct 21, 2021 at 23:34, Schuyler Reinken <xarl...@gmail.com> wrote:
>>
>>> I'm using the English tessdata_best on Linux
>>>
>>> On Thursday, October 21, 2021 at 5:32:17 PM UTC-4 Schuyler Reinken wrote:
>>>
>>>> I am using tesseract 4.1.1 and the results on this image are as follows:
>>>> -----------------------------------------------------
>>>> roan
>>>> nian
>>>> Er
>>>> Preferred i)
>>>> PRODUCED & wa
>>>> SPRINGGATES
>>>> FARMS AND VINEYARD
>>>> Le
>>>> 1
>>>> Tome Son a Woon
>>>> Hui Sov vet Aoinii
>>>> BEVERAGES UF
>>>> a i od oR De pa 1
>>>> primi ett
>>>> ‘OPERATE MACHNERY, AND MAY CAUSE
>>>> 375 mL 7% ALC BY VOL REATH PROBES. COMANSSUFTES
>>>> Jon 2 To 5 GIP \Y » ) SIR VW, T=" Wa COO pn a TEES gemma
>>>> -----------------------------------------------------
>>>>
>>>> On Friday, October 15, 2021 at 10:30:10 AM UTC-4 Schuyler Reinken wrote:
>>>>
>>>>> Hello! I am having trouble using Tesseract to read inconsistently
>>>>> spaced text.
>>>>>
>>>>> It tends to miss entire lines of text in the government warning in
>>>>> the attached image. I don't need to read the blue angled text, only the
>>>>> stuff on the white sidebar. Is there a way to improve its reading of
>>>>> this sort of image?
>>>>> [image: SPRING GATE VINEYARD_a.jpg]
>>>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/123a18f9-c281-4063-b197-45a9a35e6090n%40googlegroups.com
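
P.S. The cropping suggestion at the top of this message could look roughly like the sketch below. It is only an illustration, not something taken from the tesseract documentation: the helper names `sidebar_box` and `ocr_sidebar` are made up, and the `start_fraction` value of 0.6 is a guess at where the white sidebar begins, so you would have to tune it for your image. It assumes the `opencv-python` and `pytesseract` packages are installed.

```python
def sidebar_box(width, height, start_fraction):
    """Return (left, top, right, bottom) for the strip to the right of start_fraction.

    start_fraction is a hypothetical tuning parameter: the share of the image
    width (from the left edge) that should be discarded before OCR.
    """
    if not 0.0 <= start_fraction < 1.0:
        raise ValueError("start_fraction must be in [0, 1)")
    left = int(width * start_fraction)
    return (left, 0, width, height)


def ocr_sidebar(image_path, start_fraction=0.6):
    """Crop the right-hand strip of the image and OCR it with default settings."""
    # Imported here so sidebar_box() can be used without OpenCV installed.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:  # cv2.imread returns None when the file cannot be read
        raise FileNotFoundError(image_path)

    height, width = image.shape[:2]
    left, top, right, bottom = sidebar_box(width, height, start_fraction)

    # Keep only the strip that contains the text, then drop the color channels.
    strip = image[top:bottom, left:right]
    gray = cv2.cvtColor(strip, cv2.COLOR_BGR2GRAY)

    return pytesseract.image_to_string(gray, lang="eng")


# Example call (file name and fraction are placeholders):
# print(ocr_sidebar("label.jpg", start_fraction=0.6))
```

A fixed fraction only works when the text always sits in the same part of the picture. If the text can appear anywhere, the crop box would have to come from a text detection step instead.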