Re: [tesseract-ocr] Re: Reading Inconsistently Spaced Text on a busy image

Zdenko Podobny Fri, 22 Oct 2021 11:14:32 -0700

As I wrote, try searching for "text detection" (or "document analysis"). You
will see that it is quite a difficult problem and that there is almost no
free/open-source solution.
Something is implemented in tesseract, but (from my experience) it fails
on complex pages like the one you provided. That is why the documentation
suggests removing "noise" (non-text elements). You can try this by cropping
your image to just the right (white) part; you will get significantly better
results with the default settings:


scanfor
information
and pairing
Suggestions



PRODUCED & BOTTLED BY
SPRINGGATE®
FARMS AND VINEYARD
HARRISBURG, PA 17112
Www springgatevineyard.com

0812433! l
GOVERNMENT WARNING: 1) ACCORDING
70 THE SURGEON GENERAL, WOMEN
SHOULD NOT RINK ALCOHOLIC
BEVERAGES DURNG PREGNANCY BECAUSE
OF THE RISK OF BIRTH DEFECTS. (2)
CONSUMPTION OF ALCOHOUC BEVERAGES
INPARS YOUR ABLITY TODRNE ACAROR
OPERATE MACHINERY, AND MAY CAUSE
HEALTH PROBLEMS. CONTAINS SULFTES
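
A minimal sketch of the crop suggested above, for cases where the white area
has to be located automatically. It assumes the text sits in one bright
vertical band and that the band is the longest run of bright pixel columns.
The names longest_bright_run and ocr_bright_sidebar and the threshold of 200
are hypothetical choices for illustration, not anything from tesseract, and
the threshold would need tuning on real images.

```python
def longest_bright_run(column_means, threshold=200):
    """Return (start, end) of the longest run of columns whose mean
    brightness is >= threshold; end is exclusive, (0, 0) if none."""
    best = (0, 0)
    start = None
    # A sentinel darker than any pixel closes a run that reaches the edge.
    for x, value in enumerate(list(column_means) + [float("-inf")]):
        if value >= threshold:
            if start is None:
                start = x
        elif start is not None:
            if x - start > best[1] - best[0]:
                best = (start, x)
            start = None
    return best


def ocr_bright_sidebar(path, threshold=200):
    """OCR only the brightest vertical band of the image at `path`."""
    # Imported lazily so the helper above works without these packages.
    import cv2
    import pytesseract

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(path)
    left, right = longest_bright_run(gray.mean(axis=0), threshold)
    if right <= left:
        # No bright band found: fall back to the whole image.
        return pytesseract.image_to_string(gray)
    return pytesseract.image_to_string(gray[:, left:right])
```

Column means are only a rough stand-in for real text detection: they will
not help when the text lies on a dark or patterned background.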

There are still some problems (e.g. "I"), but IMO they are related to the
quality of the image, so you cannot solve them with preprocessing (maybe
post-processing with a spellchecker would be a solution if you cannot get
better input).
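
As a rough illustration of the spellchecker idea, the sketch below snaps each
OCR word to the closest entry in a list of expected words, using Python's
standard difflib module. The word list, the function names and the 0.8
similarity cutoff are assumptions made for this example; a real application
would need its own vocabulary (or a proper spellchecking library) and some
tuning.

```python
import difflib
import re

# Hypothetical vocabulary: words we expect to see on this kind of label.
EXPECTED_WORDS = [
    "ALCOHOLIC", "BEVERAGES", "DURING", "PREGNANCY",
    "SULFITES", "CONTAINS", "DRINK",
]


def correct_word(word, vocabulary, cutoff=0.8):
    """Return the closest vocabulary word, or `word` itself if none is close."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, vocabulary, cutoff=0.8):
    """Apply correct_word to every alphabetic token, keeping everything else."""
    return re.sub(
        r"[A-Za-z]+",
        lambda m: correct_word(m.group(0), vocabulary, cutoff),
        text,
    )


if __name__ == "__main__":
    print(correct_text("CONSUMPTION OF ALCOHOUC BEVERAGES", EXPECTED_WORDS))
```

The comparison is case-sensitive, and a cutoff that is too low will start
"correcting" words that were read properly, so it is worth checking the
output against a few known labels.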

Zdenko


On Fri, 22 Oct 2021 at 15:44, Schuyler Reinken <xarly...@gmail.com> wrote:

> We already use python opencv2 to convert the image to remove color and do
> binarisation. I also tried to use erosion, but it showed no marked
> improvement. Now for this particular image it would be easy to remove the
> left side, but it is merely a sample and the text can occur in any part of
> the image in the actual application we are building. When you say OCR only
> text areas, does that mean you can run tesseract once in a different page
> segmentation mode to just create a bounding box, then run it again to
> actually get the text accurately?
>
> On Friday, October 22, 2021 at 12:56:51 AM UTC-4 zdenop wrote:
>
>> Generally: read and follow
>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
>>
>> Basically: pre-process the image: remove non-text elements, or OCR only
>> the text areas (search the internet for "text detection")
>>
>> Zdenko
>>
>>
>> On Thu, 21 Oct 2021 at 23:34, Schuyler Reinken <xarl...@gmail.com> wrote:
>>
>>> I'm using the english tessdata_best on linux
>>>
>>> On Thursday, October 21, 2021 at 5:32:17 PM UTC-4 Schuyler Reinken wrote:
>>>
>>>> I am using tesseract 4.1.1 and the results on this Image are as follows:
>>>> -----------------------------------------------------
>>>> roan
>>>> nian
>>>> Er
>>>> Preferred i)
>>>> PRODUCED & wa
>>>> SPRINGGATES
>>>> FARMS AND VINEYARD
>>>> Le
>>>> 1
>>>> Tome Son a Woon
>>>> Hui Sov vet Aoinii
>>>> BEVERAGES UF
>>>> a i od oR De pa 1
>>>> primi ett
>>>> ‘OPERATE MACHNERY, AND MAY CAUSE
>>>> 375 mL 7% ALC BY VOL REATH PROBES. COMANSSUFTES
>>>> Jon 2 To 5 GIP \Y » ) SIR VW, T=" Wa COO pn a TEES gemma
>>>>
>>>> -------------------------------------------------------------------------------------------------------------
>>>> On Friday, October 15, 2021 at 10:30:10 AM UTC-4 Schuyler Reinken wrote:
>>>>
>>>>>
>>>>> Hello! I am having trouble using Tesseract to read inconsistently
>>>>> spaced text.
>>>>>
>>>>> It tends to miss entire lines of text in the government warning in
>>>>> the attached image. I don't need to read the blue angled text, only the
>>>>> stuff on the white sidebar. Is there a way to improve its reading of
>>>>> this sort of image?
>>>>> [image: SPRING GATE VINEYARD_a.jpg]
>>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/123a18f9-c281-4063-b197-45a9a35e6090n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/123a18f9-c281-4063-b197-45a9a35e6090n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
