Re: [tesseract-ocr] Tesseract changed behaviour inside Docker

Gabriel Sousa Wed, 21 Sep 2022 04:09:46 -0700

Thank you so much for the tips and the support!

We will work on implementing those improvements on OCR asap...


Thank you very much!!

On Friday, September 16, 2022 at 12:06:19 PM UTC-3 [email protected] wrote:

> Hi,
>
> There are some image preprocessing steps you can attempt:
>
> 1. Binarization (adjust the threshold and its parameters)
> 2. Image resizing (OCR works best at 300 DPI or more)
> 3. Dilation and erosion - adjust the text boundary sizes so that OCR can 
> read the text better
> 4. Adjust the 'oem' and 'psm' parameters passed to your image_to_string 
> call
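
The four suggestions above can be sketched roughly as follows. This is a minimal illustration, not the poster's code: it assumes OpenCV (`cv2`), NumPy and pytesseract are installed, and the helper names, the 150 DPI starting point, the 2x2 kernel and the default `psm` value are all made up for the example and would need tuning per image.

```python
def build_tesseract_config(oem: int = 3, psm: int = 6) -> str:
    """Build the config string that pytesseract forwards to the tesseract CLI."""
    if oem not in range(0, 4):
        raise ValueError("oem must be between 0 and 3")
    if psm not in range(0, 14):
        raise ValueError("psm must be between 0 and 13")
    return f"--oem {oem} --psm {psm}"


def scale_for_target_dpi(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Return the upscale factor needed to reach target_dpi (never below 1.0)."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def preprocess_and_ocr(path: str, current_dpi: float = 150.0, psm: int = 7) -> str:
    """Resize, binarize and thicken the text, then run OCR (hypothetical helper)."""
    # Imported lazily so the two helpers above work without these packages.
    import cv2
    import numpy as np
    import pytesseract

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(path)

    # Suggestion 2: upscale so the text is rendered at roughly 300 DPI.
    factor = scale_for_target_dpi(current_dpi)
    gray = cv2.resize(gray, None, fx=factor, fy=factor,
                      interpolation=cv2.INTER_CUBIC)

    # Suggestion 1: binarize; Otsu picks the threshold automatically.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Suggestion 3: with dark text on a white background, erosion thickens
    # the strokes and dilation thins them. Treat this step as an experiment.
    kernel = np.ones((2, 2), np.uint8)
    binary = cv2.erode(binary, kernel, iterations=1)

    # Suggestion 4: pass the engine mode and page segmentation mode explicitly.
    return pytesseract.image_to_string(
        binary, config=build_tesseract_config(psm=psm))
```

Passing `--oem` and `--psm` explicitly keeps the call from depending on whatever defaults the installed Tesseract build happens to use.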
>
> Thanks
>
> On Fri, 16 Sep, 2022, 6:34 pm Gabriel Sousa, <[email protected]> wrote:
>
>> Thank you so much for the reply!
>>
>> It really helped me to know which path to take!
>> I have already taken some of those steps, but now I know for a fact that 
>> I'm not crazy! hahaha
>>
>> 1 - It's a different version, BUT my coworker is running the same version 
>> of the container and gets the same results I get - even though my version 
>> is 5.0 and his is 4.1
>> 2 - I will update that - great idea!
>> 3 - I already have the sample data to improve image cropping, thank you!
>> 4 - This is the next step to take
>> 5 - What do you mean by preprocessing? I've noticed that the cropped 
>> image looks like it has had its font changed, as the spacing between 
>> letters in the cropped image is much bigger than in the original PDF. (Btw, 
>> the PDF we are working with is not image based, it's text, what we call 
>> here a "compiled into text" type of PDF.)
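
For point 1, one way to make the version difference visible is to log the version of the `tesseract` binary at startup, both locally and inside the container. The sketch below uses only the standard library and assumes the binary is on `PATH`; the function names are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_tesseract_version(version_output: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, ...) from the first line of `tesseract --version`."""
    stripped = version_output.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", first_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_tesseract_version() -> Optional[Tuple[int, ...]]:
    """Return the version of the tesseract binary on PATH, or None if absent."""
    binary = shutil.which("tesseract")
    if binary is None:
        return None
    result = subprocess.run([binary, "--version"],
                            capture_output=True, text=True, check=False)
    # Some builds print the banner on stderr, so fall back to that stream.
    return parse_tesseract_version(result.stdout or result.stderr)
```

Printing `installed_tesseract_version()` in both environments shows at a glance whether the container and the local machine run the same major version.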
>>
>> Once again, thank you so much for the comments!
>>
>> Have a good one!
>>
>> On Friday, September 16, 2022 at 8:44:28 AM UTC-3 [email protected] wrote:
>>
>>> Hi
>>> This happens fairly often. Code that works properly on a local machine may 
>>> not work as expected once dockerized. Please check the following:
>>>
>>> 1. The Tesseract version you build in the Dockerfile is the same as on the 
>>> local machine.
>>> 2. If the input is a PDF, try changing the version of the pdf2image library 
>>> in the requirements file.
>>> 3. Enable debugging, build the Docker image on the local machine, open the 
>>> container, and copy the debugging images to the local machine. Compare 
>>> these with the debug images taken from the local code run.
>>> 4. Now you can identify which stage is having issues and start working on 
>>> that.
>>> 5. Most likely some preprocessing corrections will give proper results in 
>>> Docker.
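
Steps 3 and 4 can be partly automated once the debug images from the container have been copied out (for example with `docker cp`). The sketch below is only an illustration: it assumes each pipeline stage writes a PNG with the same file name in both runs, and the directory and function names are hypothetical. It compares files byte for byte, so two images that look identical but were encoded differently will still be reported.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def hash_images(directory: str, pattern: str = "*.png") -> Dict[str, str]:
    """Map each image file name in `directory` to the SHA-256 of its bytes."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(directory).glob(pattern))
    }


def differing_stages(local_dir: str, container_dir: str) -> List[str]:
    """Return file names that are missing on one side or differ byte for byte."""
    local = hash_images(local_dir)
    container = hash_images(container_dir)
    names = sorted(set(local) | set(container))
    return [name for name in names if local.get(name) != container.get(name)]
```

If the stage files are named so that they sort in pipeline order, the first name in the returned list is the earliest stage where the two environments diverge, which is where to start looking.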
>>>
>>> All the best
>>>
>>>
>>> On Fri, 16 Sep, 2022, 12:06 pm Gabriel Sousa, <[email protected]> 
>>> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I'm new to this group, and also new to using Tesseract in general.
>>>>
>>>> We use pytesseract for a few data extraction cases (not many) at the 
>>>> company I work for, and for no apparent reason Tesseract text extraction 
>>>> stopped working from one deploy to another.
>>>>
>>>> It should extract a word such as 'JANUARY' - and it used to do this 
>>>> just fine, but now it reads 'J ANUARY'. The service is running in a Docker 
>>>> container, so everything is using the same versions as the last time all 
>>>> the tests passed, and now it just breaks when testing something that was 
>>>> not altered at all.
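
A result like 'J ANUARY' has every character right and only an extra space, which may point to a spacing or segmentation problem rather than misread letters. A small check like the one below (a hypothetical helper, not part of pytesseract) can separate the two kinds of failure in a test suite.

```python
def differs_only_by_whitespace(expected: str, actual: str) -> bool:
    """True when the strings differ but match once all whitespace is removed."""
    if expected == actual:
        return False
    return "".join(expected.split()) == "".join(actual.split())
```

Called with `"JANUARY"` and `"J ANUARY"` it returns `True`, while a genuine character error such as `"JANUAPY"` returns `False`.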
>>>>
>>>> Not sure if this is even possible; to me it just seems impossible, 
>>>> but I've gone as far as I can in terms of checking code and dependencies 
>>>> and nothing fixes the problem. Outside the container, everything works 
>>>> just fine; Tesseract only returns a wrong value INSIDE the container...
>>>>
>>>> Anyways... any thoughts regarding how to look for a fix or any ideas 
>>>> overall are welcome.
>>>>
>>>> Thank you
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/90c76920-8375-4c28-9d99-8f8f8249a14fn%40googlegroups.com
>>>>

