Hi,

Here are some image preprocessing steps you can try:

1. Binarization (adjust the threshold and its parameters)
2. Image resizing (OCR works best at a resolution of 300 DPI or more)
3. Dilation and erosion - adjust the thickness of the text strokes so that
OCR can read them better
4. Adjust the 'oem' and 'psm' parameters passed to the image_to_string
function
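
A rough sketch of the four steps above, assuming OpenCV (cv2), NumPy and
pytesseract are installed. The function names, the 150 DPI starting value,
the 2x2 kernel and the oem/psm defaults are only illustrative assumptions,
not values taken from your setup, so treat them as starting points to tune.

```python
def tesseract_config(oem=3, psm=7):
    """Build the config string for image_to_string (hypothetical helper).

    oem 3 is Tesseract's default engine mode; psm 7 treats the image as a
    single text line. Both are assumed defaults to experiment with.
    """
    return f"--oem {oem} --psm {psm}"


def scale_factor(current_dpi, target_dpi=300):
    """Upscaling factor needed to reach target_dpi (never downscales)."""
    return max(1.0, target_dpi / current_dpi)


def ocr_with_preprocessing(path, current_dpi=150, oem=3, psm=7):
    # Imported here so the two helpers above work without these packages.
    import cv2
    import numpy as np
    import pytesseract

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(path)

    # Step 2: upscale so the text is roughly at 300 DPI.
    f = scale_factor(current_dpi)
    gray = cv2.resize(gray, None, fx=f, fy=f, interpolation=cv2.INTER_CUBIC)

    # Step 1: binarize; Otsu picks the threshold automatically.
    _, binary = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )

    # Step 3: with dark text on a white background, erode thickens the
    # strokes and dilate thins them. Swap the call if text looks too bold.
    kernel = np.ones((2, 2), np.uint8)
    binary = cv2.erode(binary, kernel, iterations=1)

    # Step 4: pass the engine and page segmentation modes to Tesseract.
    return pytesseract.image_to_string(
        binary, config=tesseract_config(oem, psm)
    )
```

For a cropped single word such as 'JANUARY', it may be worth comparing psm
values 7 and 8 (single line vs. single word) on the same cropped image.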

Thanks

On Fri, 16 Sep, 2022, 6:34 pm Gabriel Sousa, <gabr...@gsousa.com.br> wrote:

> Thank you so much for the reply!
>
> It really helped me to know which path to take!
> I have already taken some of those steps, but now I know for a fact that
> I'm not crazy! hahaha
>
> 1 - It's a different version, BUT my coworker is running the same version
> of the container and gets the same results I get, even though my version
> is 5.0 and his is 4.1
> 2 - I will update that - great idea!
> 3 - I already have the sample data to improve image cropping, thank you!
> 4 - This is the next step to take
> 5 - What do you mean by preprocessing? I've noticed that the cropped image
> looks like it has had its font changed, as the spacing between letters in
> the cropped image is much bigger than in the original PDF. (Btw, the PDF we
> are working with is not image based, it's text, what we call here a
> "compiled" into text type of PDF).
>
> Once again, thank you so much for the comments!
>
> Have a good one!
>
> On Friday, September 16, 2022 at 8:44:28 AM UTC-3 vcj...@gmail.com wrote:
>
>> Hi
>> This is a common occurrence. Code that works properly on a local machine
>> may not work as expected once dockerized. Please check the following:
>>
>> 1. The Tesseract version you build in the Dockerfile is the same as on
>> the local machine.
>> 2. If the input is a PDF, try changing the version of the pdf2image
>> library in the requirements file.
>> 3. Enable debugging, build the Docker image on the local machine, open
>> the container and copy the debug images to the local machine. Compare
>> these with the debug images produced by running the code locally.
>> 4. Now you can identify which stage is having issues and start working on
>> that.
>> 5. Most likely some preprocessing corrections will give proper results in
>> Docker.
>>
>> All the best
>>
>>
>> On Fri, 16 Sep, 2022, 12:06 pm Gabriel Sousa, <gab...@gsousa.com.br>
>> wrote:
>>
>>> Hi there,
>>>
>>> I'm new to this group, and also new to using Tesseract in general.
>>>
>>> We use pytesseract for a few data extraction cases (not many) at the
>>> company I work for, and for no apparent reason, Tesseract text extraction
>>> stopped working from one deploy to another.
>>>
>>> It should extract a word such as 'JANUARY' - and it used to do this just
>>> fine, but now it reads 'J ANUARY'. The service is running on a docker
>>> container. So everything is using the same version from the last time all
>>> the tests passed, and now it just breaks when testing something that was
>>> not altered at all.
>>>
>>> Not sure if this is even possible, to me this just seems impossible, but
>>> I've gone as far as I can in terms of checking code and dependencies and
>>> nothing fixes the problem. Outside the container, everything works just
>>> fine, Tesseract only returns a wrong value INSIDE the container...
>>>
>>> Anyway... any thoughts on how to look for a fix, or any ideas overall,
>>> are welcome.
>>>
>>> Thank you
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/90c76920-8375-4c28-9d99-8f8f8249a14fn%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/90c76920-8375-4c28-9d99-8f8f8249a14fn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>

