Hi,

There are some image preprocessing steps you can try:

1. Binarization (adjust the threshold and its parameters).
2. Image resizing (OCR works best at 300 DPI or more).
3. Dilation and erosion: adjust the thickness of the text strokes so that OCR can read them better.
4. Adjust the 'oem' and 'psm' parameters of your image_to_string call.

Thanks

On Fri, 16 Sep, 2022, 6:34 pm Gabriel Sousa, <gabr...@gsousa.com.br> wrote:

> Thank you so much for the reply!
>
> It really helped me to know which path to take!
> I have already taken some of those steps, but now I know for a fact that
> I'm not crazy! hahaha
>
> 1 - It's a different version, BUT my coworker is running the same version
> of the container and gets the same results I get, even though my version is
> 5.0 and his is 4.1
> 2 - I will update that - great idea!
> 3 - I already have the sample data to improve image cropping, thank you!
> 4 - This is the next step to take
> 5 - What do you mean by preprocessing? I've noticed that the cropped image
> looks like it has its font changed, as the spacing between letters of the
> cropped image is much bigger than in the original PDF. (Btw, the PDF we are
> working with is not image-based, it's text, what we call here a "compiled"
> into text type of PDF).
>
> Once again, thank you so much for the comments!
>
> Have a good one!
>
> On Friday, September 16, 2022 at 8:44:28 AM UTC-3 vcj...@gmail.com wrote:
>
>> Hi
>> This happens normally. Something that works properly on a local machine may
>> not work as expected once dockerised. Please check the following:
>>
>> 1. The Tesseract version you build in the Dockerfile is the same as on the
>> local machine.
>> 2. If the input is a PDF, try changing the version of the pdf2image library
>> in the requirements file.
>> 3. Enable debugging, build the Docker image on the local machine, open the
>> container and copy the debugging images to the local machine. Compare these
>> with the debug images taken from the local code run.
>> 4. Now you can identify which stage is having issues and start working on
>> that.
>> 5. Most likely some preprocessing corrections will give proper results in
>> Docker.
>>
>> All the best
>>
>> On Fri, 16 Sep, 2022, 12:06 pm Gabriel Sousa, <gab...@gsousa.com.br>
>> wrote:
>>
>>> Hi there,
>>>
>>> I'm new to this group, and also new to using Tesseract in general.
>>>
>>> We use py-tesseract for a few data extractions, not many cases, at the
>>> company I work for and, for no apparent reason, Tesseract text extraction
>>> stopped working from one deploy to another.
>>>
>>> It should extract a word such as 'JANUARY' - and it used to do this just
>>> fine, but now it reads 'J ANUARY'. The service is running in a Docker
>>> container. So everything is using the same version as the last time all
>>> the tests passed, and now it just breaks when testing something that was
>>> not altered at all.
>>>
>>> Not sure if this is even possible, to me this just seems impossible, but
>>> I've gone as far as I can in terms of checking code and dependencies and
>>> nothing fixes the problem. Outside the container, everything works just
>>> fine, Tesseract only returns a wrong value INSIDE the container...
>>>
>>> Anyway... any thoughts on how to look for a fix, or any ideas
>>> overall, are welcome.
>>>
>>> Thank you
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/90c76920-8375-4c28-9d99-8f8f8249a14fn%40googlegroups.com
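For reference, the four preprocessing suggestions at the top of this reply could be sketched roughly as below. This is only an illustration, not the exact pipeline anyone in the thread is running. It assumes OpenCV (`cv2`), NumPy and pytesseract are installed and that the input is an 8-bit grayscale image with dark text on a light background. The function names (`tesseract_config`, `scale_for_dpi`, `preprocess`, `ocr`), the 150 DPI default, the Otsu threshold and the 2x2 kernel are all hypothetical choices made for the example, so tune them against your own debug images.

```python
def tesseract_config(oem: int = 3, psm: int = 6) -> str:
    """Build the config string for pytesseract.image_to_string (step 4)."""
    return f"--oem {oem} --psm {psm}"


def scale_for_dpi(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Upscale factor needed to reach target_dpi; never downscales (step 2)."""
    return max(1.0, target_dpi / current_dpi)


def preprocess(gray, current_dpi: float = 150.0):
    """Resize, binarize and clean up an 8-bit grayscale image.

    current_dpi is an assumed value; use the DPI your PDF is rendered at.
    """
    # Imported here so the two helpers above work without OpenCV installed.
    import cv2
    import numpy as np

    # Step 2: upscale towards ~300 DPI.
    factor = scale_for_dpi(current_dpi)
    resized = cv2.resize(gray, None, fx=factor, fy=factor,
                         interpolation=cv2.INTER_CUBIC)

    # Step 1: binarize. Otsu picks the threshold automatically; a fixed
    # threshold value is the other parameter worth experimenting with.
    _, binary = cv2.threshold(resized, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Step 3: erosion followed by dilation (a morphological opening).
    # On dark-on-light text, erosion thickens the strokes and the dilation
    # thins them back, which fills small light gaps inside the letters.
    kernel = np.ones((2, 2), np.uint8)
    eroded = cv2.erode(binary, kernel, iterations=1)
    return cv2.dilate(eroded, kernel, iterations=1)


def ocr(gray, current_dpi: float = 150.0, oem: int = 3, psm: int = 6) -> str:
    """Run Tesseract on the preprocessed image with explicit oem/psm."""
    import pytesseract

    return pytesseract.image_to_string(
        preprocess(gray, current_dpi), config=tesseract_config(oem, psm)
    )
```

For a crop that holds a single word such as 'JANUARY', trying `psm=7` (single text line) or `psm=8` (single word) instead of the default may be worth a test, since the page segmentation mode affects how Tesseract splits the image into words.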