Hi,

On 12/02/2022 22:13, Alberto Simoes wrote:
> Hi
>
> I am OCRing a lot of documents. I have a document with very poor quality, and surely nothing will be recognized. But I need a stable pipeline, and while I was expecting tesseract just to return an empty document, I am getting this error:
>
> Detected 958 diacritics
> Error during processing.
>
> Is there anything I can do to use tesseract more reliably, without the chance of getting it to just die?

You can try using a different binarisation method, or cleaning up the images before doing OCR. Do you have an example you can share?

Tesseract 5.0.0 should support -c thresholding_method=2 (Sauvola binarisation instead of the default Otsu), and you can additionally pass --dpi 300 (or whatever the actual resolution of your image is). That might make it more robust even without pre-processing your images.
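
As a rough illustration of the options above, here is a minimal sketch that calls the tesseract command line from Python and falls back to an empty string when the run fails. The helper names (tesseract_args, ocr_or_empty), the 120 second timeout and the default of 300 DPI are assumptions made for the example, not anything tesseract prescribes.

```python
import subprocess


def tesseract_args(image_path, dpi=300, thresholding_method=2,
                   tesseract_cmd="tesseract"):
    """Build the command line; the output base 'stdout' prints the text."""
    return [
        tesseract_cmd, image_path, "stdout",
        "--dpi", str(dpi),
        "-c", f"thresholding_method={thresholding_method}",
    ]


def ocr_or_empty(image_path, timeout=120, **kwargs):
    """Return the recognised text, or '' if tesseract cannot be run or fails."""
    try:
        result = subprocess.run(
            tesseract_args(image_path, **kwargs),
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or the run took too long.
        return ""
    if result.returncode != 0:
        # Tesseract reported a failure such as "Error during processing."
        return ""
    return result.stdout
```

Treating a non-zero exit status as "no text" keeps the pipeline going; you may prefer to log result.stderr as well so the bad pages can be inspected later.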

> By the way, I am using it through pytesseract, but I do not think that is the problem.

I don't know if pytesseract supports these extra options, so you might have to fiddle with that.
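
For what it's worth, pytesseract's image_to_string takes a config string that is passed on to the tesseract command line, so something along these lines might work. This is an untested sketch: the helper names and default values are made up for the example, and it assumes pytesseract and a Tesseract 5 binary are installed.

```python
def tesseract_config(dpi=300, thresholding_method=2):
    """Build the extra command-line options as a single config string."""
    return f"--dpi {dpi} -c thresholding_method={thresholding_method}"


def image_to_text_or_empty(image_path, config=None, timeout=120):
    """Run pytesseract on a file path; return '' if tesseract fails."""
    # Imported here so this module still loads where pytesseract is absent.
    import pytesseract

    if config is None:
        config = tesseract_config()
    try:
        return pytesseract.image_to_string(
            image_path, config=config, timeout=timeout
        )
    except (pytesseract.TesseractError, RuntimeError):
        # TesseractError: non-zero exit status from the tesseract binary.
        # RuntimeError: pytesseract's timeout was reached.
        return ""
```

Catching the error per document means one unreadable scan yields an empty result instead of stopping the whole batch.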

Regards,
Merlijn
