[tesseract-ocr] Handling text scans and cleaning

Ajinkya Bobade Mon, 07 Apr 2025 21:09:38 -0700

I have noticed that text cleaning is the most difficult part of an OCR
pipeline. I have struggled a lot with this part: without properly cleaned
text, OCR accuracy simply falls apart. To handle text cleaning separately,
I created a GitHub repo that uses AI to clean up all the text in an
image. Once the text is cleaned, we can run our own custom OCR models on
it. I have personally seen OCR accuracy shoot up to 99% on a properly
preprocessed and cleaned image.
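
To make the clean-first, OCR-second idea concrete, here is a minimal sketch. It is not the ClearText code: a classical Otsu binarization stands in for the AI cleaning step, and the names `otsu_threshold` and `binarize` are hypothetical helpers made up for this illustration. It assumes the image is a grayscale grid of integers in the range 0..255.

```python
from typing import List

# Assumption: a grayscale image is a list of rows with values in 0..255.
Image = List[List[int]]


def otsu_threshold(image: Image) -> int:
    """Return the Otsu threshold t; pixels with value <= t form the dark class."""
    hist = [0] * 256
    for row in image:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of intensities in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break  # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image: Image) -> Image:
    """Map pixels at or below the Otsu threshold to 0 (ink), the rest to 255."""
    t = otsu_threshold(image)
    return [[0 if value <= t else 255 for value in row] for row in image]


if __name__ == "__main__":
    noisy = [[10, 12, 200], [11, 210, 205]]
    print(binarize(noisy))
```

The binarized grid would then be handed to whichever OCR model you prefer. A learned cleaner can replace `binarize` without changing the second stage, which is the point of keeping the two steps separate.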


Here is the GitHub link: https://github.com/ajinkya933/ClearText

Regards
Ajinkya

