[tesseract-ocr] Assistance Needed with OCR of Arabic Digits

Sara Elshobaky Wed, 22 Jan 2025 00:54:38 -0800

Hi,

I’m working on OCR for Arabic-Indic digits (٠-٩, often called Hindi
numerals) extracted from table cells in old documents. After cropping the
table cells, I OCR each cell individually. However, I’ve noticed that
some digits are repeated in the recognized output.


When I visualized the coordinates of each digit, I found that extra
bounding boxes had been generated for some of them. Do you have any
suggestions for resolving this issue?

I’ve attached the visual results, highlighting the inaccuracies with red
circles, along with the original cell image for your reference.

I am using a fine-tuned version of the Arabic.traineddata model, which was
tuned on training lines from the same collection of books that I’m
OCRing. OCR is run with PSM=6 (single uniform block of text) and OEM=1
(LSTM engine only).
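
One possible post-processing workaround for the extra boxes is sketched below. It is only a minimal sketch under assumptions: each recognized symbol is taken to be a (text, (left, top, right, bottom)) pair, which is the shape that symbol-level iteration in tesserocr can yield, and the extra boxes are assumed to carry the same text as a neighbouring box and to overlap it heavily. The function names and the 0.5 overlap threshold are hypothetical and have not been tuned on this data.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]   # assumed order: (left, top, right, bottom)
Symbol = Tuple[str, Box]          # (recognized text, bounding box)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drop_duplicate_symbols(symbols: List[Symbol],
                           threshold: float = 0.5) -> List[Symbol]:
    """Keep the first of any same-text symbols whose boxes overlap heavily.

    Digits that are genuinely repeated side by side have (almost)
    non-overlapping boxes, so they are kept.
    """
    kept: List[Symbol] = []
    for text, box in symbols:
        is_duplicate = any(
            text == kept_text and iou(box, kept_box) >= threshold
            for kept_text, kept_box in kept
        )
        if not is_duplicate:
            kept.append((text, box))
    return kept
```

This only hides the symptom after recognition; it does not explain why the engine emits the extra boxes in the first place.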

Tesseract 5.5.0
Python 3.13.1
tesserocr 2.7.1
Leptonica 1.82.0
Thank you!

Sara Elshobaky


[image: ar_nums2.png][image: 428_349_160_751.png]

