[tesseract-ocr] Incorrect recognition of Latin words inside Arabic text

Naourass Derouichi Fri, 02 Sep 2022 11:12:30 -0700

Hi all, I'm trying to OCR images similar to the attached one, but the error 
rate on Latin words is too high.


I tried all PSMs with the following models from tessdata_best: *ara*, *eng*, 
*fra*, *Ara* (in different orders). I even tried fine-tuning them on the 
font used in the input images.
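
For anyone who wants to reproduce the sweep, here is a minimal sketch of how the language orders and PSMs could be enumerated. It only builds the command lines; it does not run Tesseract. The image name `sample.png`, the helper name `build_commands`, and the PSM values in the usage example are placeholders of mine, not what I actually ran. It assumes the standard `tesseract` CLI options `-l` (models joined with `+`), `--psm`, and `--tessdata-dir`.

```python
from itertools import permutations


def build_commands(image, langs, psms, tessdata_dir=None):
    """Return one tesseract argv list per (language order, PSM) pair.

    `image`, `langs`, `psms` and `tessdata_dir` are illustrative inputs;
    nothing here is executed, the lists are only constructed.
    """
    commands = []
    for order in permutations(langs):
        for psm in psms:
            # "stdout" as the output base makes tesseract print the text.
            cmd = ["tesseract", image, "stdout",
                   "-l", "+".join(order),
                   "--psm", str(psm)]
            if tessdata_dir:
                # Point at a local copy of tessdata_best (path is hypothetical).
                cmd += ["--tessdata-dir", tessdata_dir]
            commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Placeholder image name and PSM subset, for illustration only.
    for cmd in build_commands("sample.png", ["ara", "eng", "fra"], [3, 4, 6]):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` to collect the output for comparison across configurations.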

*Sample output (error in bold):*
قرارلمجلس المنافسة عدد 0028/ق/2022 صبادر25 من شعبان 1443
(28 مارس 2022) والمتعلق بتولي الشركة القابضة للمساهمات
والاستثمارات *«11010108-:2م1]»* للمر اقبة المشتركة على شركة
‎«CMGP Group Sa»‏ وذلك عبراقتناء نسبة14,81 96 من أسيم
رأسمالها وحقوق التصويت المرتبطة به.

Latin words in the results are often misrecognized, as in the bold span 
above. Is there any solution to this issue?

