with the help of this webpage:
https://towardsdatascience.com/simple-ocr-with-tesseract-a4341e4564b6
i did manage to produce xyz.traineddata - with an enormous amount of 
improvisation, trial & error, stubbornness and blind LMB-clicking, 
including another 2 failed attempts

i ran

$ tesseract -l xyz list.txt a.new.txt

and got catastrophic ocr results, far worse than with plain 
eng.traineddata, which actually did a fairly good job after all: all 
english text is ocr-ed correctly, and the italic transliteration text 
is also ocr-ed well up to a point, with the exception of the characters 
listed in my previous message, quoted below (those that are not in the 
english latin alphabet)

oh well, i guess fixing those _manually_ is the way to go ...

but if somebody knows how to improve the ocr to the point where those 
dotted characters are also recognized, it would make this world a much 
better place
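
for reference, a minimal python sketch that spells the dotted letters out 
as unicode strings, e.g. to build a training text or to check ocr output. 
assumptions: the dots are the combining marks U+0323 (dot below) and 
U+0307 (dot above), and every base letter is paired with both dots, which 
the dictionary may not actually do; the names BASE_LETTERS, 
dotted_variants and EXTRA are made up for this example, and the "semi 
question mark" is left out because its code point is not known here.

```python
import unicodedata

# Base letters from the list in the quoted message (hypothetical constant name).
BASE_LETTERS = "gzdhrtsl"
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE
EXTRA = ["š", "ž"]    # the two caron letters named in the quoted message


def dotted_variants(letters=BASE_LETTERS):
    """Return each letter with a dot below and with a dot above, NFC-normalized.

    NFC yields a single precomposed code point where Unicode has one and
    otherwise keeps the base letter followed by the combining mark.
    """
    variants = []
    for ch in letters:
        for mark in (DOT_BELOW, DOT_ABOVE):
            variants.append(unicodedata.normalize("NFC", ch + mark))
    return variants


if __name__ == "__main__":
    # Print every candidate with the Unicode names of its code points.
    for s in dotted_variants() + EXTRA:
        names = " + ".join(unicodedata.name(c) for c in s)
        print(s, "=", names)
```

strings that stay two code points long after normalization have no 
precomposed form, so a training text and the ocr output would both need 
to use the same normalization to be comparable.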

have fun

aum

On Thursday, March 28, 2024 at 2:45:39 PM UTC aum hren wrote:

> olo company
>
> i am trying to ocr an old (1963) moroccan arabic - english dictionary
>
> i have tried jTessBoxEditor for training and somehow managed to follow 
> the info on the net,
> but at the very end tesseract failed to produce the final _traineddata_ 
> file
>
> my problem is:
> the book (dictionary) is basically in english, so i used the eng file 
> for ocr-ing,
> but there is also transliteration text, which includes characters that 
> are not used in english, although they are latin script
> i tried to train tesseract for those characters, but failed,
> e.g. by following this link:
>
> https://www.youtube.com/watch?v=8GdcyknL1ls
>
> the other info i could find is also a bit confusing
>
> the characters i was trying to train are letters
>
> g z d h r t s l - with dots below and above, plus
> š ž and a weird semi question mark
>
> transliteration script is also _italic_
>
> with the help of libre office writer and some trial & error i also 
> managed to identify (a close approximation of) the transliteration font 
> (Latin Modern Roman Unslanted)
>
> can somebody versed in tesseract-ocr training help me train (or do the 
> ocr) for those letters/characters?
>
> attached are:
> - my train script / font image (font - latin modern roman unslanted)
> - a page from the dictionary which includes most of the characters i am 
> trying to ocr
>
> the dictionary has 500+ pages: half is eng - moroccan arabic, the other 
> half is moroccan arabic - eng, so help with proper ocr would be truly 
> appreciated
>
> thank you for your help
>
> have fun
>
> aum
