Re: [tesseract-ocr] Diagnosing and fixing poor precision on mixed Greek-English text

Ben Crowell Mon, 10 May 2021 06:30:01 -0700

I tried replacing the grc.traineddata file with the newer version, and the 
software still ran, but the results were identical. From the comments on 
git, it looks like the newer version is just optimized for speed.
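
For anyone who wants to check a swap like this more carefully than by eyeballing the output, here is a rough Python sketch. It is my own illustration, not something from the tesseract project. It builds the command line with --tessdata-dir so that two copies of the model files can be tried side by side, diffs the two outputs, and counts how many tokens came out in Greek versus Latin script. The helper names are made up for this example, and run_ocr assumes the tesseract binary is on the PATH.

```python
import difflib
import subprocess
import unicodedata
from collections import Counter


def build_command(image, langs, tessdata_dir=None):
    """Build a tesseract command line that prints the OCR text to stdout."""
    cmd = ["tesseract", image, "stdout", "-l", "+".join(langs)]
    if tessdata_dir is not None:
        # --tessdata-dir makes tesseract load its .traineddata files from
        # this directory instead of the system default.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image, langs, tessdata_dir=None):
    """Run tesseract and return the recognized text (needs tesseract installed)."""
    result = subprocess.run(
        build_command(image, langs, tessdata_dir),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def diff_outputs(old_text, new_text):
    """Return unified-diff lines for two OCR outputs; empty list if identical."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="old", tofile="new", lineterm="",
    ))


def token_script(token):
    """Classify a token as 'greek', 'latin', 'mixed', or 'other' by its letters."""
    scripts = set()
    # NFKC folds look-alikes such as the micro sign into the Greek letter mu.
    for ch in unicodedata.normalize("NFKC", token):
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("GREEK"):
            scripts.add("greek")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if len(scripts) == 1:
        return scripts.pop()
    return "mixed" if scripts else "other"


def script_counts(text):
    """Count whitespace-separated tokens per script class."""
    return Counter(token_script(tok) for tok in text.split())
```

Usage would be something like diff_outputs(run_ocr("a.jpg", ["eng", "grc"], "tessdata_old"), run_ocr("a.jpg", ["eng", "grc"], "tessdata_new")), where the two directory names are hypothetical and each directory has to contain both eng.traineddata and grc.traineddata. An empty diff confirms that the outputs really are identical, and script_counts on each output gives a crude measure of how much of the text was emitted in Greek script at all.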


On Monday, May 10, 2021 at 6:09:02 AM UTC-7 Ben Crowell wrote:

> Hi Merlijn,
>
> Thanks very much for your reply. It's encouraging that you were able to 
> get somewhat better results. However, I'm not able to reproduce them. When 
> I use -l eng+ell, the results are still very poor:
>
> 1. Evverre declare wot to me, Movca Muse,
> avopa the man voAvtpotrov of many fortunes,
> ὁς Νο πλαγχθη παπἀρτεάἁ µαλα πολλα very
> much, eves when ewepoev he had destroyed
> i d city T { Troy:
> lepov troAscOpor the sacred city Tons of Troy :
> we Se and saw aorea towns «at and eyvo
> learnt vooy the mood πολλων ανθρωπων οἳ
>
> The text uses ancient Greek vocabulary and accentuation, so it actually 
> makes sense to use grc, not ell.
>
> I didn't understand what you meant by "using the Archive.org Tesseract 
> stack," but a web search on your name led me to archive-pdf-tools, which 
> you're the author of. It's great to have help from someone who's clearly 
> very expert. I just don't know how to diagnose what is different between 
> your setup and mine. It looks like you did the whole first page rather than 
> the piece I posted, so there may be a difference in how the image was 
> prepared. I just zoomed in on the archive.org page, took a screenshot, 
> cropped it, and changed it to grayscale. I'm running tesseract 4.1.1, which 
> seems to be the latest official release. Are you running a version compiled 
> from the latest source or something? My file 
> /usr/share/tesseract-ocr/4.00/tessdata/grc.traineddata, which came 
> from installing the Debian package tesseract-ocr-grc, is dated 2017, which 
> seems old, and is 2.2 MB. The version at 
> https://github.com/tesseract-ocr/tessdata is 7 MB and looks like it was 
> changed around 2018. I could try just replacing the file with the newer 
> version, but I have no idea whether that's a reasonable thing to do, since 
> I don't know anything about how the software is designed.
>
> On Monday, May 10, 2021 at 3:39:09 AM UTC-7 Merlijn Wajer wrote:
>
>> Hi Ben, 
>>
>> On 09/05/2021 21:33, Ben Crowell wrote: 
>> > I'm trying to OCR a book that is written in interspersed Greek and English: 
>> > 
>> > https://archive.org/details/odysseyofhomerco01gile/page/n5/mode/2up 
>> > 
>> > Here is a sample of text from the first page: 
>> > 
>> > [image: a.jpg] 
>> > 
>> > I'm running tesseract 4.1.1 on Linux, with the tesseract-ocr-grc package 
>> > installed. Here's the command I'm using to OCR this sample: 
>> > 
>> > tesseract a.jpg temp -l eng+grc 
>> > 
>> > Here is the result: 
>> > 
>> > 1. Evverre declare wot to me, Movca Muse, 
>> > ανδρα the man voAvtpotrov of many fortunes, 
>> > os who wAayyx@n wandered μαλα πολλα very 
>> > much, eves when ewepoev he had destroyed 
>> > i d city T { Troy: 
>> > lepov troAscOpor the sacred city Tons of Troy : 
>> > we Se and saw aorea towns «at and eyvo 
>> > learnt vooy the mood roAAwy avOpwror of 
>> > 
>> > Basically it almost never recognizes Greek as Greek, and instead tries to 
>> > read it as English 95% of the time. Here is what I get if I just tell 
>> > tesseract to treat it as Greek: 
>> > 
>> > 1. ἔννεπε ἀδοίατο μοι ἴο 1π0, ἴἥουσα δίαΞο, 
>> > ανδρα {11 τπᾶπι πολύτροπον οἱ ΤΩΔ}Υ ἰοτέιπο5, 
>> > ὃς ψ|ὸ πλαγχθὴ παπάθιοα μαλα πολλα νετῦ 
>> > τω ποἢ, ἐπὲῦ ΠῸπ ἐπέρσεν ᾿ἰ6 πα ἀεβίτογοά 
>> > ; ἀο Τ {Ττου: 
>> > ἱερον πτολίεθρον [116 ΞΔογβα οἷἵγ Τροιης οἷ ΤτοΥ : 
>> > ἰδὲ δε ἀμ 5 αστεῶ ἰο 8 καὶ ἃπά εγνω 
>> > Ἰρατηῦ νοὸν {πῸ Ἰηοοὰ πολλων ανθρωπὼν οἵ 
>> > 
>> > This seems odd to me. Although it still makes some errors, such as reading 
>> > Μουσα as ἴἥουσα on the first line, it now gets the common word ἱερον 
>> > (holy) correct, whereas in the original attempt, it rendered it as lepov, 
>> > which is not a word in either language. If it's capable of correctly 
>> > interpreting ἱερον, which is presumably in its dictionary, then I don't 
>> > understand why, when I use eng+grc, it doesn't get it right. 
>> > 
>> > I tried cropping this sample so it was only the single word: 
>> > 
>> > [image: aa.jpg] 
>> > 
>> > When I read this using -l eng+grc, it gets it right. So it seems as though 
>> > it's perfectly capable of both recognizing this word as Greek and properly 
>> > OCRing it, but somehow it's reluctant to do so when some of the surrounding 
>> > text is in English. 
>> > 
>> > So in summary, although there are some errors that may have to do with 
>> > image quality or not being trained on this font, there is also some other 
>> > kind of problem where tesseract doesn't like to "switch gears" from one 
>> > language to the other. 
>> > 
>> > Can anyone help with diagnosing and/or fixing this problem? 
>> > 
>> > Could the issue have anything to do with the fact that the Latin letters 
>> > are upright, while the Greek ones are in a slanted/italic font? Does the 
>> > neural network have a preference for English because the English corpus it 
>> > was trained on was so huge compared to the Greek one? 
>>
>> I took the liberty of re-running OCR for that item using the Archive.org 
>> Tesseract stack (also providing Greek as a language). This is the result 
>> for the quoted paragraph. It's not perfect, but I think it's better than 
>> what you are seeing: 
>>
>> > i SOMERS ODYSSEY. 
>> > 
>> > 
>> > BOOK I. 
>> > 
>> > 
>> > 1. Έννεπε declare µοι to me, Movca Muse, 
>> > avdpa the man πολυτροπον of many fortunes, 
>> > os who πλαγχθη wandered pada πολλα very 
>> > much, eves when επερσεν he had destroyed 
>> > ἱερον πτολιεθρον the sacred city Τροιης of Troy : 
>> > we δε and saw αστεα towns και and εγνω 
>> > learnt vooy the mood πολλων ανθρωπων of 
>> > many men, πολλα δε αλγεα but many sorrows 
>> > oye he indeed παθε suffered ὁν κατα θυµον in 
>> > his soul, apyvper'os whilst grasping ἦν τε Wyn? 
>> > both his own life και and νοστον the return erat. 
>> > pov of his companions. Adda but ουδε not even 
>> > ὡς thus ερρυσατο did he save έταρους his com- 
>> > panions, iewevos περ though bent upon it: 
>> > ολοντο yap for they perished σφετερησιν ατασ- 
>> > σθαλιῃσι by their own phrensies, νηπιοι fools, 
>> > οἱ who κατα ησθιον ate up βους the oxen 
>> > Heduovo of the Sun ὝὙπεριονος who rolls above 
>> > Us : autap but ὁ he αφειλετο took away Tors 
>>
>> I wonder if the problem you were seeing was related to using Ancient 
>> Greek (grc) as opposed to Greek (ell)? These are the parameters that 
>> were used just now: 
>>
>> > ocr_parameters -l eng+ell 
>>
>> Hope this helps. 
>>
>> Cheers, 
>> Merlijn 
>>
>
