Re: [tesseract-ocr] Diagnosing and fixing poor precision on mixed Greek-English text

Ben Crowell Wed, 12 May 2021 17:34:17 -0700

I made some efforts to improve the performance of tesseract on this text. I 
made an English dictionary consisting only of words used in a collection of 
7 English translations of Homer, so that the dictionary includes words 
like Acheloüs but doesn't include words like the (German? Dutch?) name Kai, 
which was being used as a reading for the common Greek word και. I made a 
Greek dictionary consisting only of the words that actually occur in the 
Odyssey, with all acute, grave, and circumflex accents removed as in the 
text I'm trying to scan. So for instance, this custom dictionary contains 
πολλων, the form in the text, not πολλῶν. I also trained tesseract a little 
bit on the Greek font used in this book, although I don't know if the 
amount of text I provided was enough.
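
As an illustration of the dictionary preparation described above, here is a minimal sketch of stripping only the acute, grave, and circumflex accents from polytonic Greek while keeping breathings. The function names strip_accents and build_wordlist are hypothetical, and the choice of which combining marks to keep is an assumption based on the description of the printed text, not the script actually used.

```python
import unicodedata

# Combining marks to drop: grave, acute, and the Greek circumflex (perispomeni).
# Breathings (U+0313, U+0314), diaeresis (U+0308), and iota subscript (U+0345)
# are deliberately kept, on the assumption that the printed text retains them.
ACCENTS_TO_DROP = {"\u0300", "\u0301", "\u0342"}


def strip_accents(word: str) -> str:
    """Remove acute, grave, and circumflex accents; keep everything else."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in ACCENTS_TO_DROP)
    # Recompose so that letter + breathing becomes a single code point again.
    return unicodedata.normalize("NFC", kept)


def build_wordlist(words):
    """Return the sorted, de-duplicated list of accent-stripped word forms."""
    return sorted({strip_accents(w) for w in words})
```

With this, strip_accents("πολλῶν") gives the form πολλων that appears in the scanned text, and the resulting list could be written out one word per line for use as a custom word list.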


After this specialized fine-tuning, the accuracy is still not at all 
acceptable. The result on the above passage looks like this:

1. Εννεπε declare wot to me, Μουσα Muse,
ανδρα the man πολυτροποιν of many fortunes,
os who πλαγχθη wandered μαλα πολλα very
much, επεν when ezepoev he had destroyed
i d city T { Troy:
ἱερον πτολιεθρον the sacred city Τροιης of Troy :
we δε and saw αστεα towns «at and eyvo
learnt vooy the mood πολλων ανιθρωπων of

Only 68% of Greek words are correctly recognized as Greek, and even of 
those, some are misread. Extremely common words like μοι, ὁς, and και are 
not recognized, although they are mostly recognized when I OCR the text 
with the language set only to Greek. So as far as I can tell, tesseract 
just can't really do this kind of bilingual text with a non-Latin font. Of 
course, there could be something I'm not understanding that would improve 
things.
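
For anyone wanting to reproduce this kind of percentage, here is a rough sketch of one way to count how many words in the OCR output came out in Greek script. It is an assumption about the counting method, not necessarily how the figures in this thread were obtained; the names is_greek_word and greek_recognition_rate are hypothetical, and the expected number of Greek words has to be supplied by hand. It measures only whether a word was rendered in Greek letters, not whether it was read correctly.

```python
import unicodedata


def is_greek_word(token: str) -> bool:
    """True if the token has letters and every letter is a Greek-script letter."""
    letters = [ch for ch in token if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("GREEK") for ch in letters
    )


def greek_recognition_rate(ocr_text: str, expected_greek_words: int) -> float:
    """Fraction of the expected Greek words that appear in Greek script."""
    found = sum(1 for token in ocr_text.split() if is_greek_word(token))
    return found / expected_greek_words
```

One thing this check exposes is look-alike characters: a word beginning with the micro sign µ (U+00B5) rather than the Greek letter μ is not counted as Greek, since the micro sign is not in the Greek block.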

From descriptions I've read, it seems that tesseract's neural network is 
designed to try to scan large blocks of text at once, not just individual 
words. I suspect that this makes it unwilling to read Greek as Greek when 
it's surrounded by English. This would help to explain why it reads ὁς 
correctly when in Greek-only mode, but when in English+Greek mode, it reads 
it as os, which isn't even a word in the English dictionary I'm using.
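
The Greek-only versus English+Greek comparison can be scripted. The following is a minimal sketch that runs the tesseract command-line program on the same image with different -l settings; the helper names and the file name page.png are hypothetical, and it assumes tesseract and the stock eng and grc language data are installed.

```python
import subprocess


def tesseract_command(image_path, languages):
    """Build the argument list for one tesseract run.

    Using "stdout" as the output base makes tesseract print the text
    instead of writing a file; languages are joined with "+" for -l.
    """
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def run_ocr(image_path, languages):
    """Run tesseract and return the recognized text (needs tesseract installed)."""
    result = subprocess.run(
        tesseract_command(image_path, languages),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Compare the two modes on the same (hypothetical) image file.
    for langs in (["grc"], ["eng", "grc"]):
        print("+".join(langs))
        print(run_ocr("page.png", langs))
```

Looking at the two outputs side by side shows which Greek words are lost only when English is also enabled.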

Training it on the book's Greek font may have done as much harm as good. It 
gets words like Μουσα right, which it got wrong before, but it makes errors 
on words like πολυτροπον and ανθρωπων, spelling them as πολυτροποιν and 
ανιθρωπων.

On Monday, May 10, 2021 at 4:42:12 PM UTC-7 Ben Crowell wrote:

> Here is a version of the text that I typeset using xelatex, with the 
> font DejaVu Serif. It has all the accents, which should make it a good 
> typographical match to the data that tesseract was trained on to make the 
> grc file.
> [image: tex_output.png]
> Here is the result:
>
> Ἔννεπε declare pot to me, Movoa Muse,
>
> ἄνδρα the man πολύτροπον of many fortunes,
> oc who πλάγχθη wandered μάλα πολλὰ very
> much, ἐπεὶ when émepoe he had destroyed
> ἱερὸν πτολίεθρον the sacred city Τροίης of Troy:
> ἴδε δε and saw ἄστεα towns Kai and ἔγνω
> learnt voov the mood πολλῶν ἀνθρώπων of
>
> Now 73% of Greek words are recognized as Greek. So this is quite a bit 
> better, but still fairly poor. It seems really odd to me that tesseract is 
> not getting the common words μοι, ὃς, and καὶ. For comparison, it would be as 
> if tesseract were OCRing an English text and could not read "me," 
> "who," and "and."
> On Monday, May 10, 2021 at 3:20:47 PM UTC-7 Ben Crowell wrote:
>
>> I compiled tesseract from source, which gave me 
>> version 5.0.0-alpha-20210401-102-g4374, and used the latest grc.traineddata 
>> file. To get a measure of what's going on, I decided to count the number of 
>> Greek words rendered as Greek in the first 7 lines of this text, which 
>> contain 22 actual Greek words.
>>
>> tesseract 4.1.1, eng+grc -- 14% correct
>>
>> tesseract 5.0.0 on my machine, eng+grc -- 41% correct
>>
>> tesseract 5.0.0 on my machine, eng+ell -- 68% correct
>>
>> tesseract 5.0.0 on archive.org -- 55% correct
>>
>> Several things are similar in your results and mine. The incorrect 
>> scanning of ἱερον when surrounded by English words no longer seems to occur 
>> in 5.0.0. The word μοι is usually rendered incorrectly, but this may be 
>> because there seems to be broken type that causes the descender on the mu 
>> to be omitted. Μουσα is read incorrectly as Movca, which is probably 
>> because this personification of the Muse isn't in the dictionary.
>>
>> One thing that I hadn't noticed previously is that the accentuation in 
>> this text is weird. Although the 18th-century typesetter included the 
>> breathing marks, which aren't used in modern Greek, they left out all the 
>> acute, grave, and circumflex accents, which would usually have been 
>> included in a modern typesetting of an ancient Greek text. So it may be 
>> that the dictionary for grc is more appropriate, but the character 
>> recognition for ell is better here. I think this can be tested by 
>> typesetting the same 7 lines with and without accents.
>> On Monday, May 10, 2021 at 7:34:34 AM UTC-7 Merlijn Wajer wrote:
>>
>>> Hi Ben, 
>>>
>>> On 10/05/2021 15:09, Ben Crowell wrote: 
>>> > Hi Merlijn, 
>>> > 
>>> > Thanks very much for your reply. It's encouraging that you were able
>>> > to get somewhat better results. However, I'm not able to reproduce
>>> > them. When I use -l eng+ell, the results are still very poor:
>>> > 
>>> > 1. Evverre declare wot to me, Movca Muse, 
>>> > avopa the man voAvtpotrov of many fortunes, 
>>> > ὁς Νο πλαγχθη παπἀρτεάἁ µαλα πολλα very 
>>> > much, eves when ewepoev he had destroyed 
>>> > i d city T { Troy: 
>>> > lepov troAscOpor the sacred city Tons of Troy : 
>>> > we Se and saw aorea towns «at and eyvo 
>>> > learnt vooy the mood πολλων ανθρωπων οἳ 
>>> > 
>>> > The text uses ancient Greek vocabulary and accentuation, so it
>>> > actually makes sense to use grc, not ell.
>>>
>>> Ah, my bad. 
>>>
>>> > 
>>> > I didn't understand what you meant by "using the Archive.org Tesseract
>>> > stack," but a web search on your name led me to archive-pdf-tools,
>>> > which you're the author of. It's great to have help from someone who's
>>> > clearly very expert. I just don't know how to diagnose what is
>>> > different between your setup and mine. It looks like you did the whole
>>> > first page rather than the piece I posted, so there may be a difference
>>> > in how the image was prepared. I just zoomed in on the archive.org
>>> > page, took a screenshot, cropped it, and changed it to grayscale. I'm
>>> > running tesseract 4.1.1, which seems to be the latest official release.
>>> > Are you running a version compiled from the latest source or something?
>>> > My file /usr/share/tesseract-ocr/4.00/tessdata/grc.traineddata, which
>>> > came from installing the Debian package tesseract-ocr-grc, is dated
>>> > 2017, which seems old, and is 2.2 MB. The version at
>>> > https://github.com/tesseract-ocr/tessdata is 7 MB and looks like it was
>>> > changed around 2018. I could try just replacing the file with the newer
>>> > version, but I have no idea whether that's a reasonable thing to do,
>>> > since I don't know anything about how the software is designed.
>>>
>>> "using the Archive.org Tesseract stack" means that archive.org will 
>>> automatically run Tesseract OCR on uploaded content and make those 
>>> results available (so you can compare with your local results). Because 
>>> this book predates the integration of Tesseract, I submitted the content 
>>> for re-OCRing, using Tesseract, in an attempt to reproduce your results. 
>>>
>>> I'm rerunning the item now with Ancient Greek "grc" as opposed to Greek 
>>> "ell". 
>>>
>>> The version that is being used is Tesseract "5.0.0-alpha-20201231" [1], 
>>> the language packs are the latest ones from Git, I believe. Maybe it 
>>> would be worth giving the latest version a shot and see if it yields 
>>> better results. There is an ubuntu ppa [2] with development 
>>> snapshots/versions. Then, if the latest version still results in 
>>> unsatisfying results, it would be worth trying to investigate why? 
>>>
>>>
>>> Hope this helps, 
>>> Cheers, 
>>> Merlijn 
>>>
>>> [1] https://github.com/tesseract-ocr/tesseract/releases/tag/5.0.0-alpha-20201231
>>> [2] http://ppa.launchpad.net/alex-p/tesseract-ocr-devel 
>>>
>>
