Re: [tesseract-ocr] Diagnosing and fixing poor precision on mixed Greek-English text

Ben Crowell Mon, 10 May 2021 15:20:52 -0700

I compiled tesseract from source, which gave me 
version 5.0.0-alpha-20210401-102-g4374, and used the latest grc.traineddata 
file. To get a measure of what's going on, I decided to count the number of 
Greek words rendered as Greek in the first 7 lines of this text, which 
contain 22 actual Greek words.


tesseract 4.1.1, eng+grc -- 14% correct

tesseract 5.0.0 on my machine, eng+grc -- 41% correct

tesseract 5.0.0 on my machine, eng+ell -- 68% correct

tesseract 5.0.0 on archive.org -- 55% correct
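
A count like this can be roughly automated. The sketch below is only a proxy for the hand count above: it counts OCR tokens made up entirely of Greek-script letters and divides by the number of Greek words known to be in the passage (22 here). It does not check whether a word is spelled correctly. The function names and the `run_tesseract` helper are hypothetical, and the helper assumes a `tesseract` binary on the PATH.

```python
import re
import subprocess
import unicodedata


def is_greek_word(token: str) -> bool:
    """True if the token has letters and all of them are Greek-script letters."""
    letters = [c for c in token if c.isalpha()]
    return bool(letters) and all(
        "GREEK" in unicodedata.name(c, "") for c in letters
    )


def count_greek_words(ocr_text: str) -> int:
    """Count tokens in the OCR output that consist only of Greek letters."""
    # NFKC folds look-alikes such as the micro sign (U+00B5) into Greek mu.
    normalized = unicodedata.normalize("NFKC", ocr_text)
    tokens = re.findall(r"[^\W\d_]+", normalized)
    return sum(1 for t in tokens if is_greek_word(t))


def greek_word_rate(ocr_text: str, expected_greek_words: int) -> float:
    """Fraction of the expected Greek words that came out in Greek script."""
    return count_greek_words(ocr_text) / expected_greek_words


def run_tesseract(image_path: str, langs: str) -> str:
    """Hypothetical helper: OCR an image and return the text from stdout."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", langs],
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout
```

For example, `greek_word_rate(run_tesseract("page.png", "eng+grc"), 22)` would give a figure comparable to the ones above, where `page.png` is a placeholder file name. Since Latin look-alikes such as "Movca" are counted as misses, the number tracks script confusion rather than character accuracy.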

Several things are similar in your results and mine. The incorrect scanning 
of ἱερον when surrounded by English words no longer seems to occur in 
5.0.0. The word μοι is usually rendered incorrectly, but this may be 
because there seems to be broken type that causes the descender on the mu 
to be omitted. Μουσα is read incorrectly as Movca, which is probably 
because this personification of the Muse isn't in the dictionary.

One thing that I hadn't noticed previously is that the accentuation in this 
text is weird. Although the 18th-century typesetter included the breathing 
marks, which aren't used in modern Greek, they left out all the acute, 
grave, and circumflex accents, which would usually have been included in a 
modern typesetting of an ancient Greek text. So it may be that the 
dictionary for grc is more appropriate, but the character recognition for 
ell is better here. I think this can be tested by typesetting the same 7 
lines with and without accents.
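
For the test text, the unaccented variant could be generated from a fully accented transcription rather than typed by hand. This is a minimal sketch, assuming the input is Unicode polytonic Greek: it removes the acute, grave, and circumflex (perispomeni) marks but keeps the breathings, which matches the convention described above. The function name is made up for illustration.

```python
import unicodedata

# Combining marks to drop: grave, acute, and perispomeni (Greek circumflex).
ACCENTS = {"\u0300", "\u0301", "\u0342"}


def strip_accents_keep_breathings(text: str) -> str:
    """Remove acute, grave, and circumflex accents; keep breathing marks."""
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if c not in ACCENTS)
    # NFC recombines the base letters with whatever marks are left.
    return unicodedata.normalize("NFC", kept)
```

Other marks, such as the diaeresis and the iota subscript, pass through unchanged. Both variants can then be typeset in the same font and run through tesseract with grc and ell to separate the effect of the dictionary from that of the character recognition.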

On Monday, May 10, 2021 at 7:34:34 AM UTC-7 Merlijn Wajer wrote:

> Hi Ben,
>
> On 10/05/2021 15:09, Ben Crowell wrote:
> > Hi Merlijn,
> > 
> > Thanks very much for your reply. It's encouraging that you were able to
> > get somewhat better results. However, I'm not able to reproduce them.
> > When I use -l eng+ell, the results are still very poor:
> > 
> > 1. Evverre declare wot to me, Movca Muse,
> > avopa the man voAvtpotrov of many fortunes,
> > ὁς Νο πλαγχθη παπἀρτεάἁ µαλα πολλα very
> > much, eves when ewepoev he had destroyed
> > i d city T { Troy:
> > lepov troAscOpor the sacred city Tons of Troy :
> > we Se and saw aorea towns «at and eyvo
> > learnt vooy the mood πολλων ανθρωπων οἳ
> > 
> > The text uses ancient Greek vocabulary and accentuation, so it actually 
> > makes sense to use grc, not ell.
>
> Ah, my bad.
>
> > 
> > I didn't understand what you meant by "using the Archive.org Tesseract
> > stack," but a web search on your name led me to archive-pdf-tools, which
> > you're the author of. It's great to have help from someone who's clearly
> > very expert. I just don't know how to diagnose what is different between
> > your setup and mine. It looks like you did the whole first page rather
> > than the piece I posted, so there may be a difference in how the image
> > was prepared. I just zoomed in on the archive.org page, took a
> > screenshot, cropped it, and changed it to grayscale. I'm running
> > tesseract 4.1.1, which seems to be the latest official release. Are you
> > running a version compiled from the latest source or something? My file
> > /usr/share/tesseract-ocr/4.00/tessdata/grc.traineddata, which came from
> > installing the Debian package tesseract-ocr-grc, is dated 2017, which
> > seems old, and is 2.2 MB. The version at
> > https://github.com/tesseract-ocr/tessdata is 7 MB and looks like it was
> > changed around 2018. I could try just replacing the file with the newer
> > version, but I have no idea whether that's a reasonable thing to do,
> > since I don't know anything about how the software is designed.
>
> "using the Archive.org Tesseract stack" means that archive.org will
> automatically run Tesseract OCR on uploaded content and make those
> results available (so you can compare with your local results). Because
> this book predates the integration of Tesseract, I submitted the content
> for re-OCRing, using Tesseract, in an attempt to reproduce your results.
>
> I'm rerunning the item now with Ancient Greek "grc" as opposed to Greek
> "ell".
>
> The version that is being used is Tesseract "5.0.0-alpha-20201231" [1],
> and the language packs are the latest ones from Git, I believe. Maybe it
> would be worth giving the latest version a shot to see if it yields
> better results. There is an Ubuntu PPA [2] with development
> snapshots/versions. Then, if the latest version still gives
> unsatisfying results, it would be worth investigating why.
>
>
> Hope this helps,
> Cheers,
> Merlijn
>
> [1] https://github.com/tesseract-ocr/tesseract/releases/tag/5.0.0-alpha-20201231
> [2] http://ppa.launchpad.net/alex-p/tesseract-ocr-devel
>
