Hi Ben,

On 10/05/2021 15:09, Ben Crowell wrote:
> Hi Merlijn,
>
> Thanks very much for your reply. It's encouraging that you were able to get
> somewhat better results. However, I'm not able to reproduce them. When I
> use -l eng+ell, the results are still very poor:
>
> 1. Evverre declare wot to me, Movca Muse,
> avopa the man voAvtpotrov of many fortunes,
> ὁς Νο πλαγχθη παπἀρτεάἁ µαλα πολλα very
> much, eves when ewepoev he had destroyed
> i d city T { Troy:
> lepov troAscOpor the sacred city Tons of Troy :
> we Se and saw aorea towns «at and eyvo
> learnt vooy the mood πολλων ανθρωπων οἳ
>
> The text uses ancient Greek vocabulary and accentuation, so it actually
> makes sense to use grc, not ell.

Ah, my bad.

> I didn't understand what you meant by "using the Archive.org Tesseract
> stack," but a web search on your name led me to archive-pdf-tools, which
> you're the author of. It's great to have help from someone who's clearly
> very expert. I just don't know how to diagnose what is different between
> your setup and mine. It looks like you did the whole first page rather than
> the piece I posted, so there may be a difference in how the image was
> prepared. I just zoomed in on the archive.org page, took a screenshot,
> cropped it, and changed it to grayscale. I'm running tesseract 4.1.1, which
> seems to be the latest official release. Are you running a version compiled
> from the latest source or something? My
> file /usr/share/tesseract-ocr/4.00/tessdata/grc.traineddata , which came
> from installing the debian package tesseract-ocr-grc, is dated 2017, which
> seems old, and is 2.2 Mb. The version
> at https://github.com/tesseract-ocr/tessdata is 7 Mb and looks like it was
> changed around 2018. I could try just replacing the file with the newer
> version, but I have no idea whether that's a reasonable thing to do, since
> I don't know anything about how the software is designed.

"Using the Archive.org Tesseract stack" means that archive.org automatically
runs Tesseract OCR on uploaded content and makes the results available (so
you can compare them with your local results). Because this book predates the
Tesseract integration, I submitted it for re-OCRing with Tesseract, in an
attempt to reproduce your results. I'm rerunning the item now with Ancient
Greek ("grc") instead of Greek ("ell").

The version being used is Tesseract "5.0.0-alpha-20201231" [1]; the language
packs are the latest ones from Git, I believe. It may be worth giving the
latest version a try to see whether it yields better results. There is an
Ubuntu PPA [2] with development snapshots.
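
For comparing the packaged grc.traineddata against a newer copy without overwriting the packaged file, something like the sketch below might help. It only assumes the standard `tesseract` command-line options `-l` and `--tessdata-dir`. The function names, the image name and the directory are hypothetical placeholders, and `grc+eng` is just one guess at a reasonable language combination for mixed Greek/English text.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_tesseract_cmd(image: str, out_base: str, lang: str = "grc",
                        tessdata_dir: Optional[str] = None) -> List[str]:
    """Assemble a tesseract command line (hypothetical helper).

    tesseract appends ".txt" to out_base for plain-text output.
    """
    cmd = ["tesseract", image, out_base, "-l", lang]
    if tessdata_dir is not None:
        # Use a separately downloaded tessdata directory instead of
        # replacing the traineddata file installed by the package.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image: str, out_base: str, lang: str = "grc",
            tessdata_dir: Optional[str] = None) -> str:
    """Run tesseract and return the recognised text (hypothetical helper)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_cmd(image, out_base, lang, tessdata_dir),
                   check=True)
    return Path(out_base + ".txt").read_text(encoding="utf-8")
```

Calling `run_ocr("page.png", "out_packaged", "grc+eng")` and then `run_ocr("page.png", "out_newer", "grc+eng", "/path/to/newer/tessdata")` on the same image would give two text files to compare side by side; `tesseract --version` reports which engine version produced them.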

If the latest version still gives unsatisfying results, it would be worth
investigating why.

Hope this helps,

Cheers,
Merlijn

[1] https://github.com/tesseract-ocr/tesseract/releases/tag/5.0.0-alpha-20201231
[2] http://ppa.launchpad.net/alex-p/tesseract-ocr-devel