Re: [tesseract-ocr] Re: New release for tessdata_{fast,best}?

Tom Morris Fri, 19 Feb 2021 09:28:01 -0800

Hi Merlijn,

Apologies for the delayed reply. I'll definitely be in touch about the 
results of your OCR comparison study, but I'd encourage you to release it 
publicly. One good way to give back to the open source community that the 
Internet Archive takes advantage of is to share knowledge and code openly. 
I know that can be a challenge given the historical culture there, but it'd 
be nice to see that change.


As for not digitizing duplicates, I think there might be a lot more than 
you suspect. I did a study 7 years ago for a library that wanted to link 
public domain "classics" to their library catalog and found a large number 
of duplicate scans for these works, with a wide range of OCR quality 
scores. At the extreme, there's "The Pilgrim's Progress" with 121 scans 
that have OCR average character confidence scores ranging from the mid-50s 
to 94.68. Some of these are unique editions (children's versions, etc.), but 
others are duplicate scans of the same, or very closely related, editions 
(e.g. subsequent printings of an edition in later years). The Odyssey, The 
Iliad, Faust, and a few of Shakespeare's works all have over 50 scans (in 
English). For the 243 works that I analyzed, there were a total of 2576 
scans in English (at least that I was able to locate at the time).

The full list of scans that I analyzed is 
here: https://github.com/tfmorris/openlibrary-utils/tree/master/data
and the results are here: 
https://docs.google.com/spreadsheets/d/1MzQCqoyiPCiQTak_tJoWDFMXEc11_jVc909W7RvkDW0/edit?usp=sharing

For leveraging multiple scans to improve quality, my initial thought was to 
extend Ismet Yalniz's work, but it's been years since I looked at the 
literature, so there may have been more advances in the intervening time. 
https://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=982
http://maroo.cs.umass.edu/pub/web/getpdf.php?id=970

Hmm, actually they did a small-scale study of exactly this use case with 
positive results: combining three editions of Wuthering Heights improved on 
the single best OCR score from .885 to .924, and combining three editions of 
Sense and Sensibility improved it from .911 to .954. 
https://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1104
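
To make the idea concrete, here is a minimal Python sketch of word-level 
majority voting across several OCR outputs of the same passage. It is not the 
alignment method from the papers above; it just uses the standard library's 
difflib.SequenceMatcher to align each scan to a chosen pivot scan. The 
function name vote_words and the sample strings are made up for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def vote_words(pivot, others):
    """Majority-vote each pivot word against the words aligned to it."""
    # One ballot list per pivot position, seeded with the pivot's own word.
    ballots = [[word] for word in pivot]
    for other in others:
        matcher = SequenceMatcher(a=pivot, b=other, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only use 'equal' runs and same-length 'replace' runs, where the
            # word-by-word correspondence is unambiguous. Insertions and
            # deletions are ignored in this sketch.
            if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
                for offset in range(i2 - i1):
                    ballots[i1 + offset].append(other[j1 + offset])
    # Counter orders equal counts by first insertion, so the pivot wins ties.
    return [Counter(b).most_common(1)[0][0] for b in ballots]


# Hypothetical OCR outputs of the same line from three different scans.
scan_a = "the qnick brown fox jumps".split()
scan_b = "the quick brown fox jnmps".split()
scan_c = "tbe quick brown fox jumps".split()

print(" ".join(vote_words(scan_a, [scan_b, scan_c])))
# prints "the quick brown fox jumps"
```

Real editions differ in pagination, front matter, and sometimes wording, so 
an actual pipeline would need a far more robust alignment step than this.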

The papers citing that work would be a good starting point for 
investigation of the current state of the art:
https://scholar.google.com/scholar?cites=12541060738867624162&as_sdt=40000005&sciodt=0,22&hl=en

The ICDAR 2019 Post-OCR Text Correction competition might also be worth 
reviewing: 
https://hal.archives-ouvertes.fr/hal-02304334/document

Good luck! It's great to see someone working on improving the quality of 
Internet Archive OCR.

Best,
Tom

On Monday, February 1, 2021 at 8:50:57 PM UTC-5 Merlijn Wajer wrote:

> Hi Tom, 
>
> On 30/01/2021 21:25, Tom Morris wrote: 
> > On Wednesday, January 27, 2021 at 5:28:27 AM UTC-5 Merlijn Wajer wrote: 
> > 
> > 
> > The Internet Archive has switched to using Tesseract for all our OCR, 
> > 
> > 
> > That's great to hear! It's certainly been a long time coming. Nick White 
> > & I tried to get this to happen 7 years ago and even volunteered to 
> > help, but were ignored. 
> > 
> https://archive.org/post/1010389/using-tesseract-to-improve-ocr-for-some-languages
>  
>
> I've been working with the Internet Archive for only a couple of years, 
> and mostly worked on other parts of the digitisation efforts - I wasn't 
> aware of that thread. Sorry to see that it wasn't picked up. 
>
> > 
> > and I'm hoping that we can record exactly what version of language 
> > files 
> > was used for a specific OCR job. 
> > 
> > 
> > Yes, provenance of the OCR'd text and the software used to derive it 
> > would be very valuable. 
>
> Agreed. 
>
> > Did you do any type of quality / performance comparative study as part 
> > of the switch or evaluation leading up to it? Can you share the results? 
>
> We did internally compare Abbyy and Tesseract results on some books and 
> microfilm. We found the results to be mostly similar, some parts a 
> little better, others a little worse. In particular, I believe for 
> newspaper segmentation there are some areas that can be improved with 
> Tesseract (even though the current state is quite good already), but the 
> recognition engine came out quite strong. 
>
> I am happy to share more details of our evaluation off list - please 
> drop me an email if you're interested. 
>
> > Will you be reprocessing the backlog of books which were originally done 
> > with ABBYY? As I mention in that thread from 7 years ago, there's a 
> > subset which, anecdotally, looks like it might have been processed using 
> > ABBYY "fast" mode, accounting for extra low quality output. These would 
> > be especially useful for reprocessing. 
>
> Do you have some specific collections in mind? I believe that we 
> currently do not have a lot of computational capacity to spare, but 
> could definitely target specific collections. 
>
> > Are you looking at any higher level processing (e.g. voting / merging 
> > results from multiple scans/editions) to improve the raw quality 
> > further? 
>
> That is an interesting idea. I do know that we usually do not digitise 
> duplicates, as digitisation is a relatively costly process. That said, 
> there are likely still plenty of duplicates to be found which could make 
> this technique something we could try. Did you have any particular 
> technique in mind? 
>
> Cheers, 
> Merlijn 
>
> PS: my invitation to share more details applies to others on this list 
> too. 
> We also have a blog post up detailing some of the work and thanking the 
> open source community: 
> https://blog.archive.org/2020/11/23/foss-wins-again-free-and-open-source-communities-comes-through-on-19th-century-newspapers-and-books-and-periodicals/
>
> We also have a (Slack) channel (not a mailing list, sorry) for OCR 
> discussion, in case some of you are interested in helping out one way or 
> another (drop me an email and I can try to get you set up). 
>
