On Wednesday, January 27, 2021 at 5:28:27 AM UTC-5 Merlijn Wajer wrote:

>
> The Internet Archive has switched to using Tesseract for all our OCR,


That's great to hear! It's certainly been a long time coming. Nick White & 
I tried to get this to happen 7 years ago and even volunteered to help, but 
were ignored.
https://archive.org/post/1010389/using-tesseract-to-improve-ocr-for-some-languages
 

> and I'm hoping that we can record exactly what version of language files 
> was used for a specific OCR job.


Yes, provenance of the OCR'd text and the software used to derive it would 
be very valuable.

Did you do any comparative quality or performance study as part of 
the switch or the evaluation leading up to it? If so, can you share the results?

Will you be reprocessing the backlog of books which were originally done 
with ABBYY? As I mentioned in that thread from 7 years ago, there's a subset 
which, anecdotally, looks like it might have been processed using ABBYY 
"fast" mode, which would account for its especially low-quality output. These 
would be particularly good candidates for reprocessing.

Are you looking at any higher level processing (e.g. voting / merging 
results from multiple scans/editions) to improve the raw quality further?

Tom
