On Wednesday, January 27, 2021 at 5:28:27 AM UTC-5 Merlijn Wajer wrote:
>
> The Internet Archive has switched to using Tesseract for all our OCR,

That's great to hear! It's certainly been a long time coming. Nick White & I tried to get this to happen 7 years ago and even volunteered to help, but were ignored.
https://archive.org/post/1010389/using-tesseract-to-improve-ocr-for-some-languages

> and I'm hoping that we can record exactly what version of language files
> was used for a specific OCR job.

Yes, provenance of the OCR'd text and the software used to derive it would be very valuable.

Did you do any type of quality / performance comparative study as part of the switch or evaluation leading up to it? Can you share the results?

Will you be reprocessing the backlog of books which were originally done with ABBYY? As I mentioned in that thread from 7 years ago, there's a subset which, anecdotally, looks like it might have been processed using ABBYY "fast" mode, which would account for the extra-low-quality output. These would be especially useful to reprocess.

Are you looking at any higher level processing (e.g. voting / merging results from multiple scans/editions) to improve the raw quality further?

Tom