Hi,

On 27/01/2021 12:42, Shree Devi Kumar wrote:
>> The Internet Archive has switched to using Tesseract for all our OCR,
>
> I am so happy to hear this. It will be great to have the Indic languages
> that were marked as non-ocrable so far be converted to text correctly on
> Internet Archive.

Right, that should now just work -- and we now also support the Fraktur
script and some more languages (all made possible by the great work on
Tesseract!).

> Is there any page with instructions to do this? Can a language be specified
> while OCRing? eg. Better results are many times received using
> script/Devanagari instead of san for Sanskrit.

We switched over completely in mid-December 2020, and I'm still working
through a feature and documentation backlog, including document discovery.
But in general, if you set the right ISO-639 language (code or name) in the
"language" metadata field, that language should be used exclusively. You can
also set multiple languages. Potentially you could also set a script in the
language field; I must admit I have not tried that yet.

If you omit the language field altogether, the module will figure out which
scripts are being used, then perform OCR using the data packs for the
detected scripts, then perform language analysis on the resulting text, and
finally, through some heuristics, pick the (potentially multiple) languages
it believes the piece is written in and perform a final OCR step using those
languages and their associated scripts. (A repo with this Python
instrumentation code will follow soon.)

I wouldn't mind chatting about this some more, but perhaps (?) off-list
would be a better way to do that - either way is fine by me.

> Regarding your question about tessdata, there have only been minor changes
> to tessdata files but adding a tag is a good idea. I suggest you post this
> as a feature request in the repo.

I've created one: https://github.com/tesseract-ocr/tessdata_fast/issues/26

Cheers,
Merlijn
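
To make the flow described above (explicit "language" field versus
auto-detection) a bit more concrete, here is a rough Python sketch. It is
not the actual module, which has not been published yet: the function names,
the SCRIPT_OF table and the 20% threshold are all made up for illustration.
Only the "tesseract image outputbase -l a+b" command-line shape and model
names such as script/Devanagari come from Tesseract itself.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence

# Illustrative mapping only (assumption); the real module's tables are unknown.
SCRIPT_OF = {
    "san": "script/Devanagari",
    "hin": "script/Devanagari",
    "eng": "script/Latin",
    "deu": "script/Latin",
}


def tesseract_args(image: str, out_base: str, models: Sequence[str]) -> List[str]:
    """Build a tesseract command line; several models are joined with '+'."""
    return ["tesseract", image, out_base, "-l", "+".join(models)]


def pick_languages(votes: Counter, min_share: float = 0.2) -> List[str]:
    """Hypothetical heuristic: keep languages whose vote share reaches min_share."""
    total = sum(votes.values())
    if total == 0:
        return []
    return [lang for lang, n in votes.most_common() if n / total >= min_share]


def choose_models(
    metadata_languages: Optional[Sequence[str]],
    detect_scripts: Callable[[], List[str]],
    guess_languages: Callable[[List[str]], Counter],
) -> List[str]:
    """Return the models to use for the final OCR pass."""
    if metadata_languages:
        # An explicit "language" field wins and is used exclusively.
        return list(metadata_languages)
    scripts = detect_scripts()           # which scripts appear in the item
    votes = guess_languages(scripts)     # first OCR pass with script packs + language analysis
    languages = pick_languages(votes)    # heuristic selection, possibly several languages
    models = list(languages)
    for lang in languages:               # add each picked language's associated script
        script = SCRIPT_OF.get(lang)
        if script and script not in models:
            models.append(script)
    return models or scripts             # nothing picked: fall back to the detected scripts
```

The two detection steps are passed in as callables so the selection logic can
be tried out without Tesseract installed. The result of choose_models() would
then be handed to tesseract_args() to build the final command.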