Hi,

On 27/01/2021 12:42, Shree Devi Kumar wrote:
>> The Internet Archive has switched to using Tesseract for all our OCR,
> 
> I am so happy to hear this. It will be great to have the Indic languages
> that were marked as non-ocrable so far be converted to text correctly on
> Internet Archive.

Right, that should now just work -- and we now also support the Fraktur
script and some more languages (all made possible by the great work on
Tesseract!).

> Is there any page with instructions to do this? Can a language be specified
> while OCRing? eg. Better results are many times received using
> script/Devanagari instead of san for Sanskrit.

We switched over completely in mid-December 2020, and I'm still working
through a feature and documentation backlog, including document
discovery. But in general, if you set the right ISO-639 language (code
or name) in the "language" metadata field, that language should be used
exclusively - you can also set multiple languages. Potentially you could
also set a script in the language field; I must admit I have not tried
that yet.
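
To make that concrete, here is a rough Python sketch of how values from
the "language" field could be turned into a Tesseract "-l" argument. It
is an illustration only, not the actual module: the mapping table and
both function names are made up for this example, and I am assuming that
a "script/..." value would simply be passed through unchanged (which, as
said, I have not tried).

```python
# Hypothetical mapping from metadata language values (code or name) to
# Tesseract model names. Illustrative only; the real table is much larger.
LANGUAGE_TO_MODEL = {
    "san": "san",
    "sanskrit": "san",
    "hin": "hin",
    "hindi": "hin",
    "eng": "eng",
    "english": "eng",
}


def tesseract_lang_argument(metadata_languages):
    """Build the value for Tesseract's -l option from metadata values."""
    models = []
    for value in metadata_languages:
        key = value.strip()
        if key.startswith("script/"):
            # Assumption: script models such as script/Devanagari are
            # passed through as-is.
            model = key
        else:
            model = LANGUAGE_TO_MODEL.get(key.lower())
            if model is None:
                raise ValueError(f"no Tesseract model known for {value!r}")
        if model not in models:  # drop duplicates, keep the given order
            models.append(model)
    if not models:
        raise ValueError("no languages given")
    # Tesseract accepts several models joined with '+'.
    return "+".join(models)


def tesseract_command(image_path, output_base, metadata_languages):
    """Return the argument list for one Tesseract command-line run."""
    lang = tesseract_lang_argument(metadata_languages)
    return ["tesseract", image_path, output_base, "-l", lang]
```

With this, tesseract_command("page.png", "page", ["Sanskrit", "eng"])
gives a command ending in "-l san+eng". A script model only works if the
tessdata directory actually contains the matching script/ subdirectory.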

If you omit the language field altogether, the module will figure out
which scripts are being used, then perform OCR with the detected scripts
as data packs, then perform language analysis on the corpus, and finally
through some heuristics pick the (potentially multiple) languages it
believes the piece is written in, and perform a final OCR step using
those languages and their associated scripts. (A repo with this Python
instrumentation code will follow soon.)
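
Until that repo is up, the flow can be sketched roughly as below. This
is not the real code: every function passed in (detect_scripts, run_ocr,
guess_languages) is a hypothetical placeholder, and the "keep languages
above a minimum share of the votes" rule with its 0.2 threshold is only
a stand-in assumption for the actual heuristics.

```python
from collections import Counter


def auto_ocr(page, detect_scripts, run_ocr, guess_languages,
             script_for_language, min_share=0.2):
    """Sketch of OCR without a language hint.

    All callables are placeholders supplied by the caller:
      detect_scripts(page)   -> list of script names, e.g. ["Devanagari"]
      run_ocr(page, models)  -> recognised text for the given models
      guess_languages(text)  -> iterable of language codes, one per vote
    script_for_language maps a language code to its script name.
    """
    # 1. Work out which scripts appear on the page.
    scripts = detect_scripts(page)

    # 2. First OCR pass using only the script models.
    draft_text = run_ocr(page, ["script/" + s for s in scripts])

    # 3. Language analysis on the draft text.
    votes = Counter(guess_languages(draft_text))
    total = sum(votes.values())

    # 4. Placeholder heuristic: keep languages with enough of the votes.
    languages = [lang for lang, n in votes.most_common()
                 if total and n / total >= min_share]
    if not languages:
        # Nothing convincing found; fall back to the script-only result.
        return draft_text, []

    # 5. Final OCR pass with the chosen languages and their scripts.
    models = list(languages)
    for lang in languages:
        script_model = "script/" + script_for_language[lang]
        if script_model not in models:
            models.append(script_model)
    return run_ocr(page, models), languages
```

The function returns both the final text and the list of chosen
languages, so a caller could store the detected languages alongside the
OCR output.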

I wouldn't mind chatting about this some more, but perhaps (?) off-list
would be a better way to do that - either way is fine by me.

> Regarding your question about tessdata, there have only been minor changes
> to tessdata files but adding a tag is a good idea. I suggest you post this
> as a feature request in the repo.

I've created one: https://github.com/tesseract-ocr/tessdata_fast/issues/26

Cheers,
Merlijn

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/bee7f500-6172-9c23-542e-ed96ecda1d4d%40archive.org.
