I am trying to OCR a Santali book that mixes several scripts (Ol Chiki, 
English and Odia) with gImageReader 3.3.1 (17fa17), which uses 
Tesseract 4.1.0, but I am unable to get satisfactory results.

English + Odia works fine and gives very good results. But when I use 
Santali + Odia, English + Santali, or Santali + Odia + English, the output 
text comes out as Odia, English, or Odia and English respectively, instead 
of showing Ol Chiki text where it belongs. I have a file available for 
testing: 
<https://www.dropbox.com/s/xwvin9bqkwc4zol/Santali-Odia-English.tiff?dl=0>.
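For reference, I believe the combinations above correspond to the "-l" 
option of the tesseract command line, with language codes joined by "+". 
Below is a minimal sketch of the command I would expect to be equivalent. 
The helper name build_tesseract_command is my own, and the codes sat, ori 
and eng are an assumption based on the tessdata folder names; the 
traineddata files must exist under those names for it to work.

```python
import shlex


def build_tesseract_command(image_path, output_base, languages, tessdata_dir=None):
    """Return the argument list for one Tesseract run over several languages.

    `languages` is a list of language codes (assumed here: sat, ori, eng).
    Tesseract accepts several codes joined with "+" after the -l option.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    cmd = ["tesseract", image_path, output_base, "-l", "+".join(languages)]
    if tessdata_dir is not None:
        # Point Tesseract at the folder that holds the traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


if __name__ == "__main__":
    cmd = build_tesseract_command(
        "Santali-Odia-English.tiff", "out", ["sat", "ori", "eng"]
    )
    # Print the command so it can be pasted into a terminal.
    print(shlex.join(cmd))
```

Changing the order of the list (for example ["eng", "sat"] instead of 
["sat", "eng"]) makes it easy to try whether the order of the languages 
has any effect on the result.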

Also, when I use only the Santali tessdata, it renders English and Odia 
words in Ol Chiki script.

When I use "*sat.tessdata*" to scan a normal Santali image, it works well.

Note: Ol Chiki is the main writing script of the Santali people and is 
approved by the Government of India. I think Ol Chiki is a new script that 
is not well supported by much software, so the text output for the 
processed image always shows boxes. I worked around this by copying the 
text into Notepad and saving it. Exporting to PDF works fine; I created 
editable text from it without problems. I have created many OCR editable 
PDFs with gImageReader.
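
To tell a font display problem (boxes) apart from a real recognition 
problem, the output text can be checked character by character. Below is a 
rough sketch, not a definitive tool: it counts how many characters fall in 
the Unicode block for Ol Chiki (U+1C50 to U+1C7F), the block for Odia 
(U+0B00 to U+0B7F), or ASCII letters. The function names script_of and 
count_scripts are my own.

```python
from collections import Counter

# Unicode blocks for the two Indic scripts involved.
OL_CHIKI = range(0x1C50, 0x1C80)  # U+1C50..U+1C7F
ODIA = range(0x0B00, 0x0B80)      # U+0B00..U+0B7F


def script_of(ch):
    """Return a script label for one character, or None if it is not a letter of interest."""
    cp = ord(ch)
    if cp in OL_CHIKI:
        return "Ol Chiki"
    if cp in ODIA:
        return "Odia"
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return None


def count_scripts(text):
    """Count the characters of each script in the OCR output text."""
    return Counter(s for s in map(script_of, text) if s is not None)


if __name__ == "__main__":
    sample = "\u1c5a\u1c5e ab \u0b13"
    print(count_scripts(sample))
```

If the count for Ol Chiki is zero on a page that contains Ol Chiki text, 
the problem is in the recognition itself and not in the font used to 
display the result.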

My question is how to get combined multi-language output in Santali, 
Odia and English. I also want to know why the text output for a processed 
image comes out in English and Odia but not in Santali, or vice versa.

I have tried to train the language, but it takes a lot of time and I have 
little knowledge of coding. If there is any problem with sat.tessdata, then 
I can take up learning Tesseract training.

I have used the following tessdata:

   - Santali - https://github.com/indic-ocr/tessdata/tree/master/sat
   - Odia - https://github.com/indic-ocr/tessdata/tree/master/ori
   - English - default of gImageReader
