Hello Frank,

I am wondering whether you have worked on "3. OCR Kannada inscriptions and keep them in OCR'ed format". I am very interested in multilingual OCR for Kannada inscriptions. You mention Epigraphy documents; might they be Epigraphia Carnatica? If so, I would be grateful for any knowledge you have to share.

Thank you,
Jajwalya

On Friday, May 15, 2020 at 5:39:07 AM UTC-4 Frank wrote:
> Hi, I've just installed tesseract to OCR some old Epigraphy documents. I
> used Google Colab as well as a Mac install. All fine, except I am unable to
> get the text with IAST: characters are substituted (ā becomes i, etc.). I
> tried using the lang attribute as lat, but it doesn't find a Latin lang
> package, and installing the Latin script didn't help. I've searched through all of
> Shree's work on GitHub, but can't figure this out. I have three objectives:
> 1. OCR English pages and search through them
> 2. It would be nice to convert the Sanskrit into IAST and search through it
> 3. OCR Kannada inscriptions and keep them in OCR'ed format. This is
> optional, a "good to have"
>
> Writing the search code doesn't seem to be tough; however, the IAST
> recognition/transcription is the challenge. Accuracy is not very important,
> as I have to search through volumes of inscriptions for specific keywords
> to recategorize a lot of miscategorised inscriptions on my research topic.
> Any help would be appreciated. The sheer volume makes the Google OCR
> solution suggested by Shree elsewhere impracticable.
>
> I'm new at Python and tesseract, though I have programmed in the past.
> Any help is appreciated.
>
>
> On Friday, July 27, 2018 at 6:29:09 AM UTC+2, shree wrote:
>
>> You can try the IAST ones from
>> https://github.com/Shreeshrii/tessdata_shreetest?files=1
>>
>> On Fri 27 Jul, 2018, 8:27 AM Shree Devi Kumar, <shree...@gmail.com>
>> wrote:
>>
>>> There is no official traineddata for san_latn or IAST. I have created some
>>> experimental versions but the output is not fully accurate.
>>>
>>> On Fri 27 Jul, 2018, 12:21 AM John Muccigrosso, <jmuc...@gmail.com>
>>> wrote:
>>>
>>>> You're telling tesseract that your text is in Latin. You need the
>>>> traineddata for san-lat.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesser...@googlegroups.com.
>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/d2fc7942-16a2-48f0-9651-920616179d54%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
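
For the keyword-search objective in the quoted message, here is a minimal sketch, not a tested pipeline. It assumes the OCR text is already available per page and that a diacritic-insensitive match is acceptable, since accuracy is described as not very important. The function names (`fold_diacritics`, `find_keywords`, `ocr_page`) are hypothetical, and the `lang` value you would pass for an experimental IAST traineddata file is an assumption that depends on what you download. Note that folding only helps when diacritics are dropped or inconsistent; it does not repair a wrong base letter such as ā read as i.

```python
import subprocess
import unicodedata
from typing import Dict, List, Optional


def fold_diacritics(text: str) -> str:
    """Strip combining marks and case-fold, e.g. IAST 'Śāsana' -> 'sasana'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def find_keywords(pages: Dict[str, str], keywords: List[str]) -> Dict[str, List[str]]:
    """Map each keyword to the sorted page names whose folded text contains it."""
    folded_pages = {name: fold_diacritics(text) for name, text in pages.items()}
    hits: Dict[str, List[str]] = {}
    for keyword in keywords:
        needle = fold_diacritics(keyword)
        hits[keyword] = sorted(
            name for name, text in folded_pages.items() if needle in text
        )
    return hits


def ocr_page(image_path: str, lang: str = "eng",
             tessdata_dir: Optional[str] = None) -> str:
    """Run the tesseract CLI on one image and return the recognised text.

    Assumption: `lang` names a traineddata file found in `tessdata_dir`
    (or in the default tessdata location when `tessdata_dir` is None).
    """
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Usage would be something like building `pages` by calling `ocr_page` on each scanned image, then calling `find_keywords(pages, ["śāsana", "grant"])` to get the pages worth a manual look. Folding both the keywords and the OCR text means a page matches whether or not the diacritics survived recognition.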