Yes, fine-tuning is possible. Please see the "Tutorial guide to lstmtraining" section of the Tesseract 4.00 training documentation:

https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#tutorial-guide-to-lstmtraining

If you already have scanned images and their box files, you can also try the Makefile-based training in the tesstrain repo.
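As a rough illustration of the steps that tutorial walks through, here is a small Python sketch that only assembles the command lines (it does not run them). The work directory, the output name "mal_ft", the list-file name and the iteration count are hypothetical placeholders, and I am assuming mal.traineddata from tessdata_best as the starting model. Please check the flags against the tutorial for your installed version before relying on them.

```python
import shlex
from pathlib import Path
from typing import List

# Hypothetical locations; adjust to your own setup.
WORK = Path("work")
BASE = Path("tessdata_best") / "mal.traineddata"  # assumed starting model


def extract_lstm(base: Path, lstm_out: Path) -> List[str]:
    """Extract the LSTM component from an existing traineddata file."""
    return ["combine_tessdata", "-e", str(base), str(lstm_out)]


def fine_tune(lstm: Path, base: Path, list_file: Path,
              model_output: Path, max_iterations: int = 400) -> List[str]:
    """Continue training from the extracted model on your own .lstmf files."""
    return [
        "lstmtraining",
        "--continue_from", str(lstm),
        "--traineddata", str(base),
        "--train_listfile", str(list_file),  # text file listing .lstmf files
        "--model_output", str(model_output),
        "--max_iterations", str(max_iterations),
    ]


def finalize(checkpoint: Path, base: Path, out: Path) -> List[str]:
    """Convert the training checkpoint into a usable traineddata file."""
    return [
        "lstmtraining", "--stop_training",
        "--continue_from", str(checkpoint),
        "--traineddata", str(base),
        "--model_output", str(out),
    ]


def tesstrain_make(model_name: str, start_model: str,
                   tessdata_dir: str, max_iterations: int) -> List[str]:
    """Equivalent fine-tuning call for the tesstrain Makefile."""
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata_dir}",
        f"MAX_ITERATIONS={max_iterations}",
    ]


def plan() -> List[List[str]]:
    """The three lstmtraining-route steps, in order."""
    lstm = WORK / "mal.lstm"
    prefix = WORK / "mal_ft"
    checkpoint = prefix.with_name(prefix.name + "_checkpoint")
    return [
        extract_lstm(BASE, lstm),
        fine_tune(lstm, BASE, WORK / "mal.training_files.txt", prefix),
        finalize(checkpoint, BASE, WORK / "mal_ft.traineddata"),
    ]


if __name__ == "__main__":
    for cmd in plan():
        print(shlex.join(cmd))
    print(shlex.join(tesstrain_make("mal_ft", "mal", "tessdata_best", 400)))
```

Running the script prints the commands so you can review them and paste them into a shell one at a time. The last line is the tesstrain alternative, run from a checkout of that repo.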
On Fri, Mar 19, 2021 at 2:31 PM avinash singh <avinasht...@gmail.com> wrote:

> Hello Shree,
>
> Thank you for your reply,
>
> We have used tesseract 4.0 alpha
>
> The Training Data is used from the below
>
> https://github.com/tesseract-ocr/tessdata_best
>
> https://tesseract-ocr.github.io/tessdoc/Data-Files.html
>
> Sharing a doc with the results of the tesseract 4.0 alpha for the same
> image you shared and the expected results.
>
> Also, please let us know if there is any method to fine-tune the incorrect
> characters.
>
> On Mon, Mar 15, 2021 at 8:15 PM avinash singh <avinasht...@gmail.com> wrote:
>
>> On Monday, March 15, 2021 at 2:19:42 PM UTC+5:30 shree wrote:
>>
>>> See attached image from a screenshot of Malayalam wiki and the OCRed
>>> text using traineddata from tessdata_best, tessdata_fast and tessdata.
>>> To me it seems like recognition is 90+% correct.
>>>
>>> On Sunday, March 14, 2021 at 6:09:17 AM UTC+5:30 shree wrote:
>>>
>>>> You have not stated the version of tesseract that you are using.
>>>>
>>>> >We downloaded some online training data available for the language Malayalam
>>>>
>>>> You have not mentioned from where you got it. Are these the official
>>>> traineddata files?
>>>>
>>>> >we found that few special characters in the language are not picked up by the training data properly.
>>>>
>>>> Which characters?
>>>>
>>>> >Current achieved 60% accuracy
>>>>
>>>> With the LSTM engine, better results are expected.
>>>>
>>>> Please share a sample image with its expected result.
>>>>
>>>> You can also try
>>>>
>>>> https://ocr.sanskritdictionary.com/
>>>>
>>>> On Sun, Mar 14, 2021, 00:41 avinash singh <avina...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are working on a project for underprivileged kids, we need to build
>>>>> an OCR for the Malayalam language.
>>>>>
>>>>> We downloaded some online training data available for the language
>>>>> Malayalam, the current accuracy is around 60%, we found that few special
>>>>> characters in the language are not picked up by the training data
>>>>> properly.
>>>>>
>>>>> So we wanted to fine-tune the current training data, we did some
>>>>> research and then downloaded Jtessbox editor for creating training data
>>>>> but we couldn't edit the incorrect character.
>>>>>
>>>>> then we tried the QT-Box editor, we were able to edit the incorrect
>>>>> letters but we couldn't generate the training data through the software
>>>>>
>>>>> Finally, we tried Cygwin with the command line to generate the custom
>>>>> data but we failed to combine the training data
>>>>>
>>>>> As this is for an NGO our company wants to close this project with the
>>>>> current achieved 60% accuracy, I really wish to complete this as the
>>>>> English translation is completely wrong can someone please guide us on how
>>>>> to train the data
>>>>>
>>>>> Any help would be much appreciated
>>>>> Thanks in advance
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/84a6fc1f-300a-4aac-85b8-99c47a7d88f4n%40googlegroups.com