Yes, fine-tuning can be done. Please see
https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#tutorial-guide-to-lstmtraining
If you already have scanned images and their box files, you can also try
Makefile-based training using the tesstrain repo.
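
As a rough sketch of the fine-tuning route from the tutorial linked above:
extract the LSTM network from the official Malayalam model, then continue
training from it. This assumes you start from mal.traineddata in
tessdata_best and have already generated .lstmf training files. Every path
and the iteration count below are hypothetical placeholders, and the script
only prints the commands, it does not run them.

```python
import shlex

# All paths are hypothetical placeholders; adjust them to your own setup.
BEST_MODEL = "tessdata_best/mal.traineddata"  # official model to start from
EXTRACTED_LSTM = "work/mal.lstm"              # network extracted from BEST_MODEL
LIST_FILE = "work/mal.training_files.txt"     # text file listing .lstmf files
OUTPUT_PREFIX = "work/mal_finetuned"          # prefix for training checkpoints


def extract_command(traineddata, lstm_out):
    """Build the combine_tessdata call that extracts the LSTM network."""
    return ["combine_tessdata", "-e", traineddata, lstm_out]


def finetune_command(lstm_in, traineddata, list_file, output_prefix,
                     max_iterations=400):
    """Build the lstmtraining call that continues from the extracted network."""
    return [
        "lstmtraining",
        "--continue_from", lstm_in,
        "--traineddata", traineddata,
        "--train_listfile", list_file,
        "--model_output", output_prefix,
        "--max_iterations", str(max_iterations),  # assumed value, tune as needed
    ]


def build_commands():
    """Return both commands in the order they would be run."""
    return [
        extract_command(BEST_MODEL, EXTRACTED_LSTM),
        finetune_command(EXTRACTED_LSTM, BEST_MODEL, LIST_FILE, OUTPUT_PREFIX),
    ]


if __name__ == "__main__":
    for argv in build_commands():
        print(shlex.join(argv))
```

With tesstrain, the rough equivalent is a single "make training" call with
MODEL_NAME, START_MODEL=mal and TESSDATA set; please check the tesstrain
README for the exact variables and the ground-truth layout it expects.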

On Fri, Mar 19, 2021 at 2:31 PM avinash singh <avinasht...@gmail.com> wrote:

> Hello Shree,
>
> Thank you for your reply,
>
> We used Tesseract 4.0 alpha.
>
> The trained data was taken from the following:
>
> https://github.com/tesseract-ocr/tessdata_best
>
> https://tesseract-ocr.github.io/tessdoc/Data-Files.html
>
>
> I am sharing a doc with the Tesseract 4.0 alpha results for the same
> image you shared, along with the expected results.
>
> Also, please let us know if there is any method to fine-tune the incorrect
> characters.
>
>
> On Mon, Mar 15, 2021 at 8:15 PM avinash singh <avinasht...@gmail.com>
> wrote:
>
>>
>> On Monday, March 15, 2021 at 2:19:42 PM UTC+5:30 shree wrote:
>>
>>> See the attached image, a screenshot from Malayalam wiki, and the OCRed
>>> text using traineddata from tessdata_best, tessdata_fast and tessdata.
>>> To me it seems like recognition is 90+% correct.
>>>
>>> On Sunday, March 14, 2021 at 6:09:17 AM UTC+5:30 shree wrote:
>>>
>>>> You have not stated the version of tesseract that you are using.
>>>>
>>>> >We downloaded some online training data available for the language
>>>> Malayalam
>>>>
>>>> You have not mentioned where you got it from. Are these the official
>>>> traineddata files?
>>>>
>>>> >we found that few special characters in the language are not picked up
>>>> by the training data properly.
>>>>
>>>> Which characters?
>>>>
>>>> >Current achieved 60% accuracy
>>>>
>>>> With the LSTM engine, better results are expected.
>>>>
>>>> Please share a sample image with its expected result.
>>>>
>>>> You can also try
>>>>
>>>> https://ocr.sanskritdictionary.com/
>>>>
>>>>
>>>>
>>>> On Sun, Mar 14, 2021, 00:41 avinash singh <avina...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are working on a project for underprivileged kids, and we need to
>>>>> build an OCR for the Malayalam language.
>>>>>
>>>>> We downloaded some online training data available for the language
>>>>> Malayalam. The current accuracy is around 60%, and we found that a few
>>>>> special characters in the language are not picked up by the training
>>>>> data properly.
>>>>>
>>>>> So we wanted to fine-tune the current training data. We did some
>>>>> research and then downloaded jTessBoxEditor to create training data,
>>>>> but we couldn't edit the incorrect characters.
>>>>>
>>>>> Then we tried QT Box Editor. We were able to edit the incorrect
>>>>> letters, but we couldn't generate the training data through the software.
>>>>>
>>>>> Finally, we tried Cygwin with the command line to generate the custom
>>>>> data, but we failed to combine the training data.
>>>>>
>>>>> As this is for an NGO, our company wants to close this project with the
>>>>> currently achieved 60% accuracy. I really wish to complete this, as the
>>>>> English translation is completely wrong. Can someone please guide us on
>>>>> how to train the data?
>>>>>
>>>>> Any help would be much appreciated.
>>>>> Thanks in advance.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/84a6fc1f-300a-4aac-85b8-99c47a7d88f4n%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/84a6fc1f-300a-4aac-85b8-99c47a7d88f4n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>

