Hi, can I see your traineddata? I want to try it on my project.

On Monday, February 14, 2022 at 5:49:08 PM UTC+8 rkodi...@gmail.com wrote:

> These are the fonts I used.
>
> -Rajeev
>
> On Monday, February 14, 2022 at 3:11:33 PM UTC+5:30 Rajeev Kodippily wrote:
>
>> Hello All,
>>
>> I know that Tesseract is not intended to be used on handwritten data, but 
>> I'm trying to tackle a problem that has no straightforward solution at the 
>> moment: recognizing handwritten source code. There are no datasets of 
>> labelled handwritten source code from which to build a model from 
>> scratch.
>>
>> A 2017 study <https://arxiv.org/abs/1706.00069> evaluated the performance 
>> of the commercial engine MyScript <https://www.myscript.com/> on 
>> handwritten source code. The authors created and published an 
>> evaluation dataset of handwritten Python code samples. 
>> My aim is to compare their results with Tesseract 4.0's performance 
>> after using the training tools to train Tesseract to recognize their 
>> evaluation dataset. 
>>
>> As a first step, I fine-tuned tessdata_best by giving it the following 
>> langdata:
>>
>> 1. eng.training_text - for this file I used the actual ground truth 
>> source code of the handwritten samples (ultimately I would like the NN to 
>> learn a more general model from a lot of Python code, but as a 
>> first step I thought of just going with the target data itself)
>> 2. eng.wordlist - I gave this file the set of Python keywords, ordered 
>> from most frequent to least
>> 3. eng.punc and eng.numbers - I got rid of the expressions that I know 
>> will never appear in source code and kept the rest (keep in mind the 
>> dataset has only source code; the comments are all removed)
>>
>> I created the training data using about 27 handwriting fonts I found 
>> online.
>>
>> I have attached the data and scripts I used, along with the results for 
>> the two images 1.png and 9.png in *Results.txt*.
>>
>> For 9.png, as you can see, the fine-tuned model shows a slight improvement: 
>> its output has no out-of-vocabulary characters and the WER is lower. I 
>> noticed that the model works well for block letters, as in 9.png, but still 
>> cannot recognize messy handwriting, which makes sense. 
>>
>> In 1.png, where the handwriting is a bit cursive, we can't really say that 
>> the trained model is better.
>>
>>
>> My question is: what else can I try to decrease the WER relative to 
>> default Tesseract? What can I try differently? Again, I know the results 
>> won't be perfect, but my objective is to use the training tools and show 
>> that, after training, the model performs better than default Tesseract. 
>>
>> I'm going to try training from scratch and training a few layers next; 
>> any thoughts regarding those approaches would also be helpful.
>>
>> I have attached all my files and the training scripts used. 
>>
>> Any feedback would be highly appreciated!
>>
>> Thanks!
>> Rajeev. 
>>
>
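For anyone wanting to reproduce the WER comparison between default Tesseract and a fine-tuned model, here is a minimal sketch of one way to compute it. It is not the script from the attachments: the function name `word_error_rate` is made up for illustration, and it assumes words are split on whitespace, so a token like `foo(x):` counts as a single word. A different tokenization will give different numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are whitespace-separated tokens. This is an
    illustrative choice, not necessarily how the original results
    were computed.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "for i in range ( 10 ) :"
    ocr_output = "for 1 in range ( 10 ) :"
    print(word_error_rate(truth, ocr_output))
```

Running the same function on the output of the default model and of the fine-tuned model, against the same ground truth, gives two numbers that can be compared directly. Note that WER can exceed 1.0 when the OCR output contains many extra words.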
