Re: [tesseract-ocr] Retrain Tesseract 4.0.0 beta to recognise handwritten digits

Ramakant Kushwaha Tue, 17 Jul 2018 23:34:14 -0700

@Soumik, thanks, but I don't follow yet. Could you point me to some 
links that explain it? I am very new to this. Can you guide me in 
creating a text corpus of digits with different fonts?
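
A digits-only text corpus of the kind Soumik suggests in the quoted reply below could be produced with a short script. This is only a minimal sketch: the file name `digits.training_text`, the group lengths, and the number of lines are assumptions chosen for illustration, not values from this thread or from the Tesseract documentation.

```python
import random


def make_digit_corpus(num_lines=1000, min_groups=3, max_groups=8,
                      min_len=1, max_len=12, seed=0):
    """Return text lines made only of digit groups separated by single spaces."""
    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    lines = []
    for _ in range(num_lines):
        groups = []
        for _ in range(rng.randint(min_groups, max_groups)):
            length = rng.randint(min_len, max_len)
            groups.append("".join(rng.choice("0123456789") for _ in range(length)))
        lines.append(" ".join(groups))
    return lines


def write_corpus(path, lines):
    """Write one corpus line per text line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Hypothetical output name; pass whatever file your rendering step expects.
    write_corpus("digits.training_text", make_digit_corpus())
```

Varying the group lengths is meant to mimic account numbers of different sizes; if the real forms always hold a fixed number of digits, narrowing `min_len` and `max_len` to that size may be a better match.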


@Lorenzo, I want to detect the digits written in the boxes of the image 
below. It is a bank's cash deposit form with a very complex layout. I have 
to capture the account number and PAN number.

<https://lh3.googleusercontent.com/-KlAWj6TbcPI/W07exdGvG6I/AAAAAAAAJGQ/4_32r8dwWVgwCfhM2XT358jkABGAArBoACLcBGAs/s1600/dummy_crop.jpg>
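
Lorenzo mentions in the quoted text below that cutting an image into lines could be done programmatically. One common approach is a projection profile: count the ink pixels in each row, and treat each run of non-empty rows as a line; the same idea applied to columns separates the digits within a line. The sketch below is an assumption-laden illustration, not a method confirmed in this thread. It works on a plain 2D list where 1 means ink and 0 means background, so it stays self-contained; loading and binarizing a real scan (for example with OpenCV) is left out.

```python
def find_runs(profile, min_ink=1):
    """Return (start, end) pairs, end exclusive, where profile >= min_ink."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value >= min_ink and start is None:
            start = i                      # a run of inked rows/columns begins
        elif value < min_ink and start is not None:
            runs.append((start, i))        # the run ended just before index i
            start = None
    if start is not None:
        runs.append((start, len(profile)))  # run reaches the image edge
    return runs


def split_lines(image):
    """image: list of rows, 1 = ink, 0 = background. Returns row ranges."""
    row_profile = [sum(row) for row in image]
    return find_runs(row_profile)


def split_columns(image, top, bottom):
    """Return column ranges containing ink within rows top..bottom-1."""
    width = len(image[0])
    col_profile = [sum(image[r][c] for r in range(top, bottom))
                   for c in range(width)]
    return find_runs(col_profile)


if __name__ == "__main__":
    # Tiny hypothetical "page": two text lines separated by a blank row.
    page = [
        [0, 0, 0, 0, 0, 0],
        [0, 1, 1, 0, 1, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0],
    ]
    for top, bottom in split_lines(page):
        print((top, bottom), split_columns(page, top, bottom))
```

On a form with printed boxes, the box borders also count as ink, so a real pipeline would probably need to remove or ignore the ruling lines first, or raise `min_ink` above the border thickness.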
 

On Wednesday, July 18, 2018 at 11:38:42 AM UTC+5:30, Soumik Ranjan Dasgupta 
wrote:
>
> Try creating a text corpus containing only digits, rendered with various 
> handwriting-style fonts from fonts.google.com that come close to your dataset. 
> Use tesstrain.sh to render the images and lstmtraining to train 
> Tesseract; you should achieve fair accuracy.
>
> On Tue, Jul 17, 2018 at 11:38 PM Lorenzo Bolzani <l.bo...@gmail.com> wrote:
>
>> 
>> Generating the training data is a completely different problem from 
>> training tesseract.
>>
>> If you want to recognize full words it's better to have full words (or 
>> numbers), not individual characters so that the process of splitting the 
>> words into characters is done by tesseract. 
>>
>> That is, unless you just want to recognize individual characters; that 
>> looks more like an MNIST-like task for a simple neural network.
>>
>> I think there are tools to cut images into lines, but I've never used one. 
>> Or you could do this programmatically with OpenCV.
>>
>> There is no tool to generate the gt.txt files; you need to write these by 
>> hand. In this case your text is very regular, so you may just create one 
>> line manually (1 2 3 4...) and duplicate that one. Or you could use a very 
>> good online OCR service. 
>>
>>
>> But I'm not convinced this data is good for training. What does the real 
>> data that you want to recognize look like? Individual digits or full 
>> numbers? 
>>
>>
>>
>>
>> 2018-07-17 19:17 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>
>>> Thank you so much for guiding me.
>>>
>>> I have read the links and sub-links provided and, as suggested, I will use 
>>> OCR-D (https://github.com/OCR-D/ocrd-train) for training.
>>> I want to know the best way to create pairs of [*.tif, *.gt.txt] from a 
>>> TIFF image for two or more fonts. Is there any specific tool to generate 
>>> the line *.tif and *.gt.txt files required by OCR-D?
>>> I have data like the TIFF image below (20 images in total). Please guide me.
>>> Thank you
>>>
>>>
>>> <https://lh3.googleusercontent.com/-wdzw32GT4fk/W04iwd71ldI/AAAAAAAAJFA/lx3BfSnCujkKmch4oGRSJLFgkKG1uvuTgCLcBGAs/s1600/SCAN_20180716_145539118.tiff>
>>>
>>>
>>> On Wednesday, July 4, 2018 at 8:20:54 PM UTC+5:30, Joe wrote:
>>>>
>>>> Hi everybody!
>>>>
>>>> I'm trying this tool https://github.com/OCR-D/ocrd-train/ but without 
>>>> success so far. Tesseract and Leptonica are installed by the scripts.
>>>> Inspired by the test set provided in that repo, I created pairs of 
>>>> [*.tif, *.gt.txt] with binarized chars and TTF's from two fonts (1869 text 
>>>> lines in total).
>>>> You can see an example of my set in attachment that also contains files 
>>>> created by the training process.
>>>>
>>>> My guess is that something is wrong with my data.
>>>> Sometimes I can see the char train value increasing instead of 
>>>> decreasing, and the final error rate is still too high (about 60%).
>>>>
>>>> That new training process with LSTM is driving me crazy!
>>>> I would appreciate it if anyone with experience could take a look at my 
>>>> data set.
>>>>
>>>>
>>>> Joe. 
>>>
>>>
>>> On Tuesday, July 17, 2018 at 9:04:08 PM UTC+5:30, Lorenzo Blz wrote:
>>>>
>>>>
>>>> Have a look at this thread:
>>>>
>>>> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ
>>>>
>>>>
>>>> It's easier than it seems: you do not need per-character boxes with 
>>>> 4.0, just one per line (which ocr-d generates automatically). If your text 
>>>> is already split into lines you do not have to do anything more.
>>>>
>>>> Unicharset and lstmf files are also created by ocr-d.
>>>>
>>>>
>>>> Feel free to ask if you get stuck. I have this working now, but it's a 
>>>> bumpy road (lots of assertion failures and segmentation faults if you 
>>>> miss something). 
>>>>
>>>>
>>>> Bye
>>>>
>>>> Lorenzo
>>>>
>>>> 2018-07-17 15:03 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>>>
>>>>> Hi,
>>>>>
>>>>> Recently I have been trying to retrain Tesseract 4.0 to recognise 
>>>>> handwritten digits. I am following the official page but finding it very 
>>>>> difficult. It would be great if someone could elaborate on the steps below.
>>>>>
>>>>>
>>>>>    - Prepare training text. 
>>>>>    <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>
>>>>>    (I am using jTessBoxEditor for creating box files.)
>>>>>    - Render text to image + box file. (Or create hand-made box files 
>>>>>    for existing image data.)
>>>>>    - Make unicharset file. (Can be partially specified, i.e. created 
>>>>>    manually.) (I do not know how to do this.)
>>>>>    - Make a starter traineddata from the unicharset and optional 
>>>>>    dictionary data. 
>>>>>    
>>>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata>
>>>>>    - Run tesseract to process image + box file to make training data 
>>>>>    set.
>>>>>    - Run training on training data set.
>>>>>    - Combine data files.
>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>
>>
>>
>
>
> -- 
> Regards,
> Soumik Ranjan Dasgupta
>
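
Lorenzo's suggestion in the quoted text above, writing one ground-truth line by hand and duplicating it when every line image shows the same regular text, could be sketched as follows. The directory `data/ground-truth`, the sample text, and the choice to end each file with a newline are assumptions made for illustration; check them against what your training tool expects. The sketch is only appropriate when every image really contains identical text.

```python
from pathlib import Path


def write_ground_truth(image_dir, text, pattern="*.tif"):
    """Write `text` to a .gt.txt file next to every image matching `pattern`.

    Returns the list of files that were created.
    """
    written = []
    for image in sorted(Path(image_dir).glob(pattern)):
        gt_path = image.with_name(image.stem + ".gt.txt")
        if gt_path.exists():
            continue  # never overwrite a transcription written by hand
        gt_path.write_text(text + "\n", encoding="utf-8")
        written.append(gt_path)
    return written


if __name__ == "__main__":
    # Hypothetical directory and line content.
    for path in write_ground_truth("data/ground-truth", "0 1 2 3 4 5 6 7 8 9"):
        print("created", path)
```

Skipping files that already exist keeps any manually corrected transcriptions intact, so the script can be re-run after new line images are added.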

