Re: [tesseract-ocr] Retrain Tesseract 4.0.0 beta to recognise handwritten digits

Soumik Ranjan Dasgupta Wed, 18 Jul 2018 00:50:55 -0700

I normally use a custom Python script to generate the training text.
I'm attaching a sample text corpus that contains only the digits 1234.
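
A minimal sketch of what such a generator could look like. This is not the
actual script mentioned above; the function names, the group lengths, the line
count, and the output path are all assumptions made for illustration.

```python
import random


def make_digit_corpus(num_lines=200, groups_per_line=8, max_group_len=10, seed=0):
    """Return a list of text lines made of space-separated random digit groups."""
    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    lines = []
    for _ in range(num_lines):
        groups = []
        for _ in range(groups_per_line):
            length = rng.randint(1, max_group_len)
            groups.append("".join(rng.choice("0123456789") for _ in range(length)))
        lines.append(" ".join(groups))
    return lines


def write_corpus(path, lines):
    """Write the corpus to `path`, one text line per row (path is hypothetical)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
```

Calling write_corpus("digits.training_text", make_digit_corpus()) would produce
a plain text file that can then be rendered with the fonts of your choice.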


On Wed, Jul 18, 2018 at 12:04 PM Ramakant Kushwaha <
ramakant.sing...@gmail.com> wrote:

> @Soumik, thanks, but I am not following. Could you point me to some links
> that explain it? I am very new to this. Can you guide me in creating a
> text corpus of digits in different fonts?
>
> @Lorenzo, I want to detect the digits written in the boxes in the image
> below. It's a bank's cash deposit form with a very complex layout. I have
> to capture the account number and PAN number.
>
>
> <https://lh3.googleusercontent.com/-KlAWj6TbcPI/W07exdGvG6I/AAAAAAAAJGQ/4_32r8dwWVgwCfhM2XT358jkABGAArBoACLcBGAs/s1600/dummy_crop.jpg>
>
>
> On Wednesday, July 18, 2018 at 11:38:42 AM UTC+5:30, Soumik Ranjan
> Dasgupta wrote:
>>
>> Try creating a text corpus with only digits, using various handwritten-style
>> fonts from fonts.google.com that come close to your dataset.
>> Use tesstrain.sh to render the images and lstmtraining to train
>> Tesseract; you should achieve fair accuracy.
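
A rough sketch of those two steps as command lines, built in Python without
running anything. The flags are the ones shown in the Tesseract 4.00 training
wiki; every path and font name is a placeholder, and fine-tuning from an
existing model with --continue_from is an assumption (the post above does not
say whether to fine-tune or train from scratch).

```python
import shlex


def tesstrain_cmd(training_text, fonts_dir, fonts, langdata_dir, tessdata_dir,
                  output_dir, lang="eng"):
    """Build the tesstrain.sh argument list that renders the line training data."""
    return [
        "tesstrain.sh",
        "--lang", lang,
        "--linedata_only",               # only produce data for LSTM training
        "--noextract_font_properties",
        "--training_text", training_text,
        "--fonts_dir", fonts_dir,
        "--fontlist", *fonts,            # one argument per font name
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--output_dir", output_dir,
    ]


def lstmtraining_cmd(continue_from, traineddata, train_listfile, model_output,
                     max_iterations=400):
    """Build an lstmtraining argument list for fine-tuning an existing model.

    `continue_from` is assumed to be an .lstm file extracted beforehand with
    `combine_tessdata -e`.
    """
    return [
        "lstmtraining",
        "--continue_from", continue_from,
        "--traineddata", traineddata,
        "--train_listfile", train_listfile,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]


def show(cmd):
    """Render an argument list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in cmd)
```

Printing show(tesstrain_cmd(...)) gives a command you can inspect before
running it by hand or through subprocess.run.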
>>
>> On Tue, Jul 17, 2018 at 11:38 PM Lorenzo Bolzani <l.bo...@gmail.com>
>> wrote:
>>
>>> 
>>> Generating the training data is a completely different problem from
>>> training tesseract.
>>>
>>> If you want to recognize full words it's better to have full words (or
>>> numbers), not individual characters so that the process of splitting the
>>> words into characters is done by tesseract.
>>>
>>> Unless you just want to recognize individual characters, in which case
>>> this looks more like an MNIST-like task for a simple neural network.
>>>
>>> I think there are tools to cut images into lines, but I've never used
>>> one. Or you could do this programmatically with OpenCV.
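
One simple way to cut a page into lines is a horizontal projection profile:
rows that contain ink belong to a text line, empty rows separate lines. The
sketch below is an illustration only. It works on a plain nested list of 0/1
values so that it has no dependencies; with OpenCV you would typically get such
an array from cv2.imread followed by cv2.threshold. The function name and the
min_height parameter are made up for this example, and a form with printed
boxes would need the box borders removed first.

```python
def find_text_lines(binary, min_height=1):
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink.

    `binary` is a list of rows; each row is a list of 0 (background) or 1 (ink).
    Bands shorter than `min_height` rows are discarded as noise.
    """
    bands = []
    start = None  # first row of the band currently being scanned, if any
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new band begins
        elif not has_ink and start is not None:
            if y - start >= min_height:
                bands.append((start, y))   # band ended on the previous row
            start = None
    # Close a band that runs to the bottom edge of the image.
    if start is not None and len(binary) - start >= min_height:
        bands.append((start, len(binary)))
    return bands
```

Each returned range can then be used to crop one line image out of the page.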
>>>
>>> There is no tool to generate the gt.txt files; you need to write these
>>> by hand. In this case your text is very regular, so you may just create
>>> one line manually (1 2 3 4...) and duplicate it. Or you could use a very
>>> good online OCR service.
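
A minimal sketch of the "write one line and duplicate it" idea, assuming every
line image in a directory shows exactly the same text and that the ground
truth file sits next to the image with the same base name (foo.tif and
foo.gt.txt). The function name and the directory layout are assumptions.

```python
from pathlib import Path


def write_ground_truth(image_dir, text, pattern="*.tif"):
    """Write `text` to a .gt.txt file next to every matching line image."""
    written = []
    for image in sorted(Path(image_dir).glob(pattern)):
        gt_path = image.with_name(image.stem + ".gt.txt")
        gt_path.write_text(text + "\n", encoding="utf-8")
        written.append(gt_path)
    return written
```

If the lines differ from image to image, the transcription still has to be
typed or corrected by hand for each file.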
>>>
>>>
>>> But I'm not convinced this data is good for training. What does the real
>>> data that you want to recognize look like? Individual digits or full
>>> numbers?
>>>
>>>
>>>
>>>
>>> 2018-07-17 19:17 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>>
>>>> Thank you so much for guiding me.
>>>>
>>>> I have read the links and sub-links provided and, as suggested, I will use
>>>> OCR-D (https://github.com/OCR-D/ocrd-train) for training.
>>>> I want to know the best way to create pairs of [*.tif, *.gt.txt] from a
>>>> TIFF image for two or more fonts. Is there any specific tool to generate
>>>> the line *.tif and *.gt.txt files required by OCR-D?
>>>> I have data like the TIFF image below (20 images in total). Please guide me.
>>>> Thank you
>>>>
>>>>
>>>> <https://lh3.googleusercontent.com/-wdzw32GT4fk/W04iwd71ldI/AAAAAAAAJFA/lx3BfSnCujkKmch4oGRSJLFgkKG1uvuTgCLcBGAs/s1600/SCAN_20180716_145539118.tiff>
>>>>
>>>>
>>>> On Wednesday, July 4, 2018 at 8:20:54 PM UTC+5:30, Joe wrote:
>>>>>
>>>>> Hi everybody!
>>>>>
>>>>> I'm trying this tool https://github.com/OCR-D/ocrd-train/ but without
>>>>> success so far. Tesseract and Leptonica are installed by the scripts.
>>>>> Inspired by the test set provided in that repo, I created pairs of
>>>>> [*.tif, *.gt.txt] with binarized chars and TTF's from two fonts (1869 text
>>>>> lines in total).
>>>>> You can see an example of my set in attachment that also contains
>>>>> files created by the training process.
>>>>>
>>>>> My guess is that something is wrong with my data.
>>>>> Sometimes I can see the char train value increasing instead of
>>>>> decreasing, and the final error rate is still too high (about 60%).
>>>>>
>>>>> That new training process with LSTM is driving me crazy!
>>>>> I would appreciate it if anyone with experience could take a look at my
>>>>> data set.
>>>>>
>>>>>
>>>>> Joe.
>>>>
>>>>
>>>> On Tuesday, July 17, 2018 at 9:04:08 PM UTC+5:30, Lorenzo Blz wrote:
>>>>>
>>>>>
>>>>> Have a look at this thread:
>>>>>
>>>>> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ
>>>>>
>>>>>
>>>>> It's easier than it seems, you do not need per character boxes with
>>>>> 4.0, just one per line (that ocr-d automatically generates). If your text
>>>>> is already split into lines you do not have to do anything more.
>>>>>
>>>>> Unicharset and lstmf files are also created by ocr-d.
>>>>>
>>>>>
>>>>> Feel free to ask if you get stuck. I have this working now, but it's a
>>>>> bumpy road (lots of assertion failures and segmentation faults if you
>>>>> miss something).
>>>>>
>>>>>
>>>>> Bye
>>>>>
>>>>> Lorenzo
>>>>>
>>>>> 2018-07-17 15:03 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Recently I have been trying to retrain Tesseract 4.0 to recognise
>>>>>> handwritten digits. I am following the official page but finding it
>>>>>> very difficult. It would be great if someone could elaborate on the
>>>>>> steps below:
>>>>>>
>>>>>>
>>>>>>    - Prepare training text.
>>>>>>    <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>
>>>>>>    (I am using jTessBoxEditor for creating box files.)
>>>>>>    - Render text to image + box file. (Or create hand-made box files
>>>>>>    for existing image data.)
>>>>>>    - Make unicharset file. (Can be partially specified, i.e. created
>>>>>>    manually.) (I do not know how to do this.)
>>>>>>    - Make a starter traineddata from the unicharset and optional
>>>>>>    dictionary data.
>>>>>>    <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata>
>>>>>>    - Run tesseract to process image + box file to make training data
>>>>>>    set.
>>>>>>    - Run training on training data set.
>>>>>>    - Combine data files.
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Regards,
>> Soumik Ranjan Dasgupta
>>
>


-- 
Regards,
Soumik Ranjan Dasgupta


eng.training_text
Description: Binary data
