An MNIST-trained model does character recognition, not detection: you first need to isolate the characters before you can use it. The advantage is that it comes already trained, and I think it may work better than fine-tuning Tesseract, because handwritten digits are quite different from standard fonts.
The difference between recognizing characters and words is this:

character: you send individual characters to tesseract
word: you send one big image with the whole word

If the image contains "1234" I can call tesseract four times with four images (1, 2, 3, 4) and then join the results, or I can pass one big image and get "1234" back. You get the same result either way. I know this is very easy for you, I just want to make sure we are talking about the same thing.

In this case the form boxes complicate things. If you find a good way to delete the boxes you can do whole words, otherwise go for individual characters. If you want to do whole words it is much better to train on whole words, like a full line from the 20 pages. In theory, for real text/words, whole words work better, but here you have "random" codes and numbers, so I think the easier approach is individual characters.

There are three completely different tasks that are getting mixed up:

1. generating the training data (extracting it from the 20 pages)
2. extracting the individual characters (or words) from the real forms
3. doing the actual recognition

Number 1: roughly follow the blur/threshold + findComponents idea to generate the tiff and gt.txt files. If you use a pre-trained MNIST model (like the one in the link I provided) you do not need this step at all: you already have the trained model (and the training data too).

Number 2: it depends a lot on the quality of the forms, and there are many different ways to do this part. My first attempt would be to realign the form with a reference template using OpenCV SIFT; this is usually very precise. Then you can just crop the individual boxes, because you now know the exact pixel positions of the form elements (see the alignment sketch at the end of this message). No need for blur, MSER or anything else. Depending on the alignment precision, image quality/size, etc. you may still have some box borders left in your crops. You can just move to step 3 and see what happens: maybe it just works and you are done. Otherwise you need a way to delete the box borders. I would simply try taking smaller crops; it is no big deal if you occasionally cut a few pixels off a letter. If that fails: custom code, morphological opening, findComponents wiping the components that are too thin or too small, Hough lines, etc., depending on how big the borders are, how many there are, how often they appear, on how many sides, and so on (there is a small cleanup sketch below as well).

Maybe you can also do this part (detection/extraction) with tesseract: try the hOCR output. I have never used it, so I do not know whether it works with tables. Maybe it will not work on the whole page, but it might work on two small crops of the upper blocks you are interested in (which means you still need to align the page, unless your scans are already aligned/oriented). Use GIMP to prepare the images for these tests and see what works best; do not waste time doing this "programming first". Crop the region with GIMP and run hOCR from the command line. You can also do this part with Hough lines, template matching, findComponents, etc. Use whatever works best; it also depends on your speed requirements.

Number 3: this is easy once you have a small image with just a single character inside. You do not need a binary black/white image, grayscale is fine (at least that is what works best for me). You can use an MNIST-trained model or tesseract (a minimal MNIST sketch is below too). If you have enough time, try both and see which works best. For tesseract, try different image sizes.
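
Here is a rough sketch of the SIFT alignment + cropping idea for Number 2, assuming OpenCV with the contrib modules. The file names and the box coordinates are placeholders: you measure the real coordinates once on your reference scan.

import cv2
import numpy as np

# Empty reference form and a filled-in scan to align against it (placeholder names).
ref = cv2.imread("reference_form.png", cv2.IMREAD_GRAYSCALE)
scan = cv2.imread("filled_form.png", cv2.IMREAD_GRAYSCALE)

# SIFT keypoints and descriptors (cv2.SIFT_create() in newer OpenCV builds).
sift = cv2.xfeatures2d.SIFT_create()
kp_ref, des_ref = sift.detectAndCompute(ref, None)
kp_scan, des_scan = sift.detectAndCompute(scan, None)

# Match descriptors and keep the good matches (Lowe's ratio test).
matches = cv2.BFMatcher().knnMatch(des_scan, des_ref, k=2)
good = [m[0] for m in matches if len(m) == 2 and m[0].distance < 0.7 * m[1].distance]

# Homography that maps the scan onto the reference, then warp the scan.
src = np.float32([kp_scan[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp_ref[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
aligned = cv2.warpPerspective(scan, H, (ref.shape[1], ref.shape[0]))

# The boxes are now at known pixel positions: crop them directly.
# (x, y, w, h) measured once on the reference form, made-up values here.
account_boxes = [(120, 200, 40, 50), (165, 200, 40, 50)]
digit_crops = [aligned[y:y + h, x:x + w] for (x, y, w, h) in account_boxes]

After this, each element of digit_crops should contain a single handwritten digit, possibly with some border left in it.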
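
If some box borders survive in the crops, this is the kind of cleanup I mean (smaller crop plus findComponents wiping the thin/small components). The margin and the thresholds are guesses to tune on your real images.

import cv2
import numpy as np

def clean_digit_crop(crop, margin=4):
    # 1. Smaller crop: just drop a few pixels on every side.
    crop = crop[margin:-margin, margin:-margin]

    # 2. Threshold to white-on-black blobs and label the connected components.
    _, binary = cv2.threshold(crop, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)

    cleaned = np.zeros_like(binary)
    for i in range(1, n):  # label 0 is the background
        x, y, w, h, area = stats[i]
        # Wipe components that are too small or too thin to be a digit stroke:
        # leftover borders are usually long and only a couple of pixels wide.
        if area < 20 or min(w, h) <= 2:
            continue
        cleaned[labels == i] = 255
    return cleaned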
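
And for Number 3, a minimal sketch of running a pre-trained Keras MNIST model on a single-character crop, for example one of the models from the KerasMNIST repo I linked earlier (quoted below). The "cnn.h5" file name and the convolutional input shape are assumptions, adapt them to the model you actually download.

import cv2
import numpy as np
from keras.models import load_model

model = load_model("cnn.h5")  # assumed file name, use the model file shipped with the repo

def predict_digit(crop):
    # MNIST expects 28x28, white digit on black background, values in [0, 1].
    img = cv2.resize(crop, (28, 28), interpolation=cv2.INTER_AREA)
    img = 255 - img                        # invert: ink becomes white
    img = img.astype("float32") / 255.0
    img = img.reshape(1, 28, 28, 1)        # for a CNN; a dense model would want (1, 784)
    return int(np.argmax(model.predict(img)))

# Join the per-box predictions into the field value, e.g. with the crops
# from the alignment sketch above:
account_number = "".join(str(predict_digit(c)) for c in digit_crops)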
Lorenzo

2018-07-18 12:20 GMT+02:00 Ramakant Kushwaha <ramakant.sing...@gmail.com>:
> @Lorenzo
> As per my understanding MNIST is useful for recognizing individual
> characters/digits, so to use MNIST I have to do the steps below (correct me if I
> am wrong):
> 1. Gray + threshold (OpenCV)
> 2. Extract connected components (MSER, OpenCV)
> 3. Run a loop over the (sorted) connected components list and crop each
> individual digit or row
> 4. Pass it to the MNIST-trained model
> 5. Save the result
>
> NOTE: here I do not know how I will distinguish between account number,
> mobile number or date.
> I have not tried this; I will try the above method based on your suggestion.
>
> ## TESSERACT
> I am using Tesseract because I want to extract words (like account,
> PAN, date, mobile) and their corresponding values (key-value pair
> extraction); that's why I thought it would be good to use Tesseract.
> What I have done so far:
> 1. I get the scanned image
> 2. Crop the desired region (PIL + OpenCV)
> 3. Gray + blur + threshold
> 4. Connected component extraction (MSER + OpenCV)
> 5. Black (text) on white (background)
> 6. Pass this (the step-5 image) to Tesseract; it only detects printed
> words and numbers, plus some handwritten digits
> 7. Please help me with this step (should I use MNIST or train
> Tesseract for handwriting?)
> 8. I am not able to remove the border around the digits (please suggest a
> technique)
>
> I am attaching sample images (original image):
>
> <https://lh3.googleusercontent.com/-HBJ8UOkvrdU/W08S_INjasI/AAAAAAAAJGk/CjxFF12rNC4GdDj1vPEGEhgTiYQRuLoxACLcBGAs/s1600/crop.jpg>
>
> Image after blur + threshold + MSER + black & white:
>
> <https://lh3.googleusercontent.com/-KmbRU5rNgQU/W08TQLxXHAI/AAAAAAAAJGs/KxEZgeqtJ1MqI7lzJAde7bJ9PLpv9w0MACLcBGAs/s1600/mser_extracted_text.jpg>
>
> The final result should look like:
> Account: 123456789054321
> PAN: CYY******1*
> Mobile: 7777788888
> Date: 17/07/2018
>
> Please suggest an alternative for solving this.
>
> On Wednesday, July 18, 2018 at 2:48:32 PM UTC+5:30, Lorenzo Blz wrote:
>>
>> This is exactly the MNIST problem
>> <http://corochann.com/mnist-dataset-introduction-1138.html>. I would not
>> use tesseract for this. You can download something like this:
>>
>> https://github.com/EN10/KerasMNIST
>>
>> which comes with pre-trained models too.
>>
>> The problem you'll have is extracting the digits from the boxes. I
>> would use OpenCV, probably SIFT, to align the form. Then you need to delete
>> the black borders, or just leave them there and see what happens. Or repeat
>> the training adding random black boxes around the digits.
>>
>> So I would first try to understand how you want to extract the data: what
>> your REAL data looks like. There is no point in training on something
>> different. Unless this is an exercise or an assignment and you'll get the
>> digits already extracted.
>>
>> If you want to train tesseract on your images, do some blur + threshold so
>> that the numbers become black blobs. Run findComponents with OpenCV. Sort by x
>> and y. Now just iterate over the blobs, crop from the original image and
>> assign the labels (you know the correct one because your sequence is
>> fixed).
>> Delete or manually straighten the skewed lines with GIMP to keep things
>> simple.
>>
>> blobs_img = blur + threshold(img)
>> digits = findComponents(blobs_img) and sort
>> i = 0
>> for d in digits:
>>     tiff = crop from original image using d coordinates
>>     gt.txt = i
>>     i = (i + 1) % 10
>>
>> Now you have the tiff images and the gt.txt files to run ocr-d.
>>
>> Maybe there are some tools to do this by hand, one digit at a time:
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/AddOns
>>
>> Lorenzo
>>
>> 2018-07-18 8:33 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>
>>> @Soumik, thanks, but I am not getting it; please provide me some
>>> links to understand it. I am very new to this. Can you guide me in
>>> creating a text corpus of digits with different fonts?
>>>
>>> @Lorenzo, I want to detect the digits written in the boxes of the image below; it's
>>> a cash deposit form of a bank with a very complex layout. I have to capture
>>> the details of the account number and PAN number.
>>>
>>> <https://lh3.googleusercontent.com/-KlAWj6TbcPI/W07exdGvG6I/AAAAAAAAJGQ/4_32r8dwWVgwCfhM2XT358jkABGAArBoACLcBGAs/s1600/dummy_crop.jpg>
>>>
>>> On Wednesday, July 18, 2018 at 11:38:42 AM UTC+5:30, Soumik Ranjan Dasgupta wrote:
>>>>
>>>> Try creating a text corpus with only digits, using various handwritten
>>>> fonts from fonts.google.com that come close to your dataset.
>>>> Use tesstrain.sh for rendering the images, and lstmtraining to train
>>>> tesseract - you'll achieve a fair accuracy.
>>>>
>>>> On Tue, Jul 17, 2018 at 11:38 PM Lorenzo Bolzani <l.bo...@gmail.com> wrote:
>>>>
>>>>> Generating the training data is a completely different problem from
>>>>> training tesseract.
>>>>>
>>>>> If you want to recognize full words it's better to have full words (or
>>>>> numbers), not individual characters, so that the process of splitting the
>>>>> words into characters is done by tesseract.
>>>>>
>>>>> Unless you just want to recognize individual characters. That looks
>>>>> more like an MNIST-like task for a simple neural network.
>>>>>
>>>>> I think there are tools to cut images into lines but I've never used
>>>>> one. Or you could do it yourself with OpenCV.
>>>>>
>>>>> There is no tool to generate the gt.txt files, you need to write these by
>>>>> hand. In this case your text is very regular, so you may just create one
>>>>> line manually (1 2 3 4...) and duplicate it. Or you could use a very
>>>>> good online OCR service.
>>>>>
>>>>> But I'm not convinced this data is good for training. What does the
>>>>> real data that you want to recognize look like? Individual digits or full
>>>>> numbers?
>>>>>
>>>>> 2018-07-17 19:17 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>>>>
>>>>>> Thank you so much for guiding me.
>>>>>>
>>>>>> I have read the links and sub-links provided and, as suggested, I will use
>>>>>> OCR-D (https://github.com/OCR-D/ocrd-train) for training.
>>>>>> I want to know the best way to create pairs of [*.tif, *.gt.txt]
>>>>>> from a tif image for two or more fonts. Is there any specific
>>>>>> tool to generate the line *.tif and *.gt.txt files as required by OCR-D?
>>>>>> I have data like the tiff image below (20 images in total), please guide me.
>>>>>> Thank you
>>>>>>
>>>>>> <https://lh3.googleusercontent.com/-wdzw32GT4fk/W04iwd71ldI/AAAAAAAAJFA/lx3BfSnCujkKmch4oGRSJLFgkKG1uvuTgCLcBGAs/s1600/SCAN_20180716_145539118.tiff>
>>>>>>
>>>>>> On Wednesday, July 4, 2018 at 8:20:54 PM UTC+5:30, Joe wrote:
>>>>>>>
>>>>>>> Hi everybody!
>>>>>>>
>>>>>>> I'm trying this tool https://github.com/OCR-D/ocrd-train/ but
>>>>>>> without success so far. Tesseract and Leptonica are installed by the
>>>>>>> scripts.
>>>>>>> Inspired by the test set provided in that repo, I created pairs of >>>>>>> [*.tif, *.gt.txt] with binarized chars and TTF's from two fonts (1869 >>>>>>> text >>>>>>> lines in total). >>>>>>> You can see an example of my set in attachment that also contains >>>>>>> files created by the training process. >>>>>>> >>>>>>> My guess is that something is wrong with my data. >>>>>>> Sometimes I can see the char train value increasing instead of >>>>>>> decreasing and the final error rate still too high (about 60%). >>>>>>> >>>>>>> That new training process with LSTM is driving me crazy! >>>>>>> I would appreciate if anyone with experience could take a look to my >>>>>>> data set. >>>>>>> >>>>>>> >>>>>>> Joe. >>>>>> >>>>>> >>>>>> On Tuesday, July 17, 2018 at 9:04:08 PM UTC+5:30, Lorenzo Blz wrote: >>>>>>> >>>>>>> >>>>>>> Have a look at this thread: >>>>>>> >>>>>>> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ >>>>>>> >>>>>>> >>>>>>> It's easier than it seems, you do not need per character boxes with >>>>>>> 4.0, just one per line (that ocr-d automatically generates). If your >>>>>>> text >>>>>>> is already split into lines you do not have to do anything more. >>>>>>> >>>>>>> Unicharset and lstmf files are also created by ocr-d. >>>>>>> >>>>>>> >>>>>>> Feel free to ask if you get stuck, now I have this working but it's >>>>>>> a bumpy road (lot of assertion failed/segmentation fault if you miss >>>>>>> something). >>>>>>> >>>>>>> >>>>>>> Bye >>>>>>> >>>>>>> Lorenzo >>>>>>> >>>>>>> 2018-07-17 15:03 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com> >>>>>>> : >>>>>>> >>>>>>>> *Hi,* >>>>>>>> >>>>>>>> *Recently I trying to retrain Tesseract 4.0 for recognising >>>>>>>> handwritten digits. I am following official page but finding it very >>>>>>>> difficult. It would be great if someone can elaborate below steps* >>>>>>>> >>>>>>>> >>>>>>>> - Prepare training text. >>>>>>>> >>>>>>>> <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>(I >>>>>>>> am using jTessBoxEditor for creating box files ) >>>>>>>> - Render text to image + box file. (Or create hand-made box >>>>>>>> files for existing image data.) >>>>>>>> - Make unicharset file. (Can be partially specified, ie created >>>>>>>> manually). (Do not how to do this) >>>>>>>> - Make a starter traineddata from the unicharset and optional >>>>>>>> dictionary data. >>>>>>>> >>>>>>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata> >>>>>>>> - Run tesseract to process image + box file to make training >>>>>>>> data set. >>>>>>>> - Run training on training data set. >>>>>>>> - Combine data files. >>>>>>>> >>>>>>>> -- >>>>>>>> You received this message because you are subscribed to the Google >>>>>>>> Groups "tesseract-ocr" group. >>>>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>>>> send an email to tesseract-oc...@googlegroups.com. >>>>>>>> To post to this group, send email to tesser...@googlegroups.com. >>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr. >>>>>>>> To view this discussion on the web visit >>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f60 >>>>>>>> 2-42e9-b3b8-121fb151a49e%40googlegroups.com >>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com?utm_medium=email&utm_source=footer> >>>>>>>> . >>>>>>>> For more options, visit https://groups.google.com/d/optout. 
>>>>>>>> >>>>>>> >>>>>>> -- >>>>>> You received this message because you are subscribed to the Google >>>>>> Groups "tesseract-ocr" group. >>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>> send an email to tesseract-oc...@googlegroups.com. >>>>>> To post to this group, send email to tesser...@googlegroups.com. >>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr. >>>>>> To view this discussion on the web visit >>>>>> https://groups.google.com/d/msgid/tesseract-ocr/885fce6d-2b8 >>>>>> 1-4bc2-9eee-4dea8df5c263%40googlegroups.com >>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/885fce6d-2b81-4bc2-9eee-4dea8df5c263%40googlegroups.com?utm_medium=email&utm_source=footer> >>>>>> . >>>>>> >>>>>> For more options, visit https://groups.google.com/d/optout. >>>>>> >>>>> >>>>> -- >>>>> You received this message because you are subscribed to the Google >>>>> Groups "tesseract-ocr" group. >>>>> To unsubscribe from this group and stop receiving emails from it, send >>>>> an email to tesseract-oc...@googlegroups.com. >>>>> To post to this group, send email to tesser...@googlegroups.com. >>>>> Visit this group at https://groups.google.com/group/tesseract-ocr. >>>>> To view this discussion on the web visit >>>>> https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLwvYgQ >>>>> iLO%2BdWDgaEtqOSg5sgezpic7_HggT5ij9qxZ2Ng%40mail.gmail.com >>>>> <https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLwvYgQiLO%2BdWDgaEtqOSg5sgezpic7_HggT5ij9qxZ2Ng%40mail.gmail.com?utm_medium=email&utm_source=footer> >>>>> . >>>>> For more options, visit https://groups.google.com/d/optout. >>>>> >>>> >>>> >>>> -- >>>> Regards, >>>> Soumik Ranjan Dasgupta >>>> >>> -- >>> You received this message because you are subscribed to the Google >>> Groups "tesseract-ocr" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to tesseract-oc...@googlegroups.com. >>> To post to this group, send email to tesser...@googlegroups.com. >>> Visit this group at https://groups.google.com/group/tesseract-ocr. >>> To view this discussion on the web visit https://groups.google.com/d/ms >>> gid/tesseract-ocr/ce16eecf-6f30-4e1f-b397-85beabc18301%40goo >>> glegroups.com >>> <https://groups.google.com/d/msgid/tesseract-ocr/ce16eecf-6f30-4e1f-b397-85beabc18301%40googlegroups.com?utm_medium=email&utm_source=footer> >>> . >>> >>> For more options, visit https://groups.google.com/d/optout. >>> >> >> -- > You received this message because you are subscribed to the Google Groups > "tesseract-ocr" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to tesseract-ocr+unsubscr...@googlegroups.com. > To post to this group, send email to tesseract-ocr@googlegroups.com. > Visit this group at https://groups.google.com/group/tesseract-ocr. > To view this discussion on the web visit https://groups.google.com/d/ > msgid/tesseract-ocr/479f0447-bb05-4b41-a507-9e571bb5015b% > 40googlegroups.com > <https://groups.google.com/d/msgid/tesseract-ocr/479f0447-bb05-4b41-a507-9e571bb5015b%40googlegroups.com?utm_medium=email&utm_source=footer> > . > > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To post to this group, send email to tesseract-ocr@googlegroups.com. 