Re: [tesseract-ocr] Retrain Tesseract 4.0.0 beta to recognise handwritten digits

Ramakant Kushwaha Wed, 18 Jul 2018 23:02:45 -0700

Thanks Lorenzo, 
I will try OpenCV + SIFT + MNIST and will update you soon.


On Wednesday, July 18, 2018 at 5:26:05 PM UTC+5:30, Lorenzo Blz wrote:
>
> 
>
> A MNIST trained model does character recognition, not detection. You first 
> need to isolate characters to use it. The advantage is that it is already 
> trained and I think it may work better than fine tuning tesseract because 
> the handwritten digits are quite different from standard fonts.
>
>
> The difference between recognizing characters and words is this:
> character: you send individual characters to tesseract 
> word: you send a big image with the whole word
>
> If the image contains "1234" I can call tesseract four times with four 
> images 1,2,3,4 and then join the results. Or I can pass just one big image 
> giving me "1234". You can get the same results in both ways. I know this is 
> very easy for you, just to make sure that we are talking about the same 
> thing.
>
> In this case there are the form boxes to complicate things. If you find a 
> good way to delete the boxes you can do whole words otherwise go for 
> individual characters.
>
> If you want to do whole words it's much better to train on whole words, 
> like a full line from the 20 pages. 
>
> In theory, for real text/words, doing words is better but here you have 
> "random" codes and numbers. I think the easier thing here is to go for 
> individual characters.
>
>
> There are three completely different tasks that are getting mixed:
> 1. generating the training data (extracting it from the 20 pages)
> 2. extracting the individual characters (or words) from real forms
> 3. doing the actual recognition.
>
> Number 1: roughly follow the blur/threshold + findComponent idea to 
> generate tiff and gt.txt files. If you use a pre-trained MNIST model (like 
> in the link I provided) you do not need this step at all, you already have 
> the trained model (and the training data too).
>
> Number 2: it depends a lot on the quality of the forms. There are a lot of 
> different ways to do this part. My first attempt would be to realign the 
> form with a reference template using OpenCV SIFT. This is usually very 
> precise. Then you can just crop the individual boxes because now you know 
> the exact pixel position of the form elements. No need for blur, mser or 
> other things.
> Depending on the alignment precision, image quality/size, etc. you may 
> still have some box borders in your crops. You can just move to step 3 
> and see what happens: maybe it just works and you are done. Otherwise you 
> need to find a way to delete the box borders. I would simply try to take 
> smaller crops; it is no big deal if you sometimes cut a few pixels off the 
> letters. If that fails, try custom code, morphological opening, findComponents 
> (wiping the components that are too thin or too small), Hough lines, etc., 
> depending on how big the borders are, how many, how often, on how many sides, etc.
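
A minimal sketch of the cropping ideas above, assuming the form has already been aligned to a template so that the box positions are known. All coordinates, margins and thresholds are hypothetical values, and the function names are made up for illustration. Boxes are (x, y, w, h) tuples, the layout returned by OpenCV calls such as cv2.boundingRect.

```python
def field_boxes(x0, y0, box_w, box_h, count, pitch=None):
    """Return (x, y, w, h) for `count` equal boxes laid out left to right.

    x0, y0 is the top-left corner of the first box in the aligned form;
    pitch is the distance between box origins (defaults to the box width).
    """
    pitch = box_w if pitch is None else pitch
    return [(x0 + i * pitch, y0, box_w, box_h) for i in range(count)]


def shrink(box, margin):
    """Take a smaller crop so the printed box border falls outside it."""
    x, y, w, h = box
    if 2 * margin >= min(w, h):
        raise ValueError("margin leaves an empty crop")
    return (x + margin, y + margin, w - 2 * margin, h - 2 * margin)


def drop_border_fragments(components, min_side=3, min_area=20):
    """Discard components that are too thin or too small to be a character."""
    return [
        c for c in components
        if min(c[2], c[3]) >= min_side and c[2] * c[3] >= min_area
    ]


# Three boxes of a hypothetical field, each cropped 4 pixels smaller per side.
crops = [shrink(b, 4) for b in field_boxes(100, 50, 30, 40, count=3)]
print(crops)
```

With an image loaded as a NumPy array, each crop would then be taken as img[y:y+h, x:x+w]. The size filter is crude: a narrow handwritten "1" can be as thin as a border fragment, so the thresholds need checking against real scans.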
>
> Maybe you can do this part (detection/extraction) with tesseract too, try 
> the hocr output. I've never used it, I do not know if it works with tables. 
> Maybe it won't work on the whole page but it will work on two small crops 
> of the upper blocks you are interested in (this means that you still need 
> to align the page unless your scans are already aligned/oriented).
>
> Use gimp to prepare the images for these tests and see what works better; 
> do not waste time doing this "programming first". Crop the region with gimp 
> and run hocr from the command line.
>
> You can also do this part (detection/extraction) with hough lines, 
> template matching, findComponents, etc. Use what works best, it also 
> depends on speed requirements.
>
> Number 3: this is easy once you have a small image with just a single 
> character inside. You do not need to do a binary black/white image, gray is 
> fine (at least it is what works best for me). You can use a MNIST trained 
> model or tesseract. If you have enough time try both and see what works 
> best. For tesseract try different image sizes.
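
A rough sketch of the per-character route described above: each field is a list of single-character crops, every crop goes through a classifier, and the results are joined. The classify callable is a placeholder; an MNIST-trained model or a per-character Tesseract call (page segmentation mode 10 treats the image as a single character) could sit behind it. The field names and the stand-in classifier are assumptions for illustration only.

```python
def read_fields(field_crops, classify):
    """Recognise each field one character at a time and join the results.

    field_crops: dict mapping a field name to its crops, left to right.
    classify:    callable taking one crop and returning one character
                 (hypothetical; an MNIST model or a Tesseract call).
    """
    return {
        name: "".join(classify(crop) for crop in crops)
        for name, crops in field_crops.items()
    }


# Stand-in classifier so the sketch runs without a model: the "crops" here
# are just labels, and the classifier looks them up in a table.
demo_crops = {"Account": ["c1", "c2", "c3"], "Mobile": ["c7", "c7"]}
lookup = {"c1": "1", "c2": "2", "c3": "3", "c7": "7"}
result = read_fields(demo_crops, lookup.__getitem__)
for name, value in result.items():
    print(f"{name}: {value}")
```

Because the crops come from known template positions, the field name (account number, mobile number, date) is known before recognition starts, so the values do not have to be told apart afterwards.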
>
>
> Lorenzo
>
>
> 2018-07-18 12:20 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>
>> @Lorenzo
>> As per my understanding, MNIST is useful for detecting individual 
>> chars/digits, so to use MNIST I have to do the steps below (correct me if 
>> I am wrong):
>> 1. Gray + Threshold (Opencv)
>> 2. Extract Connected components (MSER opencv)
>> 3. run a loop over connected components list(sorted) and crop individual 
>> digit or raw
>> 4. pass it to MNIST trained model 
>> 5. save the result
>>
>> NOTE: Here I do not know how I will distinguish between account number, 
>> mobile number and date.
>> I have not tried this yet; I will try the above method based on your suggestion.
>>
>> ##TESSERACT
>> I am using Tesseract because I want to extract words (like account, 
>> PAN, date, mobile) and their corresponding values (key-value pair 
>> extraction); that's why I thought it would be good to use Tesseract.
>> What I have done till now:
>> 1. I am getting a scanned image
>> 2. crop the desired region (PIL + OpenCV)
>> 3. gray + blur + threshold
>> 4. connected component extraction (MSER + OpenCV)
>> 5. black (text) and white (background)
>> 6. passing this (the step 5 image) to Tesseract; it only detects digital 
>> words and numerals, and also some handwritten digits
>> 7. Please help me on this step (should I use MNIST or train 
>> Tesseract for handwriting?)
>> 8. I am not able to remove the borders around the digits (please suggest some 
>> technique)
>>
>> I am attaching sample images (original image)
>>
>>
>> <https://lh3.googleusercontent.com/-HBJ8UOkvrdU/W08S_INjasI/AAAAAAAAJGk/CjxFF12rNC4GdDj1vPEGEhgTiYQRuLoxACLcBGAs/s1600/crop.jpg>
>>
>>
>>
>> Image after (blur+threshold+MSER+black&white)
>>
>>
>> <https://lh3.googleusercontent.com/-KmbRU5rNgQU/W08TQLxXHAI/AAAAAAAAJGs/KxEZgeqtJ1MqI7lzJAde7bJ9PLpv9w0MACLcBGAs/s1600/mser_extracted_text.jpg>
>>   
>>
>> Final result will look like:
>> Account : 123456789054321
>> PAN: CYY******1*
>> Mobile:7777788888
>> Date:17/07/2018
>>
>> Please suggest alternative for solving this. 
>>
>> On Wednesday, July 18, 2018 at 2:48:32 PM UTC+5:30, Lorenzo Blz wrote:
>>>
>>> 
>>> This is exactly the MNIST problem 
>>> <http://corochann.com/mnist-dataset-introduction-1138.html>. I would 
>>> not use tesseract for this. You can download something like this:
>>>
>>> https://github.com/EN10/KerasMNIST
>>>
>>> that comes with pre-trained models too.
>>>
>>> The problem you'll have will be to extract the digits from the boxes, I 
>>> would use opencv, probably SIFT to align the form. Then you need to delete 
>>> the black borders, or just leave them there and see what happens. Or repeat 
>>> the training adding random black boxes around the digits.
>>>
>>> So I would first try to understand how you want to extract the data: what 
>>> your REAL data looks like. There is no point in training on something 
>>> different, unless this is an exercise or an assignment and you'll get the 
>>> digits already extracted. 
>>>
>>> If you want to train tesseract on your images do some blur+threshold so 
>>> that numbers become black blobs. Run findComponent with opencv. Sort by x 
>>> and y. Now just iterate over the blobs and crop from the original image and 
>>> assign the labels (you know the correct one because your sequence is 
>>> fixed). 
>>> Delete or manually straighten the skewed lines with gimp to keep things 
>>> simple.
>>>
>>> blobs_img = threshold(blur(img))
>>> digits = sort(findComponents(blobs_img))   # by y, then by x
>>> i = 0
>>> for d in digits:
>>>     tiff = crop from the original image using d's coordinates
>>>     gt.txt = i
>>>     i = (i+1) % 10
>>>
>>> Now you have the tiff images and the gt.txt files to run ocr-d.
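
The ordering and labelling part of the pseudocode above could be sketched as follows. This is only an illustration under assumptions: the boxes are (x, y, w, h) tuples as produced by OpenCV functions such as cv2.connectedComponentsWithStats or cv2.boundingRect ("findComponents" is shorthand, not an OpenCV function), the row tolerance and file name prefix are made up, and the digits on the page are assumed to repeat 0 to 9 in reading order as in the pseudocode.

```python
def reading_order(boxes, row_tolerance=10):
    """Sort (x, y, w, h) boxes top to bottom, then left to right in each row.

    Boxes whose y differs from the first box of the current row by at most
    row_tolerance pixels are treated as belonging to the same text row.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [b for row in rows for b in sorted(row, key=lambda b: b[0])]


def training_pairs(boxes, prefix="digit"):
    """Pair every box with a .tif name, a .gt.txt name and its label.

    The label cycles 0..9 because the written sequence is assumed fixed.
    """
    pairs = []
    for i, box in enumerate(reading_order(boxes)):
        stem = f"{prefix}_{i:04d}"
        pairs.append((box, stem + ".tif", stem + ".gt.txt", str(i % 10)))
    return pairs


# Four hypothetical blobs: three on the first row, one on the second.
blobs = [(30, 2, 8, 12), (10, 0, 8, 12), (10, 40, 8, 12), (50, 1, 8, 12)]
for box, tif_name, gt_name, label in training_pairs(blobs):
    print(box, tif_name, gt_name, label)
```

Each crop would then be saved under the .tif name and the label written to the matching .gt.txt file. Sorting on the raw (y, x) pair alone tends to break on slightly skewed rows, which is why the rows are grouped with a tolerance first.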
>>>
>>> Maybe there are some tools to do this by hand, one digit at a time:
>>>
>>> https://github.com/tesseract-ocr/tesseract/wiki/AddOns
>>>
>>>
>>> Lorenzo
>>>
>>>
>>> 2018-07-18 8:33 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>>
>>>> @Soumik, thanks, but I am not getting it; please provide me some 
>>>> links to understand it. I am very new to this. Can you guide me in 
>>>> creating a text corpus of digits with different fonts?
>>>>
>>>> @Lorenzo, I want to detect the digits written in the boxes of the image below; it's 
>>>> a cash deposit form of a bank with a very complex layout. I have to capture 
>>>> the details of the account no. and PAN no.
>>>>
>>>>
>>>> <https://lh3.googleusercontent.com/-KlAWj6TbcPI/W07exdGvG6I/AAAAAAAAJGQ/4_32r8dwWVgwCfhM2XT358jkABGAArBoACLcBGAs/s1600/dummy_crop.jpg>
>>>>  
>>>>
>>>> On Wednesday, July 18, 2018 at 11:38:42 AM UTC+5:30, Soumik Ranjan 
>>>> Dasgupta wrote:
>>>>>
>>>>> Try creating a text corpus with only digits using various handwritten 
>>>>> fonts that come close to your dataset from fonts.google.com. 
>>>>> Use tesstrain.sh for rendering the images, and lstmtraining to train 
>>>>> tesseract - you'll achieve a fair accuracy.
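
A small sketch of what a digits-only text corpus could look like: random digit strings, one per line, to be rendered with handwriting-style fonts afterwards. The line count, the length range and the seed are arbitrary assumptions, and the function name is made up for illustration.

```python
import random


def digit_corpus(num_lines=1000, min_len=4, max_len=16, seed=0):
    """Return num_lines strings made only of the digits 0-9.

    Lengths are drawn uniformly from min_len..max_len; the fixed seed
    makes the corpus reproducible.
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(num_lines):
        length = rng.randint(min_len, max_len)
        lines.append("".join(rng.choice("0123456789") for _ in range(length)))
    return lines


# Print a handful of lines; a real corpus would be written to a text file.
sample = digit_corpus(num_lines=5)
print("\n".join(sample))
```

If the real forms contain fixed-length values (account or mobile numbers, for example), matching min_len and max_len to those lengths is probably a better choice than the range used here.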
>>>>>
>>>>> On Tue, Jul 17, 2018 at 11:38 PM Lorenzo Bolzani <l.bo...@gmail.com> 
>>>>> wrote:
>>>>>
>>>>>> 
>>>>>> Generating the training data is a completely different problem from 
>>>>>> training tesseract.
>>>>>>
>>>>>> If you want to recognize full words it's better to have full words 
>>>>>> (or numbers), not individual characters so that the process of splitting 
>>>>>> the words into characters is done by tesseract. 
>>>>>>
>>>>>> Unless you just want to recognize individual characters. This looks 
>>>>>> more like a MNIST-like task for a simple neural network.
>>>>>>
>>>>>> I think there are tools to cut images into lines but I've never used 
>>>>>> one. Or you could do this by programming with opencv.
>>>>>>
>>>>>> There is no tool to generate the gt.txt; you need to write these by 
>>>>>> hand. In this case your text is very regular so you may just create one 
>>>>>> line manually (1 2 3 4...) and duplicate that one. Or you could use a 
>>>>>> very good online ocr service. 
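
Duplicating one hand-written transcription for every line image could be sketched like this. It assumes every line image in the folder really contains the same text and that the transcription for name.tif is expected in name.gt.txt; the function name and the example text are hypothetical.

```python
from pathlib import Path


def write_line_transcriptions(image_dir, text, pattern="*.tif"):
    """Write the same ground-truth text next to every matching line image.

    For each image such as line_01.tif a file line_01.gt.txt is created
    in the same folder. Returns the list of files written.
    """
    written = []
    for image in sorted(Path(image_dir).glob(pattern)):
        gt_file = image.with_name(image.stem + ".gt.txt")
        gt_file.write_text(text + "\n", encoding="utf-8")
        written.append(gt_file)
    return written
```

A call would look like write_line_transcriptions("lines", "1 2 3 4 5 6 7 8 9 0"), where "lines" is a hypothetical folder of line images. Lines that differ from the common text still need their .gt.txt corrected by hand.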
>>>>>>
>>>>>>
>>>>>> But I'm not convinced this data is good for training. What does the 
>>>>>> real data that you want to recognize look like? Individual digits or 
>>>>>> full numbers? 
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2018-07-17 19:17 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>>>>>
>>>>>>> Thank you so much for guiding me.
>>>>>>>
>>>>>>> I have read the links and sub-links provided and, as suggested, I will use 
>>>>>>> OCR-D (https://github.com/OCR-D/ocrd-train) for training.
>>>>>>> I want to know the best way to create pairs of [*.tif, *.gt.txt] 
>>>>>>> from a tif image for two or more fonts. Is there any specific tool 
>>>>>>> to generate the line *.tif and *.gt.txt files required by OCR-D?
>>>>>>> I have data like the tiff image below (20 images in total). Please guide me.
>>>>>>> Thank you
>>>>>>>
>>>>>>>
>>>>>>> <https://lh3.googleusercontent.com/-wdzw32GT4fk/W04iwd71ldI/AAAAAAAAJFA/lx3BfSnCujkKmch4oGRSJLFgkKG1uvuTgCLcBGAs/s1600/SCAN_20180716_145539118.tiff>
>>>>>>>
>>>>>>>
>>>>>>> On Wednesday, July 4, 2018 at 8:20:54 PM UTC+5:30, Joe wrote:
>>>>>>>>
>>>>>>>> Hi everybody!
>>>>>>>>
>>>>>>>> I'm trying this tool https://github.com/OCR-D/ocrd-train/ but 
>>>>>>>> without success so far. Tesseract and Leptonica are installed by the 
>>>>>>>> scripts.
>>>>>>>> Inspired by the test set provided in that repo, I created pairs of 
>>>>>>>> [*.tif, *.gt.txt] with binarized chars and TTF's from two fonts (1869 
>>>>>>>> text lines in total).
>>>>>>>> You can see an example of my set in attachment that also contains 
>>>>>>>> files created by the training process.
>>>>>>>>
>>>>>>>> My guess is that something is wrong with my data.
>>>>>>>> Sometimes I can see the char train value increasing instead of 
>>>>>>>> decreasing, and the final error rate is still too high (about 60%).
>>>>>>>>
>>>>>>>> That new training process with LSTM is driving me crazy!
>>>>>>>> I would appreciate it if anyone with experience could take a look at 
>>>>>>>> my data set.
>>>>>>>>
>>>>>>>>
>>>>>>>> Joe. 
>>>>>>>
>>>>>>>
>>>>>>> On Tuesday, July 17, 2018 at 9:04:08 PM UTC+5:30, Lorenzo Blz wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Have a look at this thread:
>>>>>>>>
>>>>>>>> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ
>>>>>>>>
>>>>>>>>
>>>>>>>> It's easier than it seems, you do not need per character boxes with 
>>>>>>>> 4.0, just one per line (that ocr-d automatically generates). If your 
>>>>>>>> text is already split into lines you do not have to do anything more.
>>>>>>>>
>>>>>>>> Unicharset and lstmf files are also created by ocr-d.
>>>>>>>>
>>>>>>>>
>>>>>>>> Feel free to ask if you get stuck; I have this working now but it's 
>>>>>>>> a bumpy road (lots of "assertion failed"/segmentation faults if you miss 
>>>>>>>> something). 
>>>>>>>>
>>>>>>>>
>>>>>>>> Bye
>>>>>>>>
>>>>>>>> Lorenzo
>>>>>>>>
>>>>>>>> 2018-07-17 15:03 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Recently I have been trying to retrain Tesseract 4.0 to recognise 
>>>>>>>>> handwritten digits. I am following the official page but finding it very 
>>>>>>>>> difficult. It would be great if someone could elaborate on the steps below:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - Prepare training text. 
>>>>>>>>>    <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>
>>>>>>>>>    (I am using jTessBoxEditor for creating box files.)
>>>>>>>>>    - Render text to image + box file. (Or create hand-made box 
>>>>>>>>>    files for existing image data.)
>>>>>>>>>    - Make unicharset file. (Can be partially specified, i.e. 
>>>>>>>>>    created manually.) (I do not know how to do this.)
>>>>>>>>>    - Make a starter traineddata from the unicharset and optional 
>>>>>>>>>    dictionary data. 
>>>>>>>>>    <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata>
>>>>>>>>>    - Run tesseract to process image + box file to make training 
>>>>>>>>>    data set.
>>>>>>>>>    - Run training on training data set.
>>>>>>>>>    - Combine data files.
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com
>>>>>>>>>  
>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>> Regards,
>>>>> Soumik Ranjan Dasgupta
>>>>>
>>>
>
>

