Thanks Lorenzo, I will try OpenCV + SIFT + MNIST and will update you soon.
On Wednesday, July 18, 2018 at 5:26:05 PM UTC+5:30, Lorenzo Blz wrote:
>
> A MNIST trained model does character recognition, not detection. You first need to isolate characters to use it. The advantage is that it is already trained and I think it may work better than fine tuning tesseract because the handwritten digits are quite different from standard fonts.
>
> The difference between recognizing characters and words is this:
> character: you send individual characters to tesseract
> word: you send a big image with the whole word
>
> If the image contains "1234" I can call tesseract four times with four images 1,2,3,4 and then join the results. Or I can pass just one big image giving me "1234". You can get the same results in both ways. I know this is very easy for you, just to make sure that we are talking about the same thing.
>
> In this case there are the form boxes to complicate things. If you find a good way to delete the boxes you can do whole words otherwise go for individual characters.
>
> If you want to do whole words it's much better to train on whole words, like a full line from the 20 pages.
>
> In theory, for real text/words, doing words is better but here you have "random" codes and numbers. I think the easier thing here is to go for individual characters.
>
> There are three completely different tasks that are getting mixed:
> 1. generating the training data (extracting it from the 20 pages)
> 2. extracting the individual characters (or words) from real forms
> 3. doing the actual recognition.
>
> Number 1: roughly follow the blur/threshold + findComponent idea to generate tiff and gt.txt files. If you use a pre-trained MNIST model (like in the link I provided) you do not need this step at all, you already have the trained model (and the training data too).
>
> Number 2: it depends a lot on the quality of the forms. There are a lot of different ways to do this part.
> My first attempt would be to realign the form with a reference template using OpenCV SIFT. This is usually very precise. Then you can just crop the individual boxes because now you know the exact pixel position of the form elements. No need for blur, mser or other things.
> Depending on the alignment precision, image quality/size, etc. you may still have some box borders in your crops. You can just move to step 3 and see what happens: maybe it just works and you are done. Otherwise you need to find a way to delete the box borders. I would simply try to take smaller crops, no big deal if you cut off a few pixels from letters sometimes. If that fails, custom code, morphology opening, findComponents wiping the ones too thin or too small, hough lines, etc. depending on how big the borders are, how many, how often, on how many sides, etc.
>
> Maybe you can do this part (detection/extraction) with tesseract too, try the hocr output. I've never used it, I do not know if it works with tables. Maybe it won't work on the whole page but it will work on two small crops of the upper blocks you are interested in (this means that you still need to align the page unless your scans are already aligned/oriented).
>
> Use gimp to prepare the images for these tests and see what works better, do not waste time doing this "programming first". Crop the region with gimp and run hocr from command line.
>
> You can also do this part (detection/extraction) with hough lines, template matching, findComponents, etc. Use what works best, it also depends on speed requirements.
>
> Number 3: this is easy once you have a small image with just a single character inside. You do not need to do a binary black/white image, gray is fine (at least it is what works best for me). You can use a MNIST trained model or tesseract. If you have enough time try both and see what works best. For tesseract try different image sizes.
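As a rough illustration of the "align to a template, then take slightly smaller crops" idea described above, here is a minimal Python sketch. It assumes a recent OpenCV build where `cv2.SIFT_create` is available (older builds shipped SIFT in the contrib module) plus NumPy. The function names `align_to_template` and `box_crops`, the 0.75 ratio threshold, the RANSAC tolerance, and every pixel value are illustrative assumptions, not something given in this thread.

```python
def align_to_template(scan_gray, template_gray, ratio=0.75):
    """Warp a grayscale scan onto a blank reference form using SIFT matches.

    Assumption: OpenCV (with cv2.SIFT_create) and NumPy are installed and
    both inputs are single-channel uint8 images.
    """
    import cv2
    import numpy as np

    sift = cv2.SIFT_create()
    kp_scan, des_scan = sift.detectAndCompute(scan_gray, None)
    kp_tmpl, des_tmpl = sift.detectAndCompute(template_gray, None)
    if des_scan is None or des_tmpl is None:
        raise ValueError("no keypoints found in one of the images")

    # Ratio test: keep a match only if it is clearly better than the runner-up.
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_scan, des_tmpl, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    if len(good) < 4:
        raise ValueError("not enough matches to estimate a homography")

    src = np.float32([kp_scan[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_tmpl[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    homography, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if homography is None:
        raise ValueError("homography estimation failed")

    height, width = template_gray.shape[:2]
    return cv2.warpPerspective(scan_gray, homography, (width, height))


def box_crops(x0, y0, box_w, box_h, n_boxes, inset=3):
    """Return (left, top, right, bottom) rectangles for a row of adjacent boxes.

    x0, y0 is the top-left corner of the first box in template coordinates.
    Each rectangle is shrunk by `inset` pixels per side to skip the border.
    """
    if 2 * inset >= min(box_w, box_h):
        raise ValueError("inset leaves no pixels to crop")
    crops = []
    for i in range(n_boxes):
        left = x0 + i * box_w + inset
        top = y0 + inset
        crops.append((left, top, left + box_w - 2 * inset, top + box_h - 2 * inset))
    return crops
```

With a NumPy image, each rectangle would be applied as `aligned[top:bottom, left:right]`. Increasing `inset` trades a few pixels of the digit for less box border, which is the compromise suggested above.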
>
> Lorenzo
>
> 2018-07-18 12:20 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>
>> @Lorenzo
>> As per my understanding MNIST is useful for detecting individual chars/digits, so for using MNIST I have to do the steps below, *correct me if I am wrong*
>> 1. Gray + Threshold (OpenCV)
>> 2. Extract connected components (MSER OpenCV)
>> 3. run a loop over the connected components list (sorted) and crop individual digit or raw
>> 4. pass it to the MNIST trained model
>> 5. save the result
>>
>> NOTE: (Here I do not know how I will distinguish between account number and mobile number or date)
>> I have not tried this, will try the above method based on your suggestion
>>
>> ##TESSERACT
>> I am using Tesseract because I want to extract words (like account, PAN, date, mobile) and their corresponding values (key value pair extraction), that's why I thought it would be good to use tesseract
>> what I have done till now
>> 1. I am getting the scanned image
>> 2. crop desired region (PIL + OpenCV)
>> 3. gray + blur + threshold
>> 3. connected component extraction (MSER + OpenCV)
>> 4. black (text) and white (background)
>> 5. passing this (4th step image) to tesseract, it's only detecting digital words and numerics, it's also detecting some handwritten digits
>> 6. Please help me on this step (should I use MNIST or train Tesseract for handwritten digits?)
>> 7.
>> I am not able to remove the border around the digits (please suggest some technique)
>>
>> I am attaching sample images (original image)
>>
>> <https://lh3.googleusercontent.com/-HBJ8UOkvrdU/W08S_INjasI/AAAAAAAAJGk/CjxFF12rNC4GdDj1vPEGEhgTiYQRuLoxACLcBGAs/s1600/crop.jpg>
>>
>> Image after (blur+threshold+MSER+black&white)
>>
>> <https://lh3.googleusercontent.com/-KmbRU5rNgQU/W08TQLxXHAI/AAAAAAAAJGs/KxEZgeqtJ1MqI7lzJAde7bJ9PLpv9w0MACLcBGAs/s1600/mser_extracted_text.jpg>
>>
>> Final result will look like:
>> Account : 123456789054321
>> PAN: CYY******1*
>> Mobile:7777788888
>> Date:17/07/2018
>>
>> Please suggest an alternative for solving this.
>>
>> On Wednesday, July 18, 2018 at 2:48:32 PM UTC+5:30, Lorenzo Blz wrote:
>>>
>>> This is exactly the MNIST problem <http://corochann.com/mnist-dataset-introduction-1138.html>. I would not use tesseract for this. You can download something like this:
>>>
>>> https://github.com/EN10/KerasMNIST
>>>
>>> that comes with pre-trained models too.
>>>
>>> The problem you'll have will be to extract the digits from the boxes, I would use opencv, probably SIFT to align the form. Then you need to delete the black borders, or just leave them there and see what happens. Or repeat the training adding random black boxes around the digits.
>>>
>>> So I would first try to understand how you want to extract the data: what your REAL data looks like. There is no point in training on something different. Unless this is an exercise or an assignment and you'll get the digits already extracted.
>>>
>>> If you want to train tesseract on your images do some blur+threshold so that numbers become black blobs. Run findComponent with opencv. Sort by x and y. Now just iterate over the blobs and crop from the original image and assign the labels (you know the correct one because your sequence is fixed).
>>> Delete or manually straighten the skewed lines with gimp to keep things simple.
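The "sort by x and y, then assign labels from the fixed sequence" step described above can be sketched in plain Python as follows. This is only a sketch under stated assumptions: the boxes are taken to be `(x, y, w, h)` tuples already produced by some component-detection step (for example connected-component statistics from OpenCV), the rows are assumed to be roughly horizontal, and the names `reading_order` and `label_cyclic` as well as the half-median-height row tolerance are hypothetical choices, not something from this thread.

```python
from statistics import median


def reading_order(boxes):
    """Sort (x, y, w, h) boxes top-to-bottom, then left-to-right within a row.

    A box joins the current row when its vertical centre lies within half
    the median box height of the centre of the first box in that row.
    """
    if not boxes:
        return []
    row_tol = median(h for _, _, _, h in boxes) / 2
    rows = []
    for box in sorted(boxes, key=lambda b: b[1] + b[3] / 2):
        centre_y = box[1] + box[3] / 2
        if rows and abs(centre_y - rows[-1]["centre_y"]) <= row_tol:
            rows[-1]["boxes"].append(box)
        else:
            rows.append({"centre_y": centre_y, "boxes": [box]})
    return [b for row in rows for b in sorted(row["boxes"], key=lambda b: b[0])]


def label_cyclic(boxes, first_label=0, n_labels=10):
    """Pair each box, in reading order, with the next label of a repeating
    0..n_labels-1 sequence, returned as the text for its gt.txt file."""
    ordered = reading_order(boxes)
    return [(box, str((first_label + i) % n_labels)) for i, box in enumerate(ordered)]
```

Each returned pair gives the crop rectangle for one image file and the text for the matching gt.txt file. Skewed lines would break the row grouping, which is one reason to delete or straighten them first.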
>>>
>>> blobs_img = blur + threshold(img)
>>> digits = findComponents(blobs_img) and sort
>>> i = 0
>>> for d in digits:
>>>     tiff = crop from original image using d coordinates
>>>     gt.txt = i
>>>     i = (i+1)%10
>>>
>>> Now you have the tiff images and the gt.txt files to run ocr-d.
>>>
>>> Maybe there are some tools to do this by hand, one digit at a time:
>>>
>>> https://github.com/tesseract-ocr/tesseract/wiki/AddOns
>>>
>>> Lorenzo
>>>
>>> 2018-07-18 8:33 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>>
>>>> @Soumik, thanks Soumik, but I am not getting it, please provide me some links to understand it. I am very new to this thing. Can you guide me in creating a text corpus of digits with different fonts?
>>>>
>>>> @Lorenzo, I want to detect digits written in the boxes of the image below, it's a cash deposit form of a bank with a very complex layout. I have to capture the details of account no and PAN no.
>>>>
>>>> <https://lh3.googleusercontent.com/-KlAWj6TbcPI/W07exdGvG6I/AAAAAAAAJGQ/4_32r8dwWVgwCfhM2XT358jkABGAArBoACLcBGAs/s1600/dummy_crop.jpg>
>>>>
>>>> On Wednesday, July 18, 2018 at 11:38:42 AM UTC+5:30, Soumik Ranjan Dasgupta wrote:
>>>>>
>>>>> Try creating a text corpus with only digits using various handwritten fonts that come close to your dataset from fonts.google.com.
>>>>> Use tesstrain.sh for rendering the images, and lstmtraining to train tesseract - you'll achieve a fair accuracy.
>>>>>
>>>>> On Tue, Jul 17, 2018 at 11:38 PM Lorenzo Bolzani <l.bo...@gmail.com> wrote:
>>>>>
>>>>>> Generating the training data is a completely different problem from training tesseract.
>>>>>>
>>>>>> If you want to recognize full words it's better to have full words (or numbers), not individual characters so that the process of splitting the words into characters is done by tesseract.
>>>>>>
>>>>>> Unless you just want to recognize individual characters.
>>>>>> This looks more like a MNIST-like task for a simple neural network.
>>>>>>
>>>>>> I think there are tools to cut images into lines but I've never used one. Or you could do this by programming with opencv.
>>>>>>
>>>>>> There is no tool to generate the gt.txt, you need to write these by hand. In this case your text is very regular so you may just create one line manually (1 2 3 4...) and duplicate that one. Or you could use a very good online ocr service.
>>>>>>
>>>>>> But I'm not convinced this data is good for training. What does the real data that you want to recognize look like? Individual digits or full numbers?
>>>>>>
>>>>>> 2018-07-17 19:17 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>>>>>
>>>>>>> Thank you so much for guiding me.
>>>>>>>
>>>>>>> I have read the links and sub-links provided and, as suggested, I will use OCR-D (https://github.com/OCR-D/ocrd-train) for training.
>>>>>>> I want to know what is the best way to create pairs of [*.tif, *.gt.txt] from a tif image for two or more fonts. Is there any specific tool to generate line *.tif and *.gt.txt files as required by OCR-D?
>>>>>>> I have data like the tiff image below (20 images in total), please guide me.
>>>>>>> Thank you
>>>>>>>
>>>>>>> <https://lh3.googleusercontent.com/-wdzw32GT4fk/W04iwd71ldI/AAAAAAAAJFA/lx3BfSnCujkKmch4oGRSJLFgkKG1uvuTgCLcBGAs/s1600/SCAN_20180716_145539118.tiff>
>>>>>>>
>>>>>>> On Wednesday, July 4, 2018 at 8:20:54 PM UTC+5:30, Joe wrote:
>>>>>>>>
>>>>>>>> Hi everybody!
>>>>>>>>
>>>>>>>> I'm trying this tool https://github.com/OCR-D/ocrd-train/ but without success so far. Tesseract and Leptonica are installed by the scripts.
>>>>>>>> Inspired by the test set provided in that repo, I created pairs of [*.tif, *.gt.txt] with binarized chars and TTF's from two fonts (1869 text lines in total).
>>>>>>>> You can see an example of my set in the attachment, which also contains files created by the training process.
>>>>>>>>
>>>>>>>> My guess is that something is wrong with my data.
>>>>>>>> Sometimes I can see the char train value increasing instead of decreasing and the final error rate is still too high (about 60%).
>>>>>>>>
>>>>>>>> That new training process with LSTM is driving me crazy!
>>>>>>>> I would appreciate it if anyone with experience could take a look at my data set.
>>>>>>>>
>>>>>>>> Joe.
>>>>>>>
>>>>>>> On Tuesday, July 17, 2018 at 9:04:08 PM UTC+5:30, Lorenzo Blz wrote:
>>>>>>>>
>>>>>>>> Have a look at this thread:
>>>>>>>>
>>>>>>>> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ
>>>>>>>>
>>>>>>>> It's easier than it seems, you do not need per character boxes with 4.0, just one per line (that ocr-d automatically generates). If your text is already split into lines you do not have to do anything more.
>>>>>>>>
>>>>>>>> Unicharset and lstmf files are also created by ocr-d.
>>>>>>>>
>>>>>>>> Feel free to ask if you get stuck, now I have this working but it's a bumpy road (lot of assertion failed/segmentation fault if you miss something).
>>>>>>>>
>>>>>>>> Bye
>>>>>>>>
>>>>>>>> Lorenzo
>>>>>>>>
>>>>>>>> 2018-07-17 15:03 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Recently I have been trying to retrain Tesseract 4.0 for recognising handwritten digits. I am following the official page but finding it very difficult. It would be great if someone could elaborate on the steps below:
>>>>>>>>>
>>>>>>>>> - Prepare training text.
>>>>>>>>>   <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951> (I am using jTessBoxEditor for creating box files)
>>>>>>>>> - Render text to image + box file. (Or create hand-made box files for existing image data.)
>>>>>>>>> - Make unicharset file. (Can be partially specified, ie created manually). (I do not know how to do this)
>>>>>>>>> - Make a starter traineddata from the unicharset and optional dictionary data. <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata>
>>>>>>>>> - Run tesseract to process image + box file to make training data set.
>>>>>>>>> - Run training on training data set.
>>>>>>>>> - Combine data files.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>> --
>>>>> Regards,
>>>>> Soumik Ranjan Dasgupta