@Soumik, thanks, but I don't follow it yet. Could you point me to some links that explain it? I am very new to this. Can you guide me in creating a text corpus of digits in different fonts?
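For reference, the digits-only training text suggested in the quoted reply below could be generated with a short script. This is only a minimal sketch: the function names, the line and group sizes, and the output file name `digits.training_text` are assumptions made for illustration, not anything prescribed by Tesseract or by this thread.

```python
import random

DIGITS = "0123456789"


def make_digit_corpus(num_lines=1000, min_groups=1, max_groups=6,
                      min_len=1, max_len=12, seed=0):
    """Return a list of text lines that contain only digit groups and spaces.

    All sizes are illustrative defaults (assumptions); tune them so the
    groups resemble the numbers in the real forms.
    """
    rng = random.Random(seed)  # fixed seed keeps the corpus reproducible
    lines = []
    for _ in range(num_lines):
        groups = []
        for _ in range(rng.randint(min_groups, max_groups)):
            length = rng.randint(min_len, max_len)
            groups.append("".join(rng.choice(DIGITS) for _ in range(length)))
        lines.append(" ".join(groups))
    return lines


def write_corpus(path, lines):
    """Write one corpus line per text line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

Calling `write_corpus("digits.training_text", make_digit_corpus())` would write 1000 lines. The resulting file could then serve as the training text that tesstrain.sh renders with the chosen fonts.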
@Lorenzo, I want to detect the digits written in the boxes of the image below. It is a bank's cash deposit form with a very complex layout, and I have to capture the account number and PAN number.

<https://lh3.googleusercontent.com/-KlAWj6TbcPI/W07exdGvG6I/AAAAAAAAJGQ/4_32r8dwWVgwCfhM2XT358jkABGAArBoACLcBGAs/s1600/dummy_crop.jpg>

On Wednesday, July 18, 2018 at 11:38:42 AM UTC+5:30, Soumik Ranjan Dasgupta wrote:
>
> Try creating a text corpus with only digits, using various handwritten
> fonts from fonts.google.com that come close to your dataset.
> Use tesstrain.sh for rendering the images, and lstmtraining to train
> tesseract - you'll achieve a fair accuracy.
>
> On Tue, Jul 17, 2018 at 11:38 PM Lorenzo Bolzani <l.bo...@gmail.com> wrote:
>
>>
>> Generating the training data is a completely different problem from
>> training tesseract.
>>
>> If you want to recognize full words it's better to have full words (or
>> numbers), not individual characters, so that the process of splitting the
>> words into characters is done by tesseract.
>>
>> Unless you just want to recognize individual characters. This looks more
>> like a MNIST-like task for a simple neural network.
>>
>> I think there are tools to cut images into lines but I've never used one.
>> Or you could do this by programming with opencv.
>>
>> There is no tool to generate the gt.txt; you need to write these by hand.
>> In this case your text is very regular so you may just create one line
>> manually (1 2 3 4...) and duplicate that one. Or you could use a very good
>> online ocr service.
>>
>> But I'm not convinced this data is good for training. What does the real
>> data that you want to recognize look like? Individual digits or full
>> numbers?
>>
>> 2018-07-17 19:17 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>
>>> Thank you so much for guiding me.
>>>
>>> I have read the links and sub-links provided and, as suggested, I will use
>>> OCR-D (https://github.com/OCR-D/ocrd-train) for training.
>>> I want to know the best way to create pairs of [*.tif, *.gt.txt] from a
>>> tif image for two or more fonts. Is there any specific tool to generate
>>> the line *.tif and *.gt.txt files required by OCR-D?
>>> I have data like the tiff image below (20 images in total). Please guide me.
>>> Thank you
>>>
>>> <https://lh3.googleusercontent.com/-wdzw32GT4fk/W04iwd71ldI/AAAAAAAAJFA/lx3BfSnCujkKmch4oGRSJLFgkKG1uvuTgCLcBGAs/s1600/SCAN_20180716_145539118.tiff>
>>>
>>> On Wednesday, July 4, 2018 at 8:20:54 PM UTC+5:30, Joe wrote:
>>>>
>>>> Hi everybody!
>>>>
>>>> I'm trying this tool https://github.com/OCR-D/ocrd-train/ but without
>>>> success so far. Tesseract and Leptonica are installed by the scripts.
>>>> Inspired by the test set provided in that repo, I created pairs of
>>>> [*.tif, *.gt.txt] with binarized chars and TTF's from two fonts (1869 text
>>>> lines in total).
>>>> You can see an example of my set in the attachment, which also contains
>>>> files created by the training process.
>>>>
>>>> My guess is that something is wrong with my data.
>>>> Sometimes I can see the char train value increasing instead of
>>>> decreasing, and the final error rate is still too high (about 60%).
>>>>
>>>> That new training process with LSTM is driving me crazy!
>>>> I would appreciate it if anyone with experience could take a look at my
>>>> data set.
>>>>
>>>> Joe.
>>>
>>> On Tuesday, July 17, 2018 at 9:04:08 PM UTC+5:30, Lorenzo Blz wrote:
>>>>
>>>> Have a look at this thread:
>>>>
>>>> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ
>>>>
>>>> It's easier than it seems, you do not need per character boxes with
>>>> 4.0, just one per line (that ocr-d automatically generates). If your text
>>>> is already split into lines you do not have to do anything more.
>>>>
>>>> Unicharset and lstmf files are also created by ocr-d.
>>>>
>>>> Feel free to ask if you get stuck, now I have this working but it's a
>>>> bumpy road (lot of assertion failed/segmentation fault if you miss
>>>> something).
>>>>
>>>> Bye
>>>>
>>>> Lorenzo
>>>>
>>>> 2018-07-17 15:03 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>>>
>>>>> Hi,
>>>>>
>>>>> Recently I have been trying to retrain Tesseract 4.0 to recognise
>>>>> handwritten digits. I am following the official page but finding it very
>>>>> difficult. It would be great if someone could elaborate on the steps below:
>>>>>
>>>>>    - Prepare training text.
>>>>>      <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>
>>>>>      (I am using jTessBoxEditor for creating box files.)
>>>>>    - Render text to image + box file. (Or create hand-made box files
>>>>>      for existing image data.)
>>>>>    - Make unicharset file. (Can be partially specified, i.e. created
>>>>>      manually.) (I do not know how to do this.)
>>>>>    - Make a starter traineddata from the unicharset and optional
>>>>>      dictionary data.
>>>>>      <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata>
>>>>>    - Run tesseract to process image + box file to make training data
>>>>>      set.
>>>>>    - Run training on training data set.
>>>>>    - Combine data files.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com
>>>>> For more options, visit https://groups.google.com/d/optout.
>>
>
> --
> Regards,
> Soumik Ranjan Dasgupta

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ce16eecf-6f30-4e1f-b397-85beabc18301%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.