I normally use a custom Python script to generate the training text. I'm attaching a sample text corpus containing only the digits 1234.
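For anyone who wants a starting point, here is a minimal sketch of what such a generator could look like. It is not the script mentioned above, just an illustration under assumptions: the alphabet "1234" is taken from the attached sample, while the function names (`make_digit_corpus`, `write_corpus`), the line count, the group lengths, and the output file name are hypothetical choices you would adapt to your own data.

```python
import random


def make_digit_corpus(num_lines=1000, alphabet="1234", min_len=1, max_len=10,
                      groups_per_line=6, seed=0):
    """Return a list of text lines built only from characters in `alphabet`.

    All parameters are illustrative assumptions, not values from the thread.
    """
    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    lines = []
    for _ in range(num_lines):
        # Each line is a handful of space-separated digit groups of random length.
        groups = [
            "".join(rng.choice(alphabet)
                    for _ in range(rng.randint(min_len, max_len)))
            for _ in range(groups_per_line)
        ]
        lines.append(" ".join(groups))
    return lines


def write_corpus(path, lines):
    """Write one corpus line per text line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Hypothetical output name, matching the attachment's file name.
    write_corpus("eng.training_text", make_digit_corpus())
```

Writing whole groups of digits rather than single characters follows the advice further down the thread to train on full numbers and let tesseract handle the splitting.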
On Wed, Jul 18, 2018 at 12:04 PM Ramakant Kushwaha <ramakant.sing...@gmail.com> wrote:

> @Soumik, thanks, but I am not getting it. Please provide me some links to
> understand it. I am very new to this. Can you guide me in creating a text
> corpus of digits with different fonts?
>
> @Lorenzo, I want to detect digits written in the boxes of the image below.
> It's a cash deposit form of a bank with a very complex layout. I have to
> capture the details of the account no. and PAN no.
>
> <https://lh3.googleusercontent.com/-KlAWj6TbcPI/W07exdGvG6I/AAAAAAAAJGQ/4_32r8dwWVgwCfhM2XT358jkABGAArBoACLcBGAs/s1600/dummy_crop.jpg>
>
> On Wednesday, July 18, 2018 at 11:38:42 AM UTC+5:30, Soumik Ranjan Dasgupta wrote:
>>
>> Try creating a text corpus with only digits using various handwritten
>> fonts that come close to your dataset from fonts.google.com.
>> Use tesstrain.sh for rendering the images, and lstmtraining to train
>> tesseract - you'll achieve a fair accuracy.
>>
>> On Tue, Jul 17, 2018 at 11:38 PM Lorenzo Bolzani <l.bo...@gmail.com> wrote:
>>
>>> Generating the training data is a completely different problem from
>>> training tesseract.
>>>
>>> If you want to recognize full words it's better to have full words (or
>>> numbers), not individual characters, so that the process of splitting the
>>> words into characters is done by tesseract.
>>>
>>> Unless you just want to recognize individual characters. This looks more
>>> like an MNIST-like task for a simple neural network.
>>>
>>> I think there are tools to cut images into lines but I've never used
>>> one. Or you could do this by programming with opencv.
>>>
>>> There is no tool to generate the gt.txt; you need to write these by hand.
>>> In this case your text is very regular so you may just create one line
>>> manually (1 2 3 4...) and duplicate that one. Or you could use a very good
>>> online ocr service.
>>>
>>> But I'm not convinced this data is good for training. What does the real
>>> data that you want to recognize look like? Individual digits or full
>>> numbers?
>>>
>>> 2018-07-17 19:17 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>>
>>>> Thank you so much for guiding me.
>>>>
>>>> I have read the links and sub-links provided and, as suggested, I will use
>>>> OCR-D (https://github.com/OCR-D/ocrd-train) for training.
>>>> I want to know the best way to create pairs of [*.tif, *.gt.txt] from a
>>>> tif image for two or more fonts. Is there any specific tool to generate
>>>> the line *.tif and *.gt.txt files required by OCR-D?
>>>> I have data like the tiff image below (20 images in total). Please guide me.
>>>> Thank you
>>>>
>>>> <https://lh3.googleusercontent.com/-wdzw32GT4fk/W04iwd71ldI/AAAAAAAAJFA/lx3BfSnCujkKmch4oGRSJLFgkKG1uvuTgCLcBGAs/s1600/SCAN_20180716_145539118.tiff>
>>>>
>>>> On Wednesday, July 4, 2018 at 8:20:54 PM UTC+5:30, Joe wrote:
>>>>>
>>>>> Hi everybody!
>>>>>
>>>>> I'm trying this tool https://github.com/OCR-D/ocrd-train/ but without
>>>>> success so far. Tesseract and Leptonica are installed by the scripts.
>>>>> Inspired by the test set provided in that repo, I created pairs of
>>>>> [*.tif, *.gt.txt] with binarized chars and TTFs from two fonts (1869 text
>>>>> lines in total).
>>>>> You can see an example of my set in the attachment, which also contains
>>>>> files created by the training process.
>>>>>
>>>>> My guess is that something is wrong with my data.
>>>>> Sometimes I can see the char train value increasing instead of
>>>>> decreasing, and the final error rate is still too high (about 60%).
>>>>>
>>>>> That new training process with LSTM is driving me crazy!
>>>>> I would appreciate it if anyone with experience could take a look at my
>>>>> data set.
>>>>>
>>>>> Joe.
>>>>
>>>> On Tuesday, July 17, 2018 at 9:04:08 PM UTC+5:30, Lorenzo Blz wrote:
>>>>>
>>>>> Have a look at this thread:
>>>>>
>>>>> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ
>>>>>
>>>>> It's easier than it seems: you do not need per-character boxes with
>>>>> 4.0, just one per line (which ocr-d generates automatically). If your text
>>>>> is already split into lines you do not have to do anything more.
>>>>>
>>>>> Unicharset and lstmf files are also created by ocr-d.
>>>>>
>>>>> Feel free to ask if you get stuck. I have this working now, but it's a
>>>>> bumpy road (lots of assertion failed/segmentation fault if you miss
>>>>> something).
>>>>>
>>>>> Bye
>>>>>
>>>>> Lorenzo
>>>>>
>>>>> 2018-07-17 15:03 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Recently I have been trying to retrain Tesseract 4.0 for recognising
>>>>>> handwritten digits. I am following the official page but finding it very
>>>>>> difficult. It would be great if someone could elaborate on the steps below:
>>>>>>
>>>>>> - Prepare training text.
>>>>>>   <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>
>>>>>>   (I am using jTessBoxEditor for creating box files.)
>>>>>> - Render text to image + box file. (Or create hand-made box files
>>>>>>   for existing image data.)
>>>>>> - Make unicharset file. (Can be partially specified, i.e. created
>>>>>>   manually.) (I do not know how to do this.)
>>>>>> - Make a starter traineddata from the unicharset and optional
>>>>>>   dictionary data.
>>>>>>   <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata>
>>>>>> - Run tesseract to process image + box file to make training data
>>>>>>   set.
>>>>>> - Run training on training data set.
>>>>>> - Combine data files.
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>
>> --
>> Regards,
>> Soumik Ranjan Dasgupta

--
Regards,
Soumik Ranjan Dasgupta

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAB_aDAeLYPAo-dgrGNbinxS0tywm0hCvnEBDLPimx21vkRN7oA%40mail.gmail.com.
Attachment: eng.training_text (binary data)