Thank you so much for guiding me.

I have read the links and sub-links provided and, as suggested, I will use OCR-D (https://github.com/OCR-D/ocrd-train) for training. I want to know the best way to create pairs of [*.tif, *.gt.txt] from TIFF images for two or more fonts. Is there any specific tool to generate the line *.tif and *.gt.txt files required by OCR-D?

I have data like the TIFF image below (20 images in total). Please guide me.

Thank you
<https://lh3.googleusercontent.com/-wdzw32GT4fk/W04iwd71ldI/AAAAAAAAJFA/lx3BfSnCujkKmch4oGRSJLFgkKG1uvuTgCLcBGAs/s1600/SCAN_20180716_145539118.tiff>

On Wednesday, July 4, 2018 at 8:20:54 PM UTC+5:30, Joe wrote:
> Hi everybody!
>
> I'm trying this tool https://github.com/OCR-D/ocrd-train/ but without
> success so far. Tesseract and Leptonica are installed by the scripts.
> Inspired by the test set provided in that repo, I created pairs of [*.tif,
> *.gt.txt] with binarized chars and TTF's from two fonts (1869 text lines in
> total).
> You can see an example of my set in attachment that also contains files
> created by the training process.
>
> My guess is that something is wrong with my data.
> Sometimes I can see the char train value increasing instead of decreasing
> and the final error rate still too high (about 60%).
>
> That new training process with LSTM is driving me crazy!
> I would appreciate if anyone with experience could take a look to my data
> set.
>
> Joe.

On Tuesday, July 17, 2018 at 9:04:08 PM UTC+5:30, Lorenzo Blz wrote:
> Have a look at this thread:
>
> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ
>
> It's easier than it seems, you do not need per character boxes with 4.0,
> just one per line (that ocr-d automatically generates). If your text is
> already split into lines you do not have to do anything more.
>
> Unicharset and lstmf files are also created by ocr-d.
>
> Feel free to ask if you get stuck, now I have this working but it's a
> bumpy road (lot of assertion failed/segmentation fault if you miss
> something).
>
> Bye
>
> Lorenzo
>
> 2018-07-17 15:03 GMT+02:00 Ramakant Kushwaha <ramakant...@gmail.com>:
>
>> Hi,
>>
>> Recently I trying to retrain Tesseract 4.0 for recognising handwritten
>> digits. I am following official page but finding it very difficult. It
>> would be great if someone can elaborate below steps
>>
>> - Prepare training text.
>>   <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>
>>   (I am using jTessBoxEditor for creating box files)
>> - Render text to image + box file. (Or create hand-made box files for
>>   existing image data.)
>> - Make unicharset file. (Can be partially specified, ie created
>>   manually). (Do not how to do this)
>> - Make a starter traineddata from the unicharset and optional dictionary
>>   data.
>>   <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata>
>> - Run tesseract to process image + box file to make training data set.
>> - Run training on training data set.
>> - Combine data files.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-oc...@googlegroups.com.
>> To post to this group, send email to tesser...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/97e29010-f602-42e9-b3b8-121fb151a49e%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
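
For anyone landing on this thread with the same question about the [*.tif, *.gt.txt] pairs: as far as I understand ocrd-train, every line image `X.tif` needs a single-line transcription named `X.gt.txt` next to it. Below is a minimal Python sketch of only that pairing step. It assumes the page scans have already been cut into one image per text line (it does no line segmentation itself) and that you have typed one transcription per line image. The function name `write_ground_truth_pairs` and all file names are made up for illustration and are not part of ocrd-train.

```python
from pathlib import Path
import shutil
import tempfile


def write_ground_truth_pairs(line_images, transcriptions, out_dir):
    """Copy each line image to out_dir as <stem>.tif and write <stem>.gt.txt.

    Hypothetical helper: assumes the images are already cropped to single
    text lines and that out_dir is a different folder from the sources.
    """
    if len(line_images) != len(transcriptions):
        raise ValueError("need exactly one transcription per line image")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for image_path, text in zip(line_images, transcriptions):
        image_path = Path(image_path)
        text = text.strip()
        # Each ground-truth file holds exactly one line of text.
        if not text or "\n" in text:
            raise ValueError(
                f"transcription for {image_path.name} must be one non-empty line"
            )
        # Same base name for both files; only the extension differs.
        target_image = out_dir / (image_path.stem + ".tif")
        target_text = out_dir / (image_path.stem + ".gt.txt")
        shutil.copyfile(image_path, target_image)
        target_text.write_text(text + "\n", encoding="utf-8")
        written.append((target_image, target_text))
    return written


if __name__ == "__main__":
    # Demo with placeholder bytes standing in for real line images.
    src = Path(tempfile.mkdtemp())
    (src / "fontA_0001.tif").write_bytes(b"II*\x00")
    pairs = write_ground_truth_pairs(
        [src / "fontA_0001.tif"], ["12345"], src / "ground-truth"
    )
    for image_file, text_file in pairs:
        print(image_file.name, text_file.name)
```

With two or more fonts, putting the font in the base name (for example `fontA_0001`, `fontB_0001`) keeps the pairs from colliding when they all go into the one folder ocrd-train reads its ground truth from.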