I found this thread interesting, since I tried training Tesseract a few years ago and gave up. Has anybody considered writing documentation on this? It seems like something that is best explained up front, because users can't easily figure it out by trial and error. I'm open to writing about it if there is a need, but first I will have to understand it better myself.
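To check my own understanding of the dummy-sequence suggestion in the quoted message below (type something like @@@ into the end-of-line box, then swap it for a tab afterwards), here is a rough sketch of how that replacement could be scripted instead of done by hand in a text editor. The function names are my own, and I am assuming each box line starts with the symbol followed by a space and then the coordinates and page number; I have not checked this against the actual training tools, so treat it as a starting point only.

```python
PLACEHOLDER = "@@@"  # dummy sequence typed into the end-of-line box (assumed)


def mark_eol(box_text, placeholder=PLACEHOLDER):
    """Replace the placeholder symbol at the start of a box line with a tab.

    Assumes each box line looks like "<symbol> <coordinates...> <page>".
    Lines whose symbol is not exactly the placeholder are left untouched.
    """
    converted = []
    for line in box_text.split("\n"):
        symbol, sep, rest = line.partition(" ")
        if sep and symbol == placeholder:
            # Keep the coordinates as they are, swap only the symbol.
            line = "\t" + sep + rest
        converted.append(line)
    return "\n".join(converted)


def convert_file(path, placeholder=PLACEHOLDER):
    """Rewrite one box file in place, keeping its line endings unchanged."""
    with open(path, encoding="utf-8", newline="") as handle:
        text = handle.read()
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(mark_eol(text, placeholder))
```

Calling convert_file() on each .box file before copying the box/tiff pairs into the training directory should give the same result as the manual search and replace, provided the placeholder never occurs as a real symbol in the ground truth.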
On Thursday, February 9, 2017 at 4:08:13 AM UTC-6, Kay-Michael Würzner wrote:
>
> Thanks also from my side. I'll have a look into the jTessBoxEditor beta,
> try to setup training and get back to you.
>
> Kay
>
> On Wednesday, February 8, 2017 at 3:52:58 PM UTC+1, shree wrote:
>>
>> Thanks, Quan
>>
>> - excuse the brevity, sent from mobile
>>
>> On 08-Feb-2017 7:33 PM, "Quan Nguyen" <nguy...@gmail.com> wrote:
>>
>>> On Tuesday, February 7, 2017 at 9:34:11 AM UTC-6, shree wrote:
>>>>
>>>> For LSTM training, box files need to have an additional line for each
>>>> text line with the tab character to indicate a new line.
>>>>
>>>> If you have existing box/tiff pairs, you can use a box editor (such as
>>>> jtessboxeditor) and insert a box at end of each line and add a tab
>>>> character in it.
>>>
>>> The jTessBoxEditor beta version has a new Mark EOL function that does
>>> just that.
>>>
>>>> On the toolbar, the Character textbox has a built-in conversion
>>>> function. If you enter U+0009 and hit Enter key or click on the adjacent
>>>> Tool icon, the escape sequences will be converted to Unicode. You can also
>>>> enter the tab character via Alt+09 numpad keys on Windows.
>>>>
>>>> or add a dummy sequence such as @@@ and then replace to tab character
>>>> in a text editor.
>>>>
>>>> See attached files as a sample.
>>>>
>>>> Then modify tesstrain.sh to copy the box tiff pairs to the training
>>>> directory before starting training
>>>>
>>>> mkdir -p ${TRAINING_DIR}
>>>> tlog "\n=== Starting training for language '${LANG_CODE}'"
>>>>
>>>> cp ./*.box "${TRAINING_DIR}/"
>>>> cp ./*.tif "${TRAINING_DIR}/"
>>>>
>>>> On Tue, Feb 7, 2017 at 8:27 PM, Kay-Michael Würzner <wuer...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1 for this question. The training documentation for Tesseract 4.0 by
>>>>> now only covers training with font files (synthetic materials). What is
>>>>> missing is information on training with real data (i.e. manually aligned
>>>>> ground truth).
>>>>> Any hints on that matter are greatly appreciated.
>>>>>
>>>>> Cheers,
>>>>> Kay
>>>>>
>>>>> On Wednesday, January 18, 2017 at 12:31:54 AM UTC+1,
>>>>> chen...@huawei.com wrote:
>>>>>>
>>>>>> I have a bunch of images, containing English words.
>>>>>> I would like to generate training data by these images, and do the
>>>>>> training.
>>>>>> How should I do?
>>>>>>
>>>>>> Thanks a lot.
>>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/7bffab95-3e6b-4165-929e-a152f1799703%40googlegroups.com
>>>>> For more options, visit https://groups.google.com/d/optout.