Hi there,

Any news about training Tesseract 4.0? I have been trying to train my fonts for a few days, but I still can't get anywhere. I'm looking for a compact manual for training from txt/ttf files for new fonts in various languages.
I would appreciate any help.

On Tuesday, February 21, 2017 at 8:06:40 AM UTC+1, timothylegg wrote:
>
> I found this thread interesting, since I tried training Tesseract a
> few years ago and gave up. Has anybody considered writing any
> documentation on this, something that is best explained whenever a user
> can't figure it out from trial and error? I'm open to maybe writing about
> this if there is a need for it, but first, I will have to understand it
> better myself.
>
> On Thursday, February 9, 2017 at 4:08:13 AM UTC-6, Kay-Michael Würzner
> wrote:
>>
>> Thanks also from my side. I'll have a look into the jTessBoxEditor beta,
>> try to set up training and get back to you.
>>
>> Kay
>>
>> On Wednesday, February 8, 2017 at 3:52:58 PM UTC+1, shree wrote:
>>>
>>> Thanks, Quan
>>>
>>> - excuse the brevity, sent from mobile
>>>
>>> On 08-Feb-2017 7:33 PM, "Quan Nguyen" <nguy...@gmail.com> wrote:
>>>
>>>> On Tuesday, February 7, 2017 at 9:34:11 AM UTC-6, shree wrote:
>>>>>
>>>>> For LSTM training, box files need to have an additional line for each
>>>>> text line with the tab character to indicate a new line.
>>>>>
>>>>> If you have existing box/tiff pairs, you can use a box editor (such as
>>>>> jtessboxeditor) and insert a box at the end of each line and add a tab
>>>>> character in it.
>>>>>
>>>>
>>>> The jTessBoxEditor beta version has a new Mark EOL function that does
>>>> just that.
>>>>
>>>>>
>>>>> >On the toolbar, the Character textbox has a built-in conversion
>>>>> function. If you enter U+0009 and hit the Enter key or click on the
>>>>> adjacent Tool icon, the escape sequences will be converted to Unicode.
>>>>> You can also enter the tab character via Alt+09 numpad keys on Windows.
>>>>>
>>>>> Or add a dummy sequence such as @@@ and then replace it with the tab
>>>>> character in a text editor.
>>>>>
>>>>> See attached files as a sample.
>>>>>
>>>>> Then modify tesstrain.sh to copy the box/tiff pairs to the training
>>>>> directory before starting training:
>>>>>
>>>>> mkdir -p ${TRAINING_DIR}
>>>>> tlog "\n=== Starting training for language '${LANG_CODE}'"
>>>>>
>>>>> cp ./*.box "${TRAINING_DIR}/"
>>>>> cp ./*.tif "${TRAINING_DIR}/"
>>>>>
>>>>> On Tue, Feb 7, 2017 at 8:27 PM, Kay-Michael Würzner <wuer...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> +1 for this question. The training documentation for Tesseract 4.0
>>>>>> currently only covers training with font files (synthetic materials).
>>>>>> What is missing is information on training with real data (i.e.
>>>>>> manually aligned ground truth).
>>>>>> Any hints on that matter are greatly appreciated.
>>>>>>
>>>>>> Cheers,
>>>>>> Kay
>>>>>>
>>>>>> On Wednesday, January 18, 2017 at 12:31:54 AM UTC+1,
>>>>>> chen...@huawei.com wrote:
>>>>>>>
>>>>>>> I have a bunch of images containing English words.
>>>>>>> I would like to generate training data from these images and do the
>>>>>>> training.
>>>>>>> How should I do that?
>>>>>>>
>>>>>>> Thanks a lot.
>>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/7bffab95-3e6b-4165-929e-a152f1799703%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/7bffab95-3e6b-4165-929e-a152f1799703%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>>> For more options, visit https://groups.google.com/d/optout.
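For anyone with existing box/tiff pairs: the end-of-line marking shree describes in the thread (one extra box line holding a tab character after each text line) could probably be scripted instead of done by hand in a box editor. Below is only a rough Python sketch, not part of Tesseract or jTessBoxEditor. The names `mark_eol` and `eol_box` are made up for illustration. The line-break heuristic (a glyph whose left edge lies to the left of the previous glyph's, or a page change) assumes left-to-right, single-column text. The one-pixel-wide box placed just right of the last glyph, and the "tab, space, coordinates" layout of the marker line, are also assumptions, so please compare the output with a known-good sample box file before training.

```python
def eol_box(prev):
    """Build the end-of-line marker for the last glyph of a text line.

    Assumption: a 1-pixel-wide box immediately right of the last glyph,
    with the same bottom/top/page, and a tab as the "character".
    """
    left, bottom, right, top, page = prev
    return f"\t {right} {bottom} {right + 1} {top} {page}"


def mark_eol(box_lines):
    """Return box lines with a tab-character box after each text line.

    Each input line is expected as: <glyph> <left> <bottom> <right> <top> <page>
    """
    out = []
    prev = None  # coordinates of the previous glyph on the current text line
    for raw in box_lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split from the right so a glyph that is itself a space or tab survives.
        glyph, *nums = line.rsplit(" ", 5)
        left, bottom, right, top, page = map(int, nums)
        if glyph == "\t":
            # Already marked: keep it and start a fresh text line.
            out.append(line)
            prev = None
            continue
        # Heuristic (assumption): x jumping back to the left, or a new page,
        # means the previous glyph ended a text line.
        if prev is not None and (page != prev[4] or left < prev[0]):
            out.append(eol_box(prev))
        out.append(line)
        prev = (left, bottom, right, top, page)
    if prev is not None:
        out.append(eol_box(prev))  # close the final text line
    return out


sample = [
    "H 10 50 20 70 0",
    "i 22 50 26 70 0",
    "y 10 20 20 40 0",
    "o 22 20 30 40 0",
]
marked = mark_eol(sample)
```

To use it on a real file, read the lines of the .box file (UTF-8), pass them to `mark_eol`, and write the result back joined with newlines. Running it twice should not add duplicate markers, since existing tab lines are passed through unchanged.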