I found this thread interesting, since I tried training Tesseract a few 
years ago and gave up.  Has anybody considered writing documentation on 
this?  It seems like something that is best explained in writing when users 
can't figure it out by trial and error.  I'm open to writing about it if 
there is a need, but first I will have to understand it better myself.


On Thursday, February 9, 2017 at 4:08:13 AM UTC-6, Kay-Michael Würzner 
wrote:
>
> Thanks also from my side. I'll have a look at the jTessBoxEditor beta, 
> try to set up training and get back to you.
>
> Kay
>
> On Wednesday, February 8, 2017 at 3:52:58 PM UTC+1, shree wrote:
>>
>> Thanks, Quan
>>
>> - excuse the brevity, sent from mobile
>>
>> On 08-Feb-2017 7:33 PM, "Quan Nguyen" <nguy...@gmail.com> wrote:
>>
>>>
>>>
>>> On Tuesday, February 7, 2017 at 9:34:11 AM UTC-6, shree wrote:
>>>>
>>>> For LSTM training, box files need an additional entry at the end of 
>>>> each text line, with a tab character as its symbol, to mark the line 
>>>> break.
>>>>
>>>> If you have existing box/tiff pairs, you can use a box editor (such as 
>>>> jTessBoxEditor), insert a box at the end of each line and put a tab 
>>>> character in it.
>>>>
>>>
>>> The jTessBoxEditor beta version has a new Mark EOL function that does 
>>> just that.
>>>  
>>>
>>>>
>>>> >On the toolbar, the Character textbox has a built-in conversion 
>>>> function. If you enter U+0009 and hit the Enter key or click the adjacent 
>>>> Tool icon, the escape sequence is converted to the Unicode character. You 
>>>> can also enter the tab character via the Alt+09 numpad keys on Windows.
>>>>
>>>> Or add a dummy sequence such as @@@ and then replace it with a tab 
>>>> character in a text editor.
>>>>
>>>> See attached files as a sample.
>>>>
>>>> Then modify tesstrain.sh to copy the box/tiff pairs to the training 
>>>> directory before training starts:
>>>>
>>>> mkdir -p "${TRAINING_DIR}"
>>>> tlog "\n=== Starting training for language '${LANG_CODE}'"
>>>>
>>>> # Copy the manually prepared box/tiff pairs from the current directory.
>>>> cp ./*.box "${TRAINING_DIR}/"
>>>> cp ./*.tif "${TRAINING_DIR}/"
>>>>
>>>>
>>>> On Tue, Feb 7, 2017 at 8:27 PM, Kay-Michael Würzner <wuer...@gmail.com> 
>>>> wrote:
>>>>
>>>>> +1 for this question. The training documentation for Tesseract 4.0 
>>>>> currently only covers training with font files (synthetic material). 
>>>>> What is missing is information on training with real data (i.e. manually 
>>>>> aligned ground truth).
>>>>> Any hints on that matter are greatly appreciated.
>>>>>
>>>>> Cheers,
>>>>> Kay
>>>>>
>>>>> On Wednesday, January 18, 2017 at 12:31:54 AM UTC+1, 
>>>>> chen...@huawei.com wrote:
>>>>>>
>>>>>> I have a bunch of images containing English words.
>>>>>> I would like to generate training data from these images and run the 
>>>>>> training.
>>>>>> How should I do this?
>>>>>>
>>>>>> Thanks a lot.
>>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/7bffab95-3e6b-4165-929e-a152f1799703%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/7bffab95-3e6b-4165-929e-a152f1799703%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>
>>
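
For anyone taking the placeholder route shree describes above (type a dummy symbol such as @@@ into each end-of-line box, then swap it for a tab), the replacement step could be scripted instead of done by hand in a text editor. This is only a minimal sketch: it assumes every box line has the form "<symbol> <left> <bottom> <right> <top> <page>", and the function name mark_eol_tabs and the file names in the usage comment are made up for illustration; they are not part of Tesseract or jTessBoxEditor.

```shell
#!/usr/bin/env bash
# Sketch: turn placeholder end-of-line boxes into tab boxes.
# Assumption: each box line is "<symbol> <left> <bottom> <right> <top> <page>"
# and the placeholder (default "@@@") is the symbol of every end-of-line box.

# mark_eol_tabs is a hypothetical helper name, not a Tesseract tool.
# It prints the converted box file to standard output.
mark_eol_tabs() {
  local box_file=$1
  local placeholder=${2:-@@@}
  local tab
  tab=$(printf '\t')
  # Rewrite the placeholder only when it is the whole symbol field
  # at the start of a line, so other text is left alone.
  sed "s/^${placeholder} /${tab} /" "$box_file"
}

# Usage (file names are examples only):
#   mark_eol_tabs eng.sample.exp0.box > eng.sample.exp0.tabbed.box
```

The output goes to a new file on purpose, so the original box file stays available for comparison in a box editor before it is copied into the training directory.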

