It's clear now. Thanks for the information.

Jun

On Monday, November 12, 2018 at 7:38:19 PM UTC+8, Lorenzo Blz wrote:

> On Mon, 12 Nov 2018 at 11:53, <fav...@gmail.com> wrote:
>
>> Does that mean we can label existing images with text-line boxes instead 
>> of individual character boxes in the current Tesseract 4.0? I checked the 
>> box files generated by the training process and found that the character 
>> boxes were still there.
>>
>
> Yes, it is confusing. I use ocrd-train 
> <https://github.com/OCR-D/ocrd-train>, and it generates boxes for 
> whole lines.
>
> This is an example generated by a small Python script from ocrd-train:
>
> M 0 0 244 50 0
> I 0 0 244 50 0
> T 0 0 244 50 0
> - 0 0 244 50 0
> U 0 0 244 50 0
> C 0 0 244 50 0
> O 0 0 244 50 0
>      244 50 245 51 0
>
> The ground truth is MIT-UCO and the image size is 244x50. The file still 
> lists each character individually, but every character gets the same box: 
> the full line.
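The layout of that listing can be reproduced with a few lines of Python. This is a minimal sketch, not the actual ocrd-train script: the function name `line_box_entries` is made up for illustration, and it assumes that the final entry starts with a tab character (rendered as blank space in the listing) that marks the end of the text line.

```python
def line_box_entries(text, width, height):
    """Return box-file lines for a single-line image of size width x height.

    Every character of the ground truth gets the same box: the full image.
    """
    entries = [f"{ch} 0 0 {width} {height} 0" for ch in text]
    # Assumption: the closing entry is a tab followed by a 1x1 box placed
    # just past the image corner, acting as an end-of-line marker.
    entries.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return entries


if __name__ == "__main__":
    # Same ground truth and image size as the example in the message.
    print("\n".join(line_box_entries("MIT-UCO", 244, 50)))
```

For a page with several lines, the same idea would presumably apply per line, with each line's own bounding box in place of the full image.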
>
> I use pre-cut images containing single lines, which is why the box covers 
> the whole image. The same thing should work for a large image with multiple 
> lines (but I have never done it myself).
>
> You could try using hOCR to split the file into lines; see here: 
> https://github.com/OCR-D/ocrd-train/issues/7#issuecomment-419714852
>
>
> BTW, the coords look like left, top, right, bottom and not <left> <bottom> 
> <right> <top> as in the docs 
> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-training-data>. 
> Am I missing something?
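On the coordinate order: a box that spans the whole image cannot tell the two conventions apart, because both give the same four numbers. The sketch below assumes the documented box-file convention (origin at the bottom-left corner, order left, bottom, right, top) and converts from top-left-origin coordinates such as those found in hOCR `bbox` values. The helper name `top_left_to_box` is hypothetical.

```python
def top_left_to_box(left, top, right, bottom, image_height):
    """Convert (left, top, right, bottom) measured from the top-left corner
    into box-file order (left, bottom, right, top) measured from the
    bottom-left corner. Only the vertical axis is flipped."""
    return (left, image_height - bottom, right, image_height - top)


if __name__ == "__main__":
    # Full-image box on a 244x50 image: identical under both conventions.
    print(top_left_to_box(0, 0, 244, 50, image_height=50))   # (0, 0, 244, 50)
    # A smaller box shows the difference between the two conventions.
    print(top_left_to_box(10, 5, 100, 20, image_height=50))  # (10, 30, 100, 45)
```

So the listing with `0 0 244 50` is consistent with either reading; a box that does not touch both the top and bottom edges would be needed to see which one is in use.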
>
>
> Bye
>
> Lorenzo
>
>
>  
>
>>
>> Thanks,
>> Jun
>>
>> On Monday, November 12, 2018 at 5:26:48 PM UTC+8, Lorenzo Blz wrote:
>>
>>>
>>> Tesseract 4.x uses lines, not chars.
>>>
>>>
>>> Bye
>>>
>>> Lorenzo
>>>
>>> On Mon, 12 Nov 2018 at 05:42, <fav...@gmail.com> wrote:
>>>
>>>> Dear All,
>>>>
>>>> Currently, Tesseract training is based on a pair of files (TIFF and 
>>>> box). It is not easy to make a character-level box file when training on 
>>>> scanned document images that were not generated by a program.
>>>> My question is whether there is a plan to support line-level training in 
>>>> the future. Thanks!
>>>>
>>>> Regards,
>>>> Jun
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/94b51a88-0b6b-4382-8551-430e5fe3841f%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>
>

