It's clear now. Thanks for the information.

Jun
On Monday, November 12, 2018 at 7:38:19 PM UTC+8, Lorenzo Blz wrote:
> On Mon, Nov 12, 2018 at 11:53, <fav...@gmail.com> wrote:
>
>> Does that mean we can label existing images with text-line boxes instead
>> of individual char boxes in the current tesseract 4.0? I checked the box
>> files generated by the training process and found that char boxes were
>> still there.
>
> Yes, it is confusing. I use ocrd-train
> <https://github.com/OCR-D/ocrd-train> and it generates boxes for the
> whole lines.
>
> This is an example generated by a small python script from ocrd-train:
>
> M 0 0 244 50 0
> I 0 0 244 50 0
> T 0 0 244 50 0
> - 0 0 244 50 0
> U 0 0 244 50 0
> C 0 0 244 50 0
> O 0 0 244 50 0
> 244 50 245 51 0
>
> Ground truth is MIT-UCO, image size is 244x50. It lists each individual
> character, but the box is always the full line for all of them.
>
> I use pre-cut images containing single lines, which is why the box covers
> the whole image. The same thing should work for a large image with
> multiple lines (but I never did it myself).
>
> You could try to use hocr to split the file into lines, see here:
> https://github.com/OCR-D/ocrd-train/issues/7#issuecomment-419714852
>
> BTW the coords look like: left, top, right, bottom and not <left> <bottom>
> <right> <top> as in the docs
> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-training-data>:
> am I missing something?
>
> Bye
>
> Lorenzo
>
>> Thanks,
>> Jun
>>
>> On Monday, November 12, 2018 at 5:26:48 PM UTC+8, Lorenzo Blz wrote:
>>
>>> Tesseract 4.x uses lines, not chars.
>>>
>>> Bye
>>>
>>> Lorenzo
>>>
>>> On Mon, Nov 12, 2018 at 05:42, <fav...@gmail.com> wrote:
>>>
>>>> Dear All,
>>>>
>>>> Currently, tesseract training is based on a pair of files (tiff and
>>>> box). It's not easy to make a char-level box file if we try to train
>>>> on scanned document images that were not generated by programs.
>>>> My question is whether there is a plan to support line-level training
>>>> in the future? Thanks!
>>>>
>>>> Regards,
>>>> Jun

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To post to this group, send email to tesseract-ocr@googlegroups.com. Visit this group at https://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/45e79c69-dd7f-461f-a7d6-53c912c2c689%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
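
For reference, the line-level box layout Lorenzo describes in the quoted message (one entry per character, every entry carrying the box of the whole line, then a closing end-of-line entry) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual ocrd-train script: the function name `line_box_lines` is made up here, and the leading tab on the closing entry is an assumption, since the first field of that entry is not visible in the quoted example.

```python
def line_box_lines(text, width, height, page=0):
    """Build box-file lines for one text line that fills a width x height image.

    Hypothetical helper for illustration; not taken from ocrd-train.
    """
    # One entry per character, each with the same full-line box (0, 0, width, height).
    lines = [f"{ch} 0 0 {width} {height} {page}" for ch in text]
    # Closing end-of-line entry. ASSUMPTION: it starts with a tab character;
    # the numbers follow the quoted example (width, height, width+1, height+1).
    lines.append(f"\t {width} {height} {width + 1} {height + 1} {page}")
    return lines


if __name__ == "__main__":
    # Ground truth and image size from the quoted example.
    print("\n".join(line_box_lines("MIT-UCO", 244, 50)))
```

Note that when the box spans the whole image, reading the four numbers as left, top, right, bottom or as left, bottom, right, top gives the same values (0 0 244 50), so this example alone cannot tell the two conventions apart.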