I think this Google Group is having technical trouble. I got an email about a new post from Menelik Berhan, but his message doesn't appear on the web. He said:

| This might be helpful: https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html
| And also some details in: https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#making-box-files

That matches what Tom said. Very helpful! To summarize:

- Box files always contain one line per character
- There are two kinds of box files: *per-character* and *per-line* box files
- per-character box files have separate coordinates for each character
- per-line box files still have one line per character, but the coordinates are the same on every line and represent the bounding box of the entire text line

The training code, specifically *Tesseract::TrainFromBoxes()*, *should* accept either format.

As mentioned in this and other posts, the box identification for Chinese seems to be quite broken. Like this:

[image: Screenshot 2024-08-05 at 17.56.12.png]

That might or might not be a training issue, but I will try retraining the model using *per-line* box files and see if that makes any difference.

Thanks to all.

On Friday, September 6, 2024 at 11:18:44 PM UTC+8 tfmo...@gmail.com wrote:

> That's weird. I posted an answer to this thread yesterday and now, in its
> place, Google Groups says "Message has been deleted." Let me try again...
>
> This page
> https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html
> says "lstmbox - Generated by tesseract using lstmbox config from image
> files - each char uses coordinates of its entire line. This format is also
> generated by the tesstrain makefile."
>
> Tom
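
In case it helps anyone checking which kind of box file they have, here is a minimal Python sketch of the per-character versus per-line distinction summarized above. It assumes each line has the form `<symbol> <left> <bottom> <right> <top> <page>`, and that per-line (lstmbox-style) files may contain a tab-symbol entry marking the end of a text line. The names `parse_box_text` and `classify_box_layout` are hypothetical helpers, not part of Tesseract, and the "at least half of neighbouring glyphs share a box" rule is only a heuristic I made up for illustration.

```python
from typing import List, NamedTuple


class BoxEntry(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_text(text: str) -> List[BoxEntry]:
    """Parse box-file text into entries, skipping lines that do not fit.

    Assumed line layout: <symbol> <left> <bottom> <right> <top> <page>.
    """
    entries = []
    for raw in text.split("\n"):
        raw = raw.rstrip("\r")
        if not raw.strip():
            continue
        # Split from the right so a symbol that is itself a space or a tab
        # is preserved as the first field.
        parts = raw.rsplit(" ", 5)
        if len(parts) != 6:
            continue
        symbol, *numbers = parts
        try:
            left, bottom, right, top, page = (int(n) for n in numbers)
        except ValueError:
            continue
        entries.append(BoxEntry(symbol, left, bottom, right, top, page))
    return entries


def classify_box_layout(entries: List[BoxEntry]) -> str:
    """Guess whether the entries look per-character or per-line.

    Heuristic (an assumption, not Tesseract's logic): if at least half of
    the neighbouring glyph pairs share identical coordinates, call it
    per-line.
    """
    # Ignore whitespace symbols such as spaces and tab end-of-line markers.
    glyphs = [e for e in entries if e.symbol.strip()]
    if len(glyphs) < 2:
        return "unknown"
    pairs = list(zip(glyphs, glyphs[1:]))
    repeats = sum(1 for prev, cur in pairs if prev[1:] == cur[1:])
    return "per-line" if repeats * 2 >= len(pairs) else "per-character"
```

To use it, read a `.box` file as UTF-8 text and pass the contents through `classify_box_layout(parse_box_text(text))`. A file with a single glyph comes back as "unknown", since there is nothing to compare.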