I think this Google Group is having technical troubles.  I got an email 
about a new post from Menelik Berhan but his message doesn't appear on the 
web.  He said:



| This might be helpful:
| https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html
| And also some details in:
| https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#making-box-files

Same as what Tom said. Very helpful!

To summarize:
- Box files always contain one line per character
- There are two kinds of box files: *per-character* and *per-line* box files
- per-character box files have separate coordinates for each character
- per-line box files still have one line per character, but every character 
on a text line shares the same coordinates, which are the bounding box of 
that entire line
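
To make the difference concrete, here is a rough Python sketch that reads 
box-file text and guesses which of the two kinds it is. It assumes the usual 
"<symbol> <left> <bottom> <right> <top> <page>" layout on each line. The names 
parse_box_text and classify_boxes, and the 50% threshold, are my own inventions 
for illustration and are not part of Tesseract.

```python
from collections import Counter
from typing import List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_text(text: str) -> List[Box]:
    """Parse box-file text, assuming '<symbol> <l> <b> <r> <t> <page>' lines."""
    boxes = []
    for raw in text.split("\n"):
        raw = raw.rstrip("\r")
        if not raw.strip():
            continue  # skip empty lines
        # Split from the right so a symbol that is itself a space or a tab
        # (as can appear in per-line files) is kept intact.
        parts = raw.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # not a box entry in the assumed layout
        symbol, *numbers = parts
        try:
            left, bottom, right, top, page = (int(n) for n in numbers)
        except ValueError:
            continue  # non-numeric fields: not the assumed layout
        boxes.append(Box(symbol, left, bottom, right, top, page))
    return boxes


def classify_boxes(boxes: List[Box]) -> str:
    """Heuristic guess: 'per-line' if most entries share their box with another entry."""
    if not boxes:
        return "per-character"
    # Count how often each (left, bottom, right, top, page) tuple occurs.
    counts = Counter(box[1:] for box in boxes)
    shared = sum(n for n in counts.values() if n > 1)
    # Assumption: more than half of the entries sharing a box means per-line.
    return "per-line" if shared / len(boxes) > 0.5 else "per-character"
```

Calling classify_boxes(parse_box_text(open("page.box", encoding="utf-8").read())) 
on a file (page.box is a placeholder name) gives a quick sanity check of what a 
tool actually produced. It is only a heuristic and could misjudge very short 
files.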

The training code, specifically *Tesseract::TrainFromBoxes()*, *should* accept 
either format.

As mentioned in this and other posts, the box identification for Chinese 
seems to be quite broken. Like this:
[image: Screenshot 2024-08-05 at 17.56.12.png]

That might or might not be a training issue, but I will try retraining the 
model using *per-line* box files and see if that makes any difference.

Thanks to all.

On Friday, September 6, 2024 at 11:18:44 PM UTC+8 tfmo...@gmail.com wrote:

> That's weird. I posted an answer to this thread yesterday and now, in its 
> place, Google Groups says "Message has been deleted." Let me try again...
>
> This page 
> https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html
> says "lstmbox - Generated by tesseract using lstmbox config from image 
> files - each char uses coordinates of its entire line. This format is also 
> generated by the tesstrain makefile."
>
> Tom
>
