tesstrain is a tested method for training or improving a Tesseract language
model. It creates box files for you.
You can try your own approach, but then your problems are your own, and you
should not expect anybody to adjust the code to your needs.
Of course, you are welcome to contribute your solution.

Zdenko


On Sat, Sep 7, 2024 at 3:55, 'Danny' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> I think this google group is having technical troubles.  I got an email
> about a new post from Menelik Berhan but his message doesn't appear on the
> web.  He said:
>
>
>
> | This might be helpful:
> | https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html
> | And also some details in:
> | https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#making-box-files
>
> Same as what Tom said. Very helpful!
>
> To summarize:
> - Box files always contain one line per character
> - There are two kinds of box files: *per-character* and *per-line* box
> files
> - per-character box files have separate coordinates for each character
> - per-line box files still have one line per character, but the
> coordinates are the same for every character on a text line and represent
> the bounding box of that entire line
>
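To make the summary above concrete, here is a minimal Python sketch that
parses box-file text and guesses which of the two kinds it is. It assumes the
usual box layout of "symbol left bottom right top page" on each line. The
names parse_box_lines and guess_box_kind are hypothetical helpers, not part of
Tesseract or tesstrain, and the check (do several symbols share one box?) is
only a heuristic: a per-line file with a single character on each line would
look per-character to it.

```python
from collections import Counter
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(text: str) -> list[Box]:
    """Parse box-file text, assuming 'symbol left bottom right top page'."""
    boxes = []
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if not line.strip():
            continue  # skip empty lines
        # Split from the right so a symbol that is itself a space or a tab
        # (used as a separator entry in per-line files) is kept intact.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            raise ValueError(f"not a box line: {line!r}")
        symbol, *numbers = parts
        boxes.append(Box(symbol, *(int(n) for n in numbers)))
    return boxes


def guess_box_kind(boxes: list[Box]) -> str:
    """Heuristic: shared coordinates across symbols suggest per-line boxes."""
    # Ignore whitespace symbols, which only act as separators.
    counts = Counter(b[1:] for b in boxes if b.symbol.strip())
    if not counts:
        return "empty"
    return "per-line" if max(counts.values()) > 1 else "per-character"


# Made-up sample data, for illustration only.
per_char = "H 10 20 18 40 0\ni 20 20 24 40 0\n"
per_line = "H 10 20 24 40 0\ni 10 20 24 40 0\n\t 24 20 25 40 0\n"

print(guess_box_kind(parse_box_lines(per_char)))
print(guess_box_kind(parse_box_lines(per_line)))
```

Running a check like this on a generated .box file before training is a quick
way to confirm which format a given tool actually produced.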
> The training code, specifically *Tesseract::TrainFromBoxes()*, *should*
> accept either format.
>
> As mentioned in this and other posts, the box identification for Chinese
> seems to be quite broken. Like this:
> [image: Screenshot 2024-08-05 at 17.56.12.png]
>
> That might or might not be a training issue, but I will try retraining the
> model using *per-line* box files and see if that makes any difference.
>
> Thanks to all.
>
> On Friday, September 6, 2024 at 11:18:44 PM UTC+8 tfmo...@gmail.com wrote:
>
>> That's weird. I posted an answer to this thread yesterday and now, in
>> its place, Google Groups says "Message has been deleted." Let me try
>> again...
>>
>> This page
>> https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html
>> says "lstmbox - Generated by tesseract using lstmbox config from image
>> files - each char uses coordinates of its entire line. This format is also
>> generated by the tesstrain makefile."
>>
>> Tom
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/b0c4e374-2f79-486f-acb4-acf686119ba2n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/b0c4e374-2f79-486f-acb4-acf686119ba2n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

