date:20240903

[tesseract-ocr] What part of the code performs Box Identification?

2024-09-03 Thread 'Danny' via tesseract-ocr

I'm still trying to improve recognition of television subtitles, especially Traditional Chinese (see here). With either the stock *chi_tra* model or our own trained model, recognition fails on certain text. To investigate, I used the API

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-03 Thread 'Danny' via tesseract-ocr

@zdenop wrote:
| Tesseract LSTM engine (tesseract >=v4) training script is based on lines (group of words)
| Box files reflect that. And yes - box files are important.

Zdenko, does this mean a "box file" for LSTM training should wrap the entire text line and NOT the individual characters? Which
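
For illustration only, here is a minimal Python sketch of what a line-level box file could look like. It assumes the convention I believe the tesstrain helper scripts follow: every character of a line is listed with the bounding box of the whole line image, and a tab entry closes the line. The function name `line_box_entries` and the exact coordinates of the terminator are my own assumptions, not something confirmed in this thread, so please check the output against box files produced by your own Tesseract or tesstrain version.

```python
def line_box_entries(text, width, height, page=0):
    """Build box-file entries for ONE text line image (hypothetical helper).

    Each entry has the form "<symbol> <left> <bottom> <right> <top> <page>".
    Assumption: for line-based LSTM training, every character carries the
    box of the entire line, not its own glyph box.
    """
    # Every character gets the same box: the full line image.
    entries = [f"{ch} 0 0 {width} {height} {page}" for ch in text]
    # Assumed end-of-line marker: a tab symbol with a box just past the line.
    entries.append(f"\t {width} {height} {width + 1} {height + 1} {page}")
    return entries


if __name__ == "__main__":
    # Example: a two-character subtitle line in a 200x48 pixel image.
    for entry in line_box_entries("中文", 200, 48):
        print(entry)
```

If this assumption holds, the per-character coordinates in such a file carry no glyph-level position information, which would match the statement that training is based on lines.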