Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-06 Thread 'Danny' via tesseract-ocr
I think this Google Group is having technical troubles. I got an email about a new post from Menelik Berhan, but his message doesn't appear on the web. He said: *| This might be helpful: https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-05 Thread 'Danny' via tesseract-ocr
ithub.com/tesseract-ocr/tesstrain/blob/main/ocrd-testset.zip> Zdenko. On Tue, 3 Sep 2024 at 16:04, 'Danny' via tesseract-ocr <tesser...@googlegroups.com> wrote: > @zdenop wrote: > | Tesseract LSTM engine (tesseract >=v4) training

Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

2024-09-03 Thread 'Danny' via tesseract-ocr
@zdenop wrote: | Tesseract LSTM engine (tesseract >=v4) training script is based on lines (group of words) | Box files reflect that. And yes - box files are important. Zdenko, does this mean a "box file" for LSTM training should wrap the entire text line and NOT the individual characters? Which
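
For anyone following the box file question, here is a minimal sketch of reading box file entries, assuming the usual `<symbol> <left> <bottom> <right> <top> <page>` layout. The `Box` type, both function names, and the sample values are made up for illustration, and `looks_line_level` is only a hypothetical heuristic: it reports whether every entry shares one rectangle, which is what a box file that wraps the whole text line would look like.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so a symbol that is itself a space or tab survives.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


def looks_line_level(boxes: list[Box]) -> bool:
    # Hypothetical heuristic: True when every entry shares one rectangle.
    rects = {(b.left, b.bottom, b.right, b.top) for b in boxes}
    return len(rects) == 1


# Made-up sample: three entries that all carry the same line rectangle.
sample = ["T 0 0 120 32 0", "V 0 0 120 32 0", "\t 0 0 120 32 0"]
boxes = [parse_box_line(s) for s in sample]
```

Running `looks_line_level` over an existing box file is a quick way to tell which style a given tool produced.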

[tesseract-ocr] What part of the code performs Box Identification?

2024-09-03 Thread 'Danny' via tesseract-ocr
I'm still trying to improve recognition of television subtitles, especially traditional Chinese (see here). With either the stock *chi_tra* or our own trained model, it fails on certain text. To investigate, I used the API

[tesseract-ocr] Re: No output when Chinese Traditional followed by dots or ellipsis

2024-08-09 Thread 'Danny' via tesseract-ocr
Yeah, that could be true. But I'm still trying to figure out *where* in the code to put any new segmentation and glyph identification. BTW, to generate additional training data, I wrote a program on the Mac to scrape text from the subtitle images. The resulting OCR output from Apple's Vision frame

[tesseract-ocr] Re: No output when Chinese Traditional followed by dots or ellipsis

2024-08-05 Thread 'Danny' via tesseract-ocr
Hi Tom, Thanks for the suggestion! We've been using PSM 6 (Assume a single uniform block of text) and, for that input image, it outputs nothing for both the stock chi_tra.traineddata and our in-house trained data file. However... I just tried PSM 13 ("Raw line. Treat the image as a single text
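
To make the PSM comparison easy to repeat, here is a minimal sketch that builds the `tesseract` command line for a given page segmentation mode. The function name and the file names are hypothetical; the `-l` and `--psm` options are the standard CLI ones.

```python
def tesseract_command(image, output_base, lang="chi_tra", psm=6):
    # Build the argument list only; nothing is executed here.
    return ["tesseract", image, output_base, "-l", lang, "--psm", str(psm)]


# PSM 6 (single uniform block) versus PSM 13 (raw line) on the same image.
commands = [
    tesseract_command("subtitle.png", "out_psm6", psm=6),
    tesseract_command("subtitle.png", "out_psm13", psm=13),
]
```

Each list can be passed to `subprocess.run` once tesseract is on the PATH, and the two output files can then be compared.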

[tesseract-ocr] Re: Tesseract not working for some single examples.

2024-08-04 Thread 'Danny' via tesseract-ocr
If you can, try pre-processing and inverting the image so it is black text on a white background. I found that recognition works much better with the preprocessing (probably since the models were trained with that kind of input). On Tuesday, July 30, 2024 at 10:45:56 PM UTC+8 allelu...@gmail.co
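
The inversion step can be sketched in plain Python, assuming an 8-bit grayscale image held as a list of pixel rows. The function names and the "invert when the mean is below 128" rule are assumptions for illustration, not something tesseract does itself; in practice an imaging library would do the same arithmetic on real image files.

```python
def invert_gray(pixels):
    # 8-bit inversion: 0 becomes 255, 255 becomes 0.
    return [[255 - value for value in row] for row in pixels]


def mean_level(pixels):
    values = [value for row in pixels for value in row]
    return sum(values) / len(values)


def ensure_dark_on_light(pixels, threshold=128):
    # Assumed heuristic: a dark average means light text on a dark background.
    if mean_level(pixels) < threshold:
        return invert_gray(pixels)
    return pixels


# Made-up 2x4 patch: mostly black background with two white "text" pixels.
patch = [[0, 0, 255, 0], [0, 255, 0, 0]]
prepared = ensure_dark_on_light(patch)
```

The threshold is a guess that works for mostly-background images such as subtitles; busy frames would need something smarter.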

Re: [tesseract-ocr] Re: why are there no new trained models since 2018?

2024-08-02 Thread 'Danny' via tesseract-ocr
I recently retrained the chi_tra model with a new font. The existing model would confuse certain characters. In addition, the source images (I'm decoding TV subtitles) had a weirdly shaped question mark. In the sample below the last two characters output as the number "7". [image: chi_tra_7_0_Q

[tesseract-ocr] Re: Chinise characters.

2024-08-02 Thread 'Danny' via tesseract-ocr
I had many similar issues, especially with input with Yuan (rounded) fonts. In the end I found the exact font used and ran additional training with the new font. Even after retraining some characters would be confused with others (like your case). To strengthen those, I included many instan

[tesseract-ocr] Re: No output when Chinese Traditional followed by dots or ellipsis

2024-08-02 Thread 'Danny' via tesseract-ocr
Can anyone suggest some debug settings I can activate to try to track down why I'm getting no output? Thanks Danny On Tuesday, July 30, 2024 at 8:23:38 PM UTC+8 Danny wrote: > I have a problem where tesseract produces no output (zero byte output > file) when presented with Chinese characters f

[tesseract-ocr] No output when Chinese Traditional followed by dots or ellipsis

2024-07-30 Thread 'Danny' via tesseract-ocr
I have a problem where tesseract produces no output (zero-byte output file) when presented with Chinese characters followed by either an ellipsis or three periods. [image: bad_sub_243.png] If I crop the image in Photoshop to remove the dots, the three Chinese characters are recognized perfectl

[tesseract-ocr] Re: Render Ground Truth from Scratch for Training

2023-10-20 Thread 'Danny' via tesseract-ocr
The docs are pretty bad, so I'm not surprised you didn't find an answer. We also needed to train against an unusual font, so here's our experience. Your situation might be different. 1. The training data needs to be much, much bigger than 100 lines. We took the ".wordlist" file from the language da

[tesseract-ocr] Re: Should box include surrounding space?

2023-10-18 Thread 'Danny' via tesseract-ocr
There are a few "commas" used in CJK, which makes it complicated for me. *FULLWIDTH COMMA U+FF0C* (link), which might have the glyph in the center of the box or in the lower left corner depending on the font: [image: Screenshot 2023-10-18 at 17.19.27.pn
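
To see which "comma" a ground-truth line actually contains, here is a minimal sketch using Python's `unicodedata`. Only U+FF0C comes from the post above; the other code points in `candidates` are my own picks of comma-like characters, and `describe` is a made-up helper name.

```python
import unicodedata

# U+FF0C is the one named above; the others are assumed look-alikes.
candidates = ["\u002c", "\u3001", "\uff0c", "\ufe50", "\uff64"]


def describe(ch):
    # Code point, official Unicode name, and East Asian width class.
    width = unicodedata.east_asian_width(ch)
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} width={width}"


report = [describe(ch) for ch in candidates]
```

Printing `report` for the characters in a transcription shows at a glance whether the fullwidth form or one of the other variants was used, which matters when the glyph position inside the box differs by font.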