Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

'Danny' via tesseract-ocr Tue, 03 Sep 2024 07:05:37 -0700

@zdenop wrote:
| Tesseract LSTM engine (tesseract >=v4) training script is based on lines 
| (group of words)
| Box files reflect that. And yes - box files are important.


Zdenko, does this mean a "box file" for LSTM training should wrap the 
entire text line and NOT the individual characters?
Which is correct for LSTM training:

A) individual boxes like this, or
[image: sub_2.png]
B) One box for entire line:
[image: sub_2 line.png]
Thanks.
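
To make the two options concrete in box-file terms, here is a minimal sketch. It assumes the usual box line layout of `<symbol> <left> <bottom> <right> <top> <page>` and a box file that covers a single text line. The function name `to_line_boxes` is hypothetical and not part of Tesseract or tesstrain. It rewrites per-character boxes (option A) so that every character carries the rectangle enclosing all of them, which is the conversion Mateusz describes in his quoted experiment below.

```python
def to_line_boxes(box_text):
    """Give every character the rectangle that encloses all character boxes.

    Assumes one text line per box file and the field order
    <symbol> <left> <bottom> <right> <top> <page>.
    """
    rows = []
    for raw in box_text.splitlines():
        if not raw.strip():
            continue
        # Split from the right so a symbol that is itself a space survives.
        symbol, left, bottom, right, top, page = raw.rsplit(" ", 5)
        rows.append((symbol, int(left), int(bottom), int(right), int(top), page))
    if not rows:
        return ""
    # Union of all character rectangles.
    left = min(r[1] for r in rows)
    bottom = min(r[2] for r in rows)
    right = max(r[3] for r in rows)
    top = max(r[4] for r in rows)
    return "".join(
        f"{r[0]} {left} {bottom} {right} {top} {r[5]}\n" for r in rows
    )


# Made-up coordinates, for illustration only.
per_char = "T 10 5 20 30 0\ne 22 5 30 22 0\nx 32 5 41 22 0\n"
print(to_line_boxes(per_char), end="")
```

The file keeps one row per character; only the coordinates change. Whether the trainer treats both layouts the same is exactly the question here; the sketch only shows the file transformation.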

On Sunday, July 14, 2024 at 9:05:48 PM UTC+8 zdenop wrote:

> Ehm:
>
>    1. Tesseract v3 (legacy) engine training is based on characters.
>    2. Tesseract LSTM engine (tesseract >=v4) training script is based on 
>    lines (group of words)
>
> Box files reflect that. And yes - box files are important.
>
>
> Zdenko
>
>
> On Fri, 12 Jul 2024 at 14:14, Mateusz Matela <mateusz...@gmail.com> wrote:
>
>> As an experiment, I ran the training on a small sample produced with 
>> text2image. Then I converted the .box files so that each character is 
>> assigned the common bounding rectangle of all the characters, and ran the 
>> training again. The outputs were identical in both cases. Then I removed 
>> the box files and let the training script autogenerate them. In that case 
>> the reported error rates were crazy, like 99% instead of 0.5%.
>> This suggests that conclusion 3 is correct.
>>
>> On Wednesday, 10 July 2024 at 15:17:07 UTC+2, Mateusz Matela wrote:
>>
>>> Hi all,
>>>
>>> Sorry if double posting, my previous message didn't appear and I don't 
>>> see any info about waiting for acceptance or something.
>>> I was searching for this topic in this forum and it was mentioned a few 
>>> times, but I couldn't find a clear and definitive explanation.
>>>
>>> How does the information put in the .box files affect the training 
>>> process? The file contains coordinates for each character in the txt file, 
>>> but the documentation says that since Tesseract 4.0 the model operates on 
>>> the level of whole lines. Some tools like text2image generate the .box 
>>> files with accurate coordinates for each character. When the .box files are 
>>> missing the tesstrain Makefile generates them using generate_line_box.py, 
>>> which assigns the same full image area to each character.
>>>
>>> I see 3 possible conclusions, which one is closest to the truth?
>>>
>>> 1. The .box files do not affect the LSTM training at all and are just a 
>>> leftover from the times of Tesseract 3. In that case, ideally in the future 
>>> they could be completely dropped or only required/generated when 
>>> specifically working with the legacy engine.
>>>
>>> 2. There is still a chance that training will work better with exact 
>>> coordinates and the generate_line_box.py is just a cheap workaround that 
>>> could be improved on in the future.
>>>
>>> 3. The .box file is still important in case you prefer to define the 
>>> coordinates for the text in the image instead of cropping the image. The 
>>> granularity of the coordinates is not important, as Tesseract will just work 
>>> on a box that encapsulates all of the character boxes. Even if confusing, 
>>> this approach is still better than having different .box file formats for 
>>> LSTM and the legacy engine.
>>>
>>> I'll be grateful for any wisdom on this.
>>>
>>> Thanks
>>> Mateusz
>>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/b17225d5-2b78-41bd-994f-05305b9a443dn%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/tesseract-ocr/b17225d5-2b78-41bd-994f-05305b9a443dn%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>
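
For reference, the fallback described in the quoted question (every character gets the same full-image box) can be sketched roughly like this. This is not the actual generate_line_box.py from tesstrain, just an illustration of the idea. The function name `full_image_boxes` is hypothetical, and the image width and height are passed in as plain numbers instead of being read from the image file. The real script may do more than this (for example around combining characters or line endings).

```python
def full_image_boxes(text, width, height, page=0):
    """Return one box line per character, each spanning the whole line image.

    Field order: <symbol> <left> <bottom> <right> <top> <page>.
    Illustration only; not the tesstrain implementation.
    """
    return "".join(f"{ch} 0 0 {width} {height} {page}\n" for ch in text)


# Made-up ground truth and image size, for illustration only.
print(full_image_boxes("Text", 120, 32), end="")
```

Every row ends up with identical coordinates, so the file carries the transcription and the extent of the line but no per-character positions.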

