Hi all,

Sorry if this is a double post; my previous message didn't appear, and I 
don't see any information about posts waiting for approval.
I searched this forum for the topic and found it mentioned a few times, 
but I couldn't find a clear and definitive explanation.

How does the information put in the .box files affect the training process? 
The file contains coordinates for each character in the txt file, but the 
documentation says that since Tesseract 4.0 the model operates at the level 
of whole lines. Some tools, like text2image, generate the .box files with 
accurate coordinates for each character. When the .box files are missing, 
the tesstrain Makefile generates them using generate_line_box.py, which 
assigns the same full image area to each character.
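
For illustration, here is a minimal Python sketch of the kind of output I 
mean. It is not the actual generate_line_box.py, only my simplified 
approximation of its effect; the function name full_image_boxes and the 
example dimensions are made up. It assumes the usual box line layout of 
"<symbol> <left> <bottom> <right> <top> <page>".

```python
def full_image_boxes(text, width, height, page=0):
    """Return one box line per character, all sharing the full image area.

    Simplified sketch: it ignores combining characters and any
    end-of-line marker, which a real script would need to handle.
    """
    lines = []
    for char in text:
        # Every character gets the same box: the whole line image.
        lines.append(f"{char} 0 0 {width} {height} {page}")
    return lines


if __name__ == "__main__":
    # Hypothetical 120x32 pixel line image containing the text "Hi!".
    for box_line in full_image_boxes("Hi!", width=120, height=32):
        print(box_line)
```

So the only information such a file carries beyond the text itself is the 
image size, which is why I wonder whether the coordinates matter at all.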

I see three possible conclusions. Which one is closest to the truth?

1. The .box files do not affect LSTM training at all and are just a 
leftover from the Tesseract 3 era. In that case, ideally they could be 
dropped completely in the future, or only required/generated when 
specifically working with the legacy engine.

2. There is still a chance that training will work better with exact 
coordinates, and generate_line_box.py is just a cheap workaround that 
could be improved on in the future.

3. The .box file is still important if you prefer to define the 
coordinates of the text in the image instead of cropping the image. The 
granularity of the coordinates is not important, as Tesseract will just work 
on a box that encapsulates all of the character boxes. Even if confusing, 
this approach is still better than having different .box file formats for 
LSTM and the legacy engine.
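
To make option 3 concrete: by "a box that encapsulates all of the character 
boxes" I mean something like the union computed below. This is only a sketch 
of my guess, not Tesseract's actual code; enclosing_box and the sample 
coordinates are hypothetical.

```python
def enclosing_box(boxes):
    """Return the smallest (left, bottom, right, top) covering all boxes."""
    if not boxes:
        raise ValueError("need at least one box")
    # Transpose the list of boxes into four coordinate sequences.
    lefts, bottoms, rights, tops = zip(*boxes)
    return (min(lefts), min(bottoms), max(rights), max(tops))


if __name__ == "__main__":
    # Hypothetical per-character boxes, as an accurate .box file might give.
    char_boxes = [(10, 5, 22, 30), (24, 5, 31, 28), (33, 2, 40, 30)]
    print(enclosing_box(char_boxes))
```

If this guess is right, accurate per-character boxes and identical 
full-image boxes would differ only in how tightly the resulting box fits 
the text.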

I'll be grateful for any wisdom on this.

Thanks
Mateusz
