As an experiment, I ran the training on a small sample produced with text2image. Then I converted the .box files so that every character is assigned the same rectangle, the common bounding box of all the characters, and ran the training again. The outputs were identical in both cases, so the per-character granularity does not seem to matter. Then I removed the .box files and let the training script autogenerate them. In that case the reported error rates were way off, around 99% instead of 0.5%, so the overall extent of the boxes clearly does matter. This suggests that conclusion 3 is correct.
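For anyone who wants to repeat the conversion step, here is a minimal sketch. It is not the exact script used in the experiment, and the function name `unify_boxes` is made up for illustration. It assumes the usual box line layout `<symbol> <left> <bottom> <right> <top> <page>` and a single-line image, so one union rectangle is computed for the whole file. The split is done from the right because the symbol itself can be a space or a tab.

```python
def unify_boxes(box_lines):
    """Return box lines where every entry gets the union of all boxes.

    Assumes each line is '<symbol> <left> <bottom> <right> <top> <page>'
    and that the whole file describes a single text line.
    """
    entries = []
    for line in box_lines:
        line = line.rstrip("\r\n")  # do not strip(): the symbol may be a space
        if not line:
            continue
        # Split from the right so a space or tab symbol is kept intact.
        symbol, left, bottom, right, top, page = line.rsplit(" ", 5)
        entries.append((symbol, int(left), int(bottom), int(right), int(top), page))

    if not entries:
        return []

    # Union rectangle: smallest left/bottom, largest right/top.
    left = min(e[1] for e in entries)
    bottom = min(e[2] for e in entries)
    right = max(e[3] for e in entries)
    top = max(e[4] for e in entries)

    return [f"{e[0]} {left} {bottom} {right} {top} {e[5]}" for e in entries]
```

To apply it, read the lines of a .box file, pass them to `unify_boxes`, and write the result back with newline separators. For a multi-line image the union would presumably have to be computed per text line instead of per file.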
On Wednesday, July 10, 2024 at 15:17:07 UTC+2, Mateusz Matela wrote:
> Hi all,
>
> Sorry if double posting, my previous message didn't appear and I don't see
> any info about waiting for acceptance or something.
> I was searching for this topic in this forum and it was mentioned a few
> times, but I couldn't find a clear and definitive explanation.
>
> How does the information put in the .box files affect the training
> process? The file contains coordinates for each character in the txt file,
> but the documentation says that since Tesseract 4.0 the model operates on
> the level of whole lines. Some tools like text2image generate the .box
> files with accurate coordinates for each character. When the .box files are
> missing the tesstrain Makefile generates them using generate_line_box.py,
> which assigns the same full image area to each character.
>
> I see 3 possible conclusions, which one is closest to the truth?
>
> 1. The .box files do not affect the LSTM training at all and are just a
> leftover from the times of Tesseract 3. In that case, ideally in the future
> they could be completely dropped or only required/generated when
> specifically working with the legacy engine.
>
> 2. There is still a chance that training will work better with exact
> coordinates and the generate_line_box.py is just a cheap workaround that
> could be improved on in the future.
>
> 3. The .box file is still important in case you prefer to define the
> coordinates for the text in the image instead of cropping the image. The
> granularity of the coordinates is not important as Tesseract will just work
> on a box that encapsulates all of the character boxes. Even if confusing,
> this approach is still better than having different .box file formats for
> LSTM and the legacy engine.
>
> I'll be grateful for any wisdom on this.
>
> Thanks
> Mateusz