@zdenop wrote:
| Tesseract LSTM engine (tesseract >=v4) training script is based on lines (group of words)
| Box files reflect that. And yes - box files are important.

Zdenko, does this mean a "box file" for LSTM training should wrap the entire text line and NOT the individual characters?

Which is correct for LSTM training:

A) Individual boxes like this, or
[image: sub_2.png]

B) One box for the entire line:
[image: sub_2 line.png]

Thanks.

On Sunday, July 14, 2024 at 9:05:48 PM UTC+8 zdenop wrote:

> Ehm:
>
> 1. Tesseract v3 (legacy) engine training is based on characters.
> 2. Tesseract LSTM engine (tesseract >=v4) training script is based on
>    lines (group of words)
>
> Box files reflect that. And yes - box files are important.
>
> Zdenko
>
> On Fri, 12 Jul 2024 at 14:14, Mateusz Matela <mateusz...@gmail.com> wrote:
>
>> As an experiment, I ran the training on a small sample produced with
>> text2image. Then I converted the .box files so that each character is
>> assigned the common bounding rectangle of all the characters and ran the
>> training again. The outputs were identical in both cases. Then I removed
>> the box files and let the training script autogenerate them. In that case
>> the reported error rates were crazy, like 99% instead of 0.5%.
>> This suggests that conclusion 3 is correct.
>>
>> On Wednesday, 10 July 2024 at 15:17:07 UTC+2, Mateusz Matela wrote:
>>
>>> Hi all,
>>>
>>> Sorry if this is a double post; my previous message didn't appear and I
>>> don't see any info about waiting for acceptance or something.
>>> I searched this forum for this topic and it was mentioned a few
>>> times, but I couldn't find a clear and definitive explanation.
>>>
>>> How does the information put in the .box files affect the training
>>> process? The file contains coordinates for each character in the txt file,
>>> but the documentation says that since Tesseract 4.0 the model operates on
>>> the level of whole lines. Some tools like text2image generate the .box
>>> files with accurate coordinates for each character. When the .box files are
>>> missing, the tesstrain Makefile generates them using generate_line_box.py,
>>> which assigns the same full image area to each character.
>>>
>>> I see 3 possible conclusions. Which one is closest to the truth?
>>>
>>> 1. The .box files do not affect the LSTM training at all and are just a
>>>    leftover from the times of Tesseract 3. In that case, ideally in the
>>>    future they could be completely dropped or only required/generated when
>>>    specifically working with the legacy engine.
>>>
>>> 2. There is still a chance that training will work better with exact
>>>    coordinates, and generate_line_box.py is just a cheap workaround that
>>>    could be improved on in the future.
>>>
>>> 3. The .box file is still important in case you prefer to define the
>>>    coordinates for the text in the image instead of cropping the image. The
>>>    granularity of the coordinates is not important, as Tesseract will just
>>>    work on a box that encapsulates all of the character boxes. Even if
>>>    confusing, this approach is still better than having different .box file
>>>    formats for LSTM and the legacy engine.
>>>
>>> I'll be grateful for any wisdom on this.
>>>
>>> Thanks
>>> Mateusz

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ba9b210d-a38e-446d-80e1-4d22b213f210n%40googlegroups.com.
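
For anyone who wants to repeat the conversion step from Mateusz's experiment (every character gets the common bounding rectangle of all characters), here is a minimal Python sketch. It assumes the usual box file layout of one entry per line, `symbol left bottom right top page`, and one text line per image. The function names are made up for illustration and are not part of tesstrain or Tesseract. Every entry is treated alike, including whitespace entries, so adjust that if your files mark the end of a line with a special entry.

```python
from typing import List, Tuple

# One box-file entry: symbol, left, bottom, right, top, page
Box = Tuple[str, int, int, int, int, int]


def parse_box_text(box_text: str) -> List[Box]:
    """Parse box-file text, one entry per line, keeping space/tab symbols."""
    boxes: List[Box] = []
    for raw in box_text.split("\n"):
        raw = raw.rstrip("\r")
        if not raw.strip():
            continue  # skip blank lines
        # Split from the right so a symbol that is itself a space or tab survives.
        symbol, left, bottom, right, top, page = raw.rsplit(" ", 5)
        boxes.append((symbol, int(left), int(bottom), int(right), int(top), int(page)))
    return boxes


def merge_to_line_box(boxes: List[Box]) -> List[Box]:
    """Give every entry the rectangle that encloses all entries."""
    if not boxes:
        return []
    left = min(b[1] for b in boxes)
    bottom = min(b[2] for b in boxes)
    right = max(b[3] for b in boxes)
    top = max(b[4] for b in boxes)
    return [(b[0], left, bottom, right, top, b[5]) for b in boxes]


def format_box_text(boxes: List[Box]) -> str:
    """Write entries back in the same one-entry-per-line layout."""
    return "".join(f"{s} {l} {b} {r} {t} {p}\n" for s, l, b, r, t, p in boxes)


if __name__ == "__main__":
    # Hypothetical per-character boxes for a single text line.
    sample = "T 10 5 20 30 0\ne 22 5 30 22 0\n  31 5 35 22 0\nx 36 5 44 22 0\n"
    print(format_box_text(merge_to_line_box(parse_box_text(sample))), end="")
```

With the sample above, every entry ends up with the same rectangle, `10 5 44 30`, while the symbols and page numbers stay unchanged. Whether training treats this identically to per-character boxes is what the experiment in the thread suggests, not something this sketch proves.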