Re: [tesseract-ocr] Re: Tesseract training ground truth: I'm confused about the box files

Zdenko Podobny Sun, 14 Jul 2024 06:05:42 -0700

Ehm:

   1. Tesseract v3 (legacy) engine training is based on characters.
   2. Tesseract LSTM engine (tesseract >= v4) training is based on
   lines (groups of words).


Box files reflect that. And yes - box files are important.
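
To make the line-based idea concrete, here is a minimal sketch of a box file in which every character of one text line shares the same rectangle, namely the whole line image. The function name and the sample values are made up for illustration. The tab entry used as an end-of-line marker is an assumption modelled on what the tesstrain helper writes, and combining characters are not handled here.

```python
def line_box_entries(text, width, height):
    """Build box-file lines for one text line of a width x height image.

    Each entry has the form: <symbol> <left> <bottom> <right> <top> <page>.
    Every character gets the same box covering the full image.
    """
    entries = [f"{ch} 0 0 {width} {height} 0" for ch in text]
    # Assumed end-of-line marker: a tab symbol placed just outside the image.
    entries.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return entries


if __name__ == "__main__":
    # Hypothetical line image of 120 x 32 pixels containing the text "Hi all".
    print("\n".join(line_box_entries("Hi all", 120, 32)))
```

The point is only that the per-character coordinates carry no extra information in such a file: what matters is the rectangle of the line they describe.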


Zdenko


On Fri, 12 Jul 2024 at 14:14, Mateusz Matela <mateusz.mat...@gmail.com> wrote:

> As an experiment, I ran the training on a small sample produced with
> text2image. Then I converted the .box files so that each character is
> assigned the common bounding rectangle of all the characters and ran the
> training again. The outputs were identical in both cases. Then I removed
> the box files and let the training script autogenerate them. In that case
> the reported error rates were crazy, like 99% instead of 0.5%.
> This suggests that conclusion 3 is correct.
>
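
For anyone repeating that experiment, the conversion step could be sketched roughly as below. This is not the script that was actually used; the function names are hypothetical, and it assumes the box file describes a single text line, so that one union rectangle over all entries makes sense.

```python
def parse_box_line(line):
    """Parse '<symbol> <left> <bottom> <right> <top> <page>'.

    Splitting from the right keeps a space or tab symbol intact.
    """
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return symbol, int(left), int(bottom), int(right), int(top), int(page)


def union_boxes(lines):
    """Give every entry the rectangle that encloses all entries."""
    boxes = [parse_box_line(l) for l in lines if l.strip("\n")]
    left = min(b[1] for b in boxes)
    bottom = min(b[2] for b in boxes)
    right = max(b[3] for b in boxes)
    top = max(b[4] for b in boxes)
    return [f"{b[0]} {left} {bottom} {right} {top} {b[5]}" for b in boxes]


if __name__ == "__main__":
    # Made-up per-character boxes for a two-letter line.
    sample = ["a 2 3 10 20 0", "b 12 1 22 18 0"]
    print("\n".join(union_boxes(sample)))
```

If training on the converted file gives the same result as training on the original, that is consistent with only the enclosing rectangle being used.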
> On Wednesday, 10 July 2024 at 15:17:07 UTC+2, Mateusz Matela wrote:
>
>> Hi all,
>>
>> Sorry if double posting, my previous message didn't appear and I don't
>> see any info about waiting for acceptance or something.
>> I was searching for this topic in this forum and it was mentioned a few
>> times, but I couldn't find a clear and definitive explanation.
>>
>> How does the information put in the .box files affect the training
>> process? The file contains coordinates for each character in the txt file,
>> but the documentation says that since Tesseract 4.0 the model operates on
>> the level of whole lines. Some tools like text2image generate the .box
>> files with accurate coordinates for each character. When the .box files are
>> missing the tesstrain Makefile generates them using generate_line_box.py,
>> which assigns the same full image area to each character.
>>
>> I see 3 possible conclusions, which one is closest to the truth?
>>
>> 1. The .box files do not affect the LSTM training at all and are just a
>> leftover from the times of Tesseract 3. In that case, ideally in the future
>> they could be completely dropped or only required/generated when
>> specifically working with the legacy engine.
>>
>> 2. There is still a chance that training will work better with exact
>> coordinates and the generate_line_box.py is just a cheap workaround that
>> could be improved on in the future.
>>
>> 3. The .box file is still important in case you prefer to define the
>> coordinates for the text in the image instead of cropping the image. The
>> granularity of the coordinates is not important as Tesseract will just work
>> on a box that encapsulates all of the character boxes. Even if confusing,
>> this approach is still better than having different .box file formats for
>> LSTM and the legacy engine.
>>
>> I'll be grateful for any wisdom on this.
>>
>> Thanks
>> Mateusz
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/b17225d5-2b78-41bd-994f-05305b9a443dn%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/b17225d5-2b78-41bd-994f-05305b9a443dn%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

