Hi,
On 27/04/2022 19:07, Brad wrote:
For V5.10.0 of Tesseract, one of the changes is:
(correction: version 5.1.0)
Handle image and line separator regions in ALTO, hOCR and text output
formats.
I'm curious about what this means. Can Tesseract be used to identify
rectangles and such on an image that might surround a text region, and if
so, is this what this is referring to? Are there any examples showing how
this works?
Here is the commit in question:
https://github.com/tesseract-ocr/tesseract/commit/424b17f997363670d187f42c43408c472fe55053
(for some background see
https://github.com/tesseract-ocr/tesseract/pull/3710)
For hOCR, for example, the added output elements are "ocr_photo" and
"ocr_separator". You can see how the results are iterated over in the
source if you would like to do that yourself.
My/our immediate use case is detecting photos on pages of books and
articles, which will be emitted as ocr_photo when outputting hOCR.
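
To pick these regions out of an hOCR file afterwards, something like the
following rough sketch might do. It uses only the Python standard library.
The class names "ocr_photo" and "ocr_separator" are the ones mentioned
above; everything else is an assumption on my part: that the regions are
elements carrying the usual hOCR title="bbox x0 y0 x1 y1" property, and the
SAMPLE string, which is hand-written for illustration and is not real
Tesseract output. The helper names (RegionCollector, find_regions) are
made up for this example.

```python
import re
from html.parser import HTMLParser

# Class names taken from the email; other hOCR classes are ignored.
REGION_CLASSES = {"ocr_photo", "ocr_separator"}

# Assumption: regions carry the usual hOCR "bbox x0 y0 x1 y1" title property.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class RegionCollector(HTMLParser):
    """Collect (class name, bbox) pairs for photo and separator regions."""

    def __init__(self):
        super().__init__()
        self.regions = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        hit = classes & REGION_CLASSES
        match = BBOX_RE.search(attrs.get("title") or "")
        if hit and match:
            bbox = tuple(int(v) for v in match.groups())
            self.regions.append((sorted(hit)[0], bbox))


def find_regions(hocr_text):
    """Return a list of (class name, (x0, y0, x1, y1)) from hOCR markup."""
    parser = RegionCollector()
    parser.feed(hocr_text)
    parser.close()
    return parser.regions


# Hand-written sample for illustration only, NOT real Tesseract output.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
  <div class='ocr_photo' id='block_1_1' title="bbox 100 120 500 640"></div>
  <div class='ocr_separator' id='block_1_2' title="bbox 90 700 910 704"></div>
  <p class='ocr_par' title="bbox 100 720 900 900">Some text</p>
</div>
"""

regions = find_regions(SAMPLE)
for name, bbox in regions:
    print(name, bbox)
```

The hOCR itself can be produced with "tesseract page.png out hocr", where
page.png and out are placeholder names; the markup you actually get may
differ from the sample above, so check it against a real output file first.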
I don't know if this helps with your specific use case, but if you're
interested in finding images on a page, it will help for sure. I cannot
really say much about the ocr_separator part.
Regards,
Merlijn