Re: [tesseract-ocr] "Line separator regions" capabilities?

Merlijn B.W. Wajer Wed, 27 Apr 2022 10:18:39 -0700

Hi,

On 27/04/2022 19:07, Brad wrote:

For V5.10.0 of Tesseract, one of the changes is:


(correction: version 5.1.0)

Handle image and line separator regions in ALTO, hOCR and text output formats.

I'm curious about what this means. Can Tesseract be used to identify
rectangles and such on an image that might surround a text region, and if
so, is this what this is referring to? Are there any examples showing how
this works?

Here is the commit in question: https://github.com/tesseract-ocr/tesseract/commit/424b17f997363670d187f42c43408c472fe55053 (for some background see https://github.com/tesseract-ocr/tesseract/pull/3710)

The output added to, say, hOCR is "ocr_photo" and "ocr_separator". You can see how the results are iterated over in the source if you would like to use that yourself.

My/our immediate use case is detecting photos on pages of books and articles, which will be emitted as ocr_photo when outputting hOCR.

I don't know if this can help in your specific use case, but if you're interested in finding images, it will help for sure. I cannot really comment much on the ocr_separator parts.
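
As a rough illustration of picking these regions out of hOCR output, here is a minimal Python sketch using only the standard library. It assumes the regions appear as elements whose class attribute contains ocr_photo or ocr_separator and whose title attribute carries a "bbox x0 y0 x1 y1" entry, as is usual in hOCR. The sample markup, the RegionCollector class and the find_regions helper are made up for this example and are not taken from real Tesseract output or its API, so please check against an actual hOCR file.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class RegionCollector(HTMLParser):
    """Collect (class, bbox) pairs for the hOCR classes of interest."""

    def __init__(self, wanted=("ocr_photo", "ocr_separator")):
        super().__init__()
        self.wanted = set(wanted)
        self.regions = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if match is None:
            return  # no bounding box on this element, nothing to record
        bbox = tuple(int(v) for v in match.groups())
        for cls in classes:
            if cls in self.wanted:
                self.regions.append((cls, bbox))


def find_regions(hocr_text):
    """Return a list of (class_name, (x0, y0, x1, y1)) in document order."""
    parser = RegionCollector()
    parser.feed(hocr_text)
    return parser.regions


# Hand-written sample for illustration only; real output may differ in detail.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
  <div class='ocr_photo' id='block_1_1' title='bbox 100 120 500 480'></div>
  <div class='ocr_separator' id='block_1_2' title='bbox 90 500 910 504'></div>
  <p class='ocr_par' title='bbox 100 520 900 700'>text</p>
</div>
"""

if __name__ == "__main__":
    for name, bbox in find_regions(SAMPLE):
        print(name, bbox)
```

To try it on a real page, you could generate hOCR with something like "tesseract page.png out hocr" and pass the contents of out.hocr to find_regions.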


Regards,
Merlijn

