On Thursday, October 17, 2024 at 6:29:51 PM UTC-4 paul...@gmail.com wrote:
> There must be some parameter that would force tesseract to return ALL text blocks, not just the ones it considers more significant (which the large paragraphs are).

Your investigations seem to confirm what has been widely reported previously: Tesseract's page segmentation performs poorly on use cases which diverge greatly from what it was designed for, namely large blocks of book-style text.

I would suggest that you do your own page segmentation first and then feed the resulting text segments to Tesseract for recognition. The search phrase "scene text detection" might give you some starting points to investigate.

Tom
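To make the "segment first, then recognize" suggestion concrete, here is a rough sketch in Python. It is only an illustration under several assumptions: dark text on a light background, a fixed binarization threshold, and a naive blob-grouping step standing in for a real text detector. The helper names `find_text_regions` and `recognize_regions` and the `threshold` and `gap` parameters are made up for this example. The OCR step assumes the Pillow and pytesseract packages are installed.

```python
from collections import deque


def find_text_regions(binary, gap=1):
    """Return inclusive bounding boxes (left, top, right, bottom) of ink blobs.

    `binary` is a list of rows of 0/1 values (1 = ink). Ink pixels that lie
    within `gap` pixels of each other, horizontally and vertically, are
    grouped into the same region. This is a crude stand-in for a real
    text detector.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Flood-fill outward from this unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for ny in range(max(0, cy - gap), min(height, cy + gap + 1)):
                    for nx in range(max(0, cx - gap), min(width, cx + gap + 1)):
                        if binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


def recognize_regions(image_path, threshold=128, gap=10):
    """Segment the page ourselves, then run Tesseract on each crop."""
    # Imported lazily so the segmentation helper works without these packages.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path).convert("L")  # 8-bit grayscale
    width, height = image.size
    pixels = image.load()
    # Assumption: dark text on a light background.
    binary = [
        [1 if pixels[x, y] < threshold else 0 for x in range(width)]
        for y in range(height)
    ]
    results = []
    for left, top, right, bottom in find_text_regions(binary, gap):
        # Pillow's crop box excludes the right and lower edges, hence the +1.
        crop = image.crop((left, top, right + 1, bottom + 1))
        # --psm 6 tells Tesseract to treat the crop as one uniform block of text.
        text = pytesseract.image_to_string(crop, config="--psm 6")
        results.append(((left, top, right, bottom), text.strip()))
    return results
```

The pure-Python flood fill will be slow on full-size scans and will also pick up specks of noise, so treat it as a placeholder for whatever detector you end up choosing. The point is only the shape of the pipeline: your code decides where the text regions are, and Tesseract is asked to recognize one region at a time. Adding a few pixels of padding around each crop may also help recognition.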