On Thursday, October 17, 2024 at 6:29:51 PM UTC-4 paul...@gmail.com wrote:


There must be some parameter that would force tesseract to return ALL text 
blocks, not just the ones it considers more significant (which the large 
paragraphs are).

 
Your investigations seem to confirm what has been widely reported 
previously: Tesseract's page segmentation performs poorly on use cases 
that diverge greatly from what it was designed for, namely large blocks 
of book-style text.

I would suggest that you do your own page segmentation first and then feed 
the resulting text segments to Tesseract for recognition. The search phrase 
"scene text detection" might give you some starting points to investigate.

Tom
