Thanks for the reply. I am now doing the text detection with OpenCV/EAST and then passing the bounding boxes to Tesseract like this:

    api_->SetPageSegMode(tesseract::PSM_SPARSE_TEXT);
    api_->SetRectangle(rect->x - 1, rect->y - 1, rect->width + 2, rect->height + 2);
    api_->Recognize(nullptr);

Adding an extra pixel on each side is a trick which, for some reason, increases the recognition accuracy a lot, even though the original bounding box detected by EAST already has some space around the characters. Adding more space, however, decreases the accuracy. The best padding obviously varies from image to image, so I have to make multiple attempts with different settings, which makes the overall process very slow.

Setting the PSM to PSM_RAW_LINE or PSM_SINGLE_BLOCK doesn't really make a difference.

Am I missing something?

On Friday, October 18, 2024 at 11:04:37 PM UTC+7 tfmo...@gmail.com wrote:

> On Thursday, October 17, 2024 at 6:29:51 PM UTC-4 paul...@gmail.com wrote:
>
>> There must be some parameter that would force tesseract to return ALL text
>> blocks, not just the ones it considers more significant (which the large
>> paragraphs are).
>
> Your investigations seem to confirm what has been widely reported
> previously - that Tesseract's page segmentation performs poorly on use
> cases which diverge greatly from what it was designed for, namely, large
> blocks of book style text.
>
> I would suggest that you do your own page segmentation first and then feed
> the resulting text segments to Tesseract for recognition. The search phrase
> "scene text detection" might give you some starting points to investigate.
>
> Tom
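
For reference, the padding step in my SetRectangle call could be factored into a small helper that also keeps the padded box inside the image. This is only a sketch under assumptions: `Box` and `PadAndClamp` are hypothetical names (`Box` stands in for `cv::Rect`), and the clamping is not part of my current code. I added it here because `rect->x - 1` becomes negative for a box that touches the left or top edge.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for cv::Rect: top-left corner plus size, in pixels.
struct Box {
    int x;
    int y;
    int width;
    int height;
};

// Grow `box` by `pad` pixels on every side, then clip the result to an
// image of image_width x image_height pixels so that no coordinate falls
// outside the image.
Box PadAndClamp(const Box& box, int pad, int image_width, int image_height) {
    assert(pad >= 0 && image_width > 0 && image_height > 0);
    const int left   = std::max(box.x - pad, 0);
    const int top    = std::max(box.y - pad, 0);
    const int right  = std::min(box.x + box.width + pad, image_width);
    const int bottom = std::min(box.y + box.height + pad, image_height);
    // A box lying entirely outside the image collapses to zero size.
    return Box{left, top, std::max(right - left, 0), std::max(bottom - top, 0)};
}
```

The result would then be passed on as `api_->SetRectangle(padded.x, padded.y, padded.width, padded.height)`, and trying a different padding only means changing the `pad` argument.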