[tesseract-ocr] How to get para & line boxes of word from ResultIterator

Lakshman Kumar Wed, 19 Jul 2023 23:18:10 -0700

Hi All,

I am currently running OCR line by line and getting word details from the
ResultIterator, as shown below:


tessAPI->SetPageSegMode(tesseract::PageSegMode::PSM_SINGLE_LINE);
// These line boxes are calculated by our pre-processing and segmentation code.
tessAPI->SetRectangle(iXmin, iYmin, iW, iH);
tessAPI->Recognize(nullptr);
tesseract::ResultIterator* rst_iter = tessAPI->GetIterator();
tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
if (nullptr != rst_iter)
{
    bool is_bold, is_italic, is_underlined, is_monospace, is_serif, is_smallcaps;
    int pointsize, font_id;
    do
    {
        // GetUTF8Text returns a string that the caller must free with delete[].
        char* text = rst_iter->GetUTF8Text(level);
        rst_iter->WordFontAttributes(&is_bold, &is_italic, &is_underlined,
                                     &is_monospace, &is_serif, &is_smallcaps,
                                     &pointsize, &font_id);
        // Here I want to get the line and paragraph that the current word
        // belongs to from the Tesseract API.
        delete[] text;
    } while (rst_iter->Next(level));
    delete rst_iter;  // the iterator returned by GetIterator must be deleted
}

I can get paragraphs, lines, and words using tessAPI->GetComponentImages(),
but for words I can only get the block and paragraph, not the line. I am
mapping those words to lines myself, but I still get some garbage results.
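
For reference, a minimal sketch of the kind of word-to-line mapping described above. It is only an illustration under stated assumptions, not the actual code in use: the Box struct and the assignWordsToLines function are hypothetical names, boxes are assumed to be (left, top, right, bottom) in the same image coordinates, and each word is assigned to the line box it overlaps most by area.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical box type: pixel coordinates in the same image space.
struct Box {
    int left, top, right, bottom;
};

// Area of the intersection of two boxes (0 if they do not intersect).
long long intersectionArea(const Box& a, const Box& b) {
    const int w = std::min(a.right, b.right) - std::max(a.left, b.left);
    const int h = std::min(a.bottom, b.bottom) - std::max(a.top, b.top);
    if (w <= 0 || h <= 0) {
        return 0;
    }
    return static_cast<long long>(w) * h;
}

// For each word, return the index of the line box with the largest
// intersection area, or -1 if the word does not intersect any line.
std::vector<int> assignWordsToLines(const std::vector<Box>& words,
                                    const std::vector<Box>& lines) {
    std::vector<int> lineOfWord(words.size(), -1);
    for (std::size_t w = 0; w < words.size(); ++w) {
        long long best = 0;
        for (std::size_t l = 0; l < lines.size(); ++l) {
            const long long area = intersectionArea(words[w], lines[l]);
            if (area > best) {
                best = area;
                lineOfWord[w] = static_cast<int>(l);
            }
        }
    }
    return lineOfWord;
}
```

A word that straddles two line boxes goes to whichever line covers more of it, and a word that touches no line box is reported as -1 so it can be inspected separately instead of being attached to the wrong line.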

Is there any way to get the line and paragraph that the current word belongs to?

Thanks in advance,
Lakshman.

