Output of tesseract is not as useful without font baseline information?

chris Wed, 11 Sep 2013 08:38:52 -0700

I'm trying to OCR some PDFs I have, and it's mostly successful - I'm using 
GhostScript to convert my PDF pages into images, and I'm feeding those 
images into Tesseract, and the magic is happening.


What I'm struggling with is that the position information Tesseract gives 
me doesn't seem to be enough to position the characters for display, or to 
create a content stream to insert back into my PDF file.

If I want to create a PDF stream that draws the OCR'd text in place, I need 
to know where the baseline of each character is. If I want to display the 
OCR output on the Windows screen I have more flexibility, but only if I 
assume that Windows renders with exactly the same font that Tesseract was 
trained on, which is probably not a safe assumption.
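
To illustrate why the baseline matters for the PDF case, here is a minimal 
sketch of the kind of content-stream fragment I mean. The operators (BT, Tf, 
Td, Tj, ET) are standard PDF text operators, and Td places the text origin, 
which sits on the baseline. The function name, the font resource name "F1" 
and the default size are assumptions for illustration only; the font would 
have to exist in the page's resource dictionary, and non-ASCII text would 
need proper encoding.

```python
def text_show_ops(text, x, baseline_y, font_name="F1", font_size=12.0):
    """Build a PDF content-stream fragment that shows `text` with its
    origin at (x, baseline_y), in PDF user space (y increasing upward).

    `font_name` is a hypothetical resource name, not something Tesseract
    or GhostScript provides.
    """
    # Backslash and parentheses must be escaped inside a PDF literal string.
    escaped = (
        text.replace("\\", "\\\\")
            .replace("(", "\\(")
            .replace(")", "\\)")
    )
    return (
        f"BT /{font_name} {font_size:g} Tf "
        f"{x:g} {baseline_y:g} Td ({escaped}) Tj ET"
    )


if __name__ == "__main__":
    print(text_show_ops("Will o the wisp", 72, 700))
```

The y value passed to Td is the baseline, not the bottom of the glyph box, 
which is why a bounding box alone is not enough for a letter like "p".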

So I've created an image that contains the text "Will o the wisp", and the 
OCR is working well.  For the "W" I get back the bounds of that glyph, and 
for that particular character it's probably safe to assume the bottom of 
the glyph is on the baseline.  However, for the "p" I also get back only 
the bounds of the glyph, so the bottom y-coordinate is the baseline minus 
the descent, which leaves me with no way to determine where its baseline 
is.  So how do I draw it accurately?
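
One workaround I can imagine, sketched below under several assumptions, is 
to estimate a per-line baseline from the glyphs that do not descend: take 
the median bottom edge of the non-descending letters and digits on the 
line. This is only a heuristic, not anything Tesseract provides. The 
descender set, the tuple layout and the function name are all made up for 
illustration, coordinates are assumed to have y increasing upward, and the 
text line is assumed to be horizontal (no skew).

```python
from statistics import median

# Letters that usually hang below the baseline in Latin typefaces.
# This set is an assumption and varies by font (italics, old-style digits).
DESCENDERS = set("gjpqy")


def estimate_baseline(glyphs):
    """Estimate the baseline y of one horizontal text line.

    glyphs: iterable of (char, left, bottom, right, top) with y increasing
    upward. Only letters and digits without descenders are used, since
    punctuation such as quotes or hyphens does not sit on the baseline.
    """
    bottoms = [
        bottom
        for ch, _left, bottom, _right, _top in glyphs
        if ch.isalnum() and ch not in DESCENDERS
    ]
    if not bottoms:
        raise ValueError("no non-descending glyphs to estimate from")
    # The median tolerates round letters (o, s) that overshoot slightly.
    return median(bottoms)


if __name__ == "__main__":
    # Made-up boxes for the word "wisp"; these are not real OCR output.
    line = [
        ("w", 10, 20, 30, 36),
        ("i", 32, 20, 36, 42),
        ("s", 38, 19, 48, 36),
        ("p", 50, 12, 62, 36),
    ]
    baseline = estimate_baseline(line)
    print(baseline, baseline - line[-1][2])  # baseline and descent of "p"
```

For a skewed scan a single y value would not be enough; fitting a straight 
line through the same bottom edges would be the natural extension.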

I can ask Windows for the bounds of each glyph in its font, and I can use 
that information to estimate the baseline in the Tesseract-generated data, 
but I'm finding that to be rather inaccurate.

I assume other people have solved this problem already. Is there something 
obvious I'm missing?

Thanks,
Chris

p.s. I realize there is software out there that will OCR PDFs and do this 
work for me - for my project, the OCR is part of a larger process and so I 
really need to have more manual control.

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to
tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
