On Wednesday, September 11, 2013 at 11:14:22 PM UTC+8, ch...@sc3.net wrote:
>
> I'm trying to OCR some PDFs I have, and it's mostly successful - I'm using
> GhostScript to convert my PDF pages into images, and I'm feeding those
> images into Tesseract, and the magic is happening.
>
> What I'm struggling with is that the position information that Tesseract
> gives me doesn't seem to allow me to position the characters for display or
> for creating a stream to insert back into my PDF file.
>
> If I want to create a PDF stream that draws the OCR'd text in place, I
> need to know where the baseline of each character is; if I want to display
> the OCR output on the Windows screen I have more flexibility, but only if I
> assume that Windows has generated the exact same font that Tesseract has
> been trained on, which is probably not safe.
>
> So I've created an image that contains the text "Will o the wisp", and the
> OCR is working well. For the "W" I get back the bounds of that glyph, and
> for that particular character it's probably safe to assume the bottom of
> the glyph is on the baseline. However, for the "p" I also get back the
> bounds of the glyph, so the bottom y-coordinate is baseline minus descent,
> which leaves me with no way to determine where its baseline is. So how do
> I draw it accurately?
>
> I can ask Windows for the bounds of each glyph in its font, and I can use
> that information to estimate the baseline in the Tesseract-generated data,
> but I'm finding that is rather inaccurate.
>
> I assume other people have solved this problem already; is there something
> obvious I'm missing?
>
> Thanks,
> Chris
>
> p.s. I realize there is software out there that will OCR PDFs and do this
> work for me - for my project, the OCR is part of a larger process and so I
> really need to have more manual control.
>
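For reference, the estimate described in the quoted message (treat the bottom of glyphs like "W" as the baseline and ignore glyphs like "p") can be sketched roughly as below. This is only an illustration, not something Tesseract provides: the box tuple layout, the DESCENDERS set, and the function name estimate_baseline are all my own assumptions, the boxes are assumed to be in image coordinates with y growing downward, and it only makes sense for a single straight, unskewed line of text. It may also be worth checking the Baseline() method on Tesseract's C++ PageIterator, which I believe reports a baseline for a line or word directly.

```python
from statistics import median

# Assumption: characters whose glyphs normally extend below the baseline.
DESCENDERS = set("gjpqy,;")


def estimate_baseline(boxes):
    """Estimate the baseline y of one straight text line.

    boxes: list of (char, left, top, right, bottom) tuples in image
    coordinates (hypothetical layout; y grows downward).
    """
    # Keep only glyphs that are expected to sit on the baseline.
    bottoms = [
        bottom
        for ch, _left, _top, _right, bottom in boxes
        if ch not in DESCENDERS and not ch.isspace()
    ]
    if not bottoms:
        raise ValueError("no non-descender glyphs to anchor the baseline")
    # The median tolerates a few outliers, e.g. quotes that float above the line.
    return median(bottoms)


# Made-up boxes for "Wisp": the "p" hangs 12 pixels below the others.
boxes = [
    ("W", 10, 20, 40, 60),
    ("i", 42, 22, 46, 60),
    ("s", 48, 34, 62, 61),
    ("p", 64, 34, 80, 72),
]
baseline_y = estimate_baseline(boxes)
```

With the baseline known, the descent of the "p" is just its bottom minus baseline_y, which is the number needed to place it correctly when drawing.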
So, have you figured it out? If so, how?