On Wednesday, September 11, 2013 at 11:14:22 PM UTC+8, ch...@sc3.net wrote:
>
> I'm trying to OCR some PDFs I have, and it's mostly successful - I'm using 
> GhostScript to convert my PDF pages into images, and I'm feeding those 
> images into Tesseract, and the magic is happening.
>
> What I'm struggling with is that the position information that Tesseract 
> gives me doesn't seem to allow me to position the characters for display or 
> for creating a stream to insert back into my PDF file.
>
> If I want to create a PDF stream that draws the OCR'd text in place, I 
> need to know where the baseline of each character is; if I want to display 
> the OCR output on the Windows screen I have more flexibility but only if I 
> assume that Windows has generated the exact same font that Tesseract has 
> been trained on, which is probably not safe.
>
> So I've created an image that contains the text "Will o the wisp", and the 
> OCR is working well.  For the "W" I get back the bounds of that glyph, and 
> for that particular character it's probably safe to assume the bottom of 
> the glyph is on the baseline.  However, for the "p" I also get back the 
> bounds of the glyph, so the bottom y-coordinate is baseline minus descent, 
> which leaves me with no way to determine where its baseline is.  So how do 
> I draw it accurately?
>
> I can ask Windows for the bounds of each glyph in its font, and I can use 
> that information to estimate the baseline in the Tesseract-generated data, 
> but I'm finding that is rather inaccurate.
>
> I assume other people have solved this problem already, is there something 
> obvious I'm missing?
>
> Thanks,
> Chris
>
> p.s. I realize there is software out there that will OCR PDFs and do this 
> work for me - for my project, the OCR is part of a larger process and so I 
> really need to have more manual control.
>


So, have you figured it out? And if so, how?
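
For what it's worth, one approach that might work for the "p" case above is to ignore descender glyphs and estimate the line's baseline from the bottoms of the remaining glyphs. The sketch below is only an illustration, not something from Tesseract itself: the function name, the descender set, and the sample coordinates are all made up, and it assumes box-file style coordinates (origin at the bottom-left, y increasing upwards) and a horizontal, unskewed Latin text line. Depending on your Tesseract version, the page iterator in the C++ API may also offer a Baseline() call, which would probably be more reliable than this.

```python
from statistics import median

# Assumption: Latin-script characters whose glyphs usually drop below the baseline.
DESCENDERS = set("gjpqy,;")

def estimate_baseline(boxes):
    """Estimate the baseline y of one text line.

    boxes: list of (char, left, bottom, right, top) tuples in box-file
    coordinates (origin bottom-left). Hypothetical helper, not a Tesseract API.
    """
    # Keep only glyphs that are expected to sit on the baseline.
    bottoms = [bottom for ch, _left, bottom, _right, _top in boxes
               if ch not in DESCENDERS and not ch.isspace()]
    if not bottoms:
        raise ValueError("no non-descender glyphs to estimate the baseline from")
    # The median tolerates a few noisy or mis-segmented boxes.
    return median(bottoms)

# Made-up boxes for part of "Will o the wisp".
boxes = [
    ("W", 10, 20, 40, 52),
    ("i", 42, 20, 48, 52),
    ("l", 50, 20, 55, 53),
    ("l", 57, 20, 62, 53),
    ("p", 200, 11, 222, 42),
]
baseline = estimate_baseline(boxes)
descent_of_p = baseline - boxes[-1][2]  # how far the "p" drops below the baseline
print(baseline, descent_of_p)
```

With a per-line baseline in hand, each character can be drawn at its left x-coordinate and the shared baseline y, instead of at the bottom of its own bounding box. For skewed scans a straight-line fit through the same points would probably be needed instead of a single median value.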
