Hi all! I'm trying to extract some specially formatted text from a PDF, and it seems impossible to use PDFTextStripper for this task. In particular, some of the font style (bold, italic, etc.) and color information is semantically relevant, and what counts as a "paragraph" depends on this information.
What would be ideal is a way to have a callback of mine invoked for each glyph on the page, receiving its font, color, size, the glyph itself, and its location in translated / simple page coordinates. Is there a way to do something like that? I've looked at some of the classes that PDFTextStripper derives from, but it's not clear to me how they work, and they seem to expose TOO much information rather than a simple view of the characters / text themselves. Can anyone provide a suggestion?

Thanks,
David