Extracting text at a lower level than PDFTextStripper provides?

Dave Trombley Tue, 02 May 2023 06:29:23 -0700

Hi all!

I'm trying to extract some specially formatted text from a PDF, and it
seems like it will be impossible to use PDFTextStripper for this task.  In
particular, some of the font style (bold, italic, etc.) and color
information is semantically relevant, and what counts as a "paragraph"
depends on this information.


What would be ideal is if there were a way to have a callback of mine
called for each glyph on the page, containing its font, color, size, glyph,
and location in translated / simple page coordinates.  Is there a way to do
something like that?
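
To make the request concrete, here is a rough sketch of the shape I have in
mind.  Every name below (GlyphInfo, GlyphCallback, StyleRunCollector) is
hypothetical and made up for illustration; none of them are PDFBox classes,
and I am assuming the glyphs would arrive in reading order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-glyph data; not a PDFBox type. */
final class GlyphInfo {
    final String text;      // the glyph as Unicode text
    final String fontName;  // e.g. "Helvetica-Bold"
    final float fontSize;   // in points
    final int rgb;          // fill color packed as 0xRRGGBB
    final float x;          // position in simple page coordinates
    final float y;

    GlyphInfo(String text, String fontName, float fontSize, int rgb, float x, float y) {
        this.text = text;
        this.fontName = fontName;
        this.fontSize = fontSize;
        this.rgb = rgb;
        this.x = x;
        this.y = y;
    }
}

/** Hypothetical callback, invoked once for each glyph on the page. */
interface GlyphCallback {
    void onGlyph(GlyphInfo glyph);
}

/** Example consumer: starts a new run whenever font name, size, or color changes. */
class StyleRunCollector implements GlyphCallback {
    private final List<String> finished = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private GlyphInfo previous;

    @Override
    public void onGlyph(GlyphInfo g) {
        // Close the open run if the style differs from the previous glyph.
        if (previous != null && !sameStyle(previous, g) && current.length() > 0) {
            finished.add(current.toString());
            current.setLength(0);
        }
        current.append(g.text);
        previous = g;
    }

    private static boolean sameStyle(GlyphInfo a, GlyphInfo b) {
        return a.fontName.equals(b.fontName)
                && a.fontSize == b.fontSize
                && a.rgb == b.rgb;
    }

    /** Returns every run seen so far, including the one still open. */
    List<String> runs() {
        List<String> out = new ArrayList<>(finished);
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }
}
```

With something like this, my own paragraph logic could live entirely in the
callback, and the only missing piece is whatever would drive onGlyph() from
the page content.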

I've looked at some of the classes that PDFTextStripper derives from, but
it's not clear to me how these work and they seem to have TOO much
information, not at all a simple view of the characters / text
themselves.

Can anyone provide a suggestion?

Thanks,
  David
