On Thursday 25 January 2007 22:05, tubby wrote:

> I know this question comes up a lot, so here goes again. I want to read
> text from a PDF file, run re searches on the text, etc. I do not care
> about layout, fonts, borders, etc. I just want the text. I've been
> reading Adobe's PDF Reference Guide and I'm beginning to develop a
> better understanding of PDF in general, but I need a bit of help... this
> seems like it should be easier than it is.

It _seems_ that way. ;-)

One of the more promising suggestions for solving this came up in a
comp.lang.python thread last year:

http://groups.google.com/group/comp.lang.python/msg/cb6c97a44ce4cbe9?dmode=source

Basically, if you have access to the pdftotext command on a system that
supports xpdf, you should be able to get something reasonable out of a
PDF file.

> I know the text is compressed... that it would have stream and endstream
> markers and BT (Begin Text) and ET (End Text) and that the uncompressed
> text is enclosed in parentheses (this is my text). Has anyone here done
> this in a simple fashion? I've played with the pyPdf library some, but
> it seems overly complex for my needs (merge PDFs, write PDFs, etc). I
> just want a simple PDF text extractor.

The pdftotext tool may do what you want:

http://www.foolabs.com/xpdf/download.html

Let us know how you get on with it.

David
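
P.S. Here is a rough sketch of how the pdftotext approach might be wrapped
from Python, in case it helps. It is only an illustration, not a tested
recipe: it assumes a modern Python 3 (for subprocess.run and shutil.which),
that pdftotext is on your PATH, and that UTF-8 output suits your documents.
The function names pdf_to_text and search_text are made up for this example.

```python
import re
import shutil
import subprocess


def pdf_to_text(pdf_path, pdftotext="pdftotext"):
    """Return the text of pdf_path as a string by running pdftotext."""
    # Fail early with a clear message if the tool is not installed.
    if shutil.which(pdftotext) is None:
        raise RuntimeError("%s not found on PATH" % pdftotext)
    # "-" as the output file asks pdftotext to write to stdout;
    # "-enc UTF-8" requests UTF-8 output (an assumption about your data).
    result = subprocess.run(
        [pdftotext, "-enc", "UTF-8", pdf_path, "-"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


def search_text(text, pattern):
    """Return all matches of a regular expression in the extracted text."""
    return re.findall(pattern, text)
```

Something like search_text(pdf_to_text("report.pdf"), r"\d{4}-\d{2}-\d{2}")
would then pull date-like strings out of a (hypothetical) report.pdf. Since
layout is discarded, matches that span line breaks in the original page may
need a more forgiving pattern.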