I have a very crude Python script that extracts text from some (and I emphasize some) PDF documents. On many PDF docs it cannot extract any text, but that is because I'm doing something wrong. The PDF spec is large and complex, and there are various ways to store and encode text. I wanted to post here and ask whether anyone is interested in helping make the script better, meaning it should accurately extract text from almost any PDF file, not just some.
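To give a rough idea of one common case, here is a minimal sketch. It is not the script mentioned above. It assumes page content streams are either uncompressed or FlateDecode (zlib) compressed, and that text is shown with literal strings passed to the Tj and TJ operators. The names extract_text and unescape are made up for illustration. It ignores font encodings, ToUnicode maps, hex strings, octal escapes, and other filters, which is exactly where many real documents will fail.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string operand: ( ... ) with backslash escapes, no nested parens.
STRING_RE = re.compile(rb"\((?:\\.|[^\\()])*\)", re.DOTALL)
# A literal string followed by Tj, or an array followed by TJ.
# Simplification: a "]" inside a string within an array will confuse this.
SHOW_RE = re.compile(
    rb"(\((?:\\.|[^\\()])*\)|\[[^\]]*\])\s*(?:Tj|TJ)\b", re.DOTALL
)

# Only the simplest escapes are handled; octal escapes are not.
ESCAPES = {b"n": b"\n", b"r": b"\r", b"t": b"\t"}


def unescape(data):
    """Resolve backslash escapes inside a PDF literal string (simplified)."""
    return re.sub(
        rb"\\(.)",
        lambda m: ESCAPES.get(m.group(1), m.group(1)),
        data,
        flags=re.DOTALL,
    )


def extract_text(pdf_bytes):
    """Return the text shown by Tj/TJ operators, one chunk per operator."""
    chunks = []
    for stream in STREAM_RE.finditer(pdf_bytes):
        raw = stream.group(1)
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            content = zlib.decompressobj().decompress(raw)
        except zlib.error:
            content = raw  # not zlib data; try it as an uncompressed stream
        for op in SHOW_RE.finditer(content):
            parts = [
                unescape(s.group(0)[1:-1])
                for s in STRING_RE.finditer(op.group(1))
            ]
            chunks.append(b"".join(parts))
    # Assumption: bytes map directly to Latin-1 characters, which is only
    # true for simple fonts with standard encodings.
    return b" ".join(chunks).decode("latin-1")
```

Calling extract_text(open("some.pdf", "rb").read()) would return whatever text this simple approach can find; an empty result usually means the document stores its text some other way.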
I know the topic of reading/extracting the text from a PDF document natively in Python comes up every now and then on comp.lang.python. I've posted about it in the past myself. After searching for other solutions, I've resorted to attempting this on my own in my spare time. Using apps external to Python (pdftotext, etc.) is not really an option for me. If someone knows of a free native Python app that does this now, let me know and I'll use that instead!

So, if other more experienced programmers are interested in helping make the script better, please let me know. I can host a website with the latest revision and do all of the grunt work.

Thanks,
Brad
--
http://mail.python.org/mailman/listinfo/python-list