On Sep 26, 4:49 pm, Svenn Are Bjerkem <[EMAIL PROTECTED]> wrote:
> I have downloaded this package and installed it and found that the
> text extraction is more or less useless. Looking into the code and
> comparing with the PDF spec shows a very early implementation of text
> extraction. Luckily it is possible to override the text-extraction
> method in the base class without having to fiddle with the original
> code. I tried to contact the developer to offer some help on
> implementing text extraction, but he didn't answer my emails.
> --
> Svenn

Well, feel free to send any ideas or help to me! It seems simple:

1. Do a binary read.
2. Find the 'stream' and 'endstream' sections.
3. zlib.decompress() all the streams.
4. Find the BT and ET markers (Begin Text and End Text).
5. Locate the parentheses within those and string the text together.

This works great on 3 out of 10 PDF documents, but my main issue seems
to be the zlib-compressed streams. Some of them don't seem to be
FlateDecode-able (although they claim to be), or the header is somehow
incorrect. But once I get a good stream and decompress it, things are
OK from that point on.

Seriously, if you have ideas, please let me know. I'll be glad to share
what I've got so far. Not many people seem to be interested.

I'll stop adding to this thread... I don't want to beat a dead horse.
Anyone interested in helping can contact me via email.

Thanks,
Brad
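
For anyone who wants to experiment, the steps described above can be
sketched roughly as follows. This is only a minimal illustration, not
the code referred to in this thread: the function name extract_text and
the regular expressions are made up for the example, and it assumes
every stream is either Flate-compressed or can be skipped. It uses
zlib.decompressobj() rather than zlib.decompress() so that a trailing
end-of-line before 'endstream' is tolerated; streams that still fail to
inflate are simply ignored.

```python
import re
import zlib

# Hypothetical patterns for this sketch; a real parser would honour /Length
# and /Filter instead of scanning the raw bytes like this.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
TEXT_BLOCK_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# A literal string: "(" ... ")" with backslash escapes, no nested parens.
LITERAL_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)
UNESCAPE_RE = re.compile(rb"\\([()\\])")


def extract_text(pdf_bytes):
    """Return the literal strings found between BT and ET in Flate streams."""
    pieces = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj() leaves trailing bytes in .unused_data
            # instead of treating them as part of the stream.
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (or a damaged header): skip it
        for block in TEXT_BLOCK_RE.finditer(data):
            for literal in LITERAL_RE.finditer(block.group(1)):
                pieces.append(UNESCAPE_RE.sub(rb"\1", literal.group(1)))
    # latin-1 maps every byte to a character, so decoding never fails.
    return b" ".join(pieces).decode("latin-1")


if __name__ == "__main__":
    content = b"BT /F1 12 Tf (Hello) Tj (World) Tj ET"
    pdf = b"stream\n" + zlib.compress(content) + b"\nendstream\n"
    print(extract_text(pdf))
```

Note that this only handles text drawn from literal (parenthesised)
strings. Hex strings in angle brackets, nested parentheses, fonts with
custom encodings, and streams using other filters are all outside what
the sketch attempts.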