On Wed, Mar 17, 2010 at 7:53 AM, Peng Yu <pengyu...@gmail.com> wrote:
> On Tue, Mar 16, 2010 at 11:12 PM, Patrick Maupin <pmau...@gmail.com> wrote:
>> On Mar 4, 6:57 pm, Peng Yu <pengyu...@gmail.com> wrote:
>>> I don't find a general pdf library in python that can do any
>>> operations on pdfs.
>>>
>>> I want to automatically highlight certain words (using regex) in a
>>> pdf. Could somebody let me know if there is a tool to do so in python?
>>
>> The problem with PDFs is that they can be quite complicated. There is
>> the outer container structure, which isn't too bad (unless the
>> document author applied encryption or fancy multi-object compression),
>> but then inside the graphics elements, things could be stored as
>> regular ASCII, or as fancy indexes into font-specific tables. Not
>> rocket science, but the only industrial-strength solution for this is
>> probably reportlab's pagecatcher.
>>
>> I have a library which works (primarily with the outer container) for
>> reading and writing, called pdfrw. I also maintain a list of other
>> PDF tools at http://code.google.com/p/pdfrw/wiki/OtherLibraries It
>> may be that pdfminer (link on that page) will do what you want -- it
>> is certainly trying to be complete as a PDF reader. But I've never
>> personally used pdfminer.
>>
>> One of my pdfrw examples at http://code.google.com/p/pdfrw/wiki/ExampleTools
>> will read in preexisting PDFs and write them out to a reportlab
>> canvas. This works quite well on a few very simple ASCII PDFs, but
>> the font handling needs a lot of work and probably won't work at all
>> right now on unicode. (But if you wanted to improve it, I certainly
>> would accept patches or give you commit rights!)
>>
>> That pdfrw example does graphics reasonably well. I was actually
>> going down that path for getting better vector graphics into rst2pdf
>> (both uniconvertor and svglib were broken for my purposes), but then I
>> realized that the PDF spec allows you to include a page from another
>> PDF quite easily (the spec calls it a form xObject), so you don't
>> actually need to parse down into the graphics stream for that. So,
>> right now, the best way to do vector graphics with rst2pdf is either
>> to give it a preexisting PDF (which it passes off to pdfrw for
>> conversion into a form xObject), or to give it a .svg file and invoke
>> it with -e inkscape, and then it will use inkscape to convert the svg
>> to a pdf and then go through the same path.
>
> Thank you for your long reply! But I'm not sure if you get my question or not.
>
> Acrobat can highlight certain words in pdfs. I could add notes to the
> highlighted words as well. However, I find that I frequently end up
> with highlighting some words that can be expressed by a regular
> expression.
>
> To improve my productivity, I don't want do this manually in Acrobat
> but rather do it in an automatic way, if there is such a tool
> available. People in reportlab mailing list said this is not possible
> with reportlab. And I don't see PyPDF can do this. If you know there
> is an API to for this purpose, please let me know. Thank you!
>
> Regards,
> Peng
> --
> http://mail.python.org/mailman/listinfo/python-list
>

Take a look at the Acrobat SDK
(http://www.adobe.com/devnet/acrobat/?view=downloads). In particular,
see the Acrobat Interapplication Communication (IAC) information at
http://www.adobe.com/devnet/acrobat/interapplication_communication.html.

"Spell-checking a document" shows how to spell-check a PDF using
Visual Basic:
http://livedocs.adobe.com/acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat9_HTMLHelp&file=IAC_DevApp_OLE_Support.100.17.html

"Working with annotations" shows how to add an annotation using
Visual Basic:
http://livedocs.adobe.com/acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat9_HTMLHelp&file=IAC_DevApp_OLE_Support.100.16.html

The first example walks through the words on each page and the second
creates annotations, so combining the two from Python via win32com
should presumably let you do what you want: test each word against
your regular expression and add a highlight annotation for each match.
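
Here is a rough, untested sketch of how combining the two examples might look from Python. It assumes Windows, a full Acrobat install (not just Reader), and the pywin32 package. The function names matching_word_indices and highlight_matches are made up for illustration. The Acrobat calls follow the pattern of the two Visual Basic examples, but I have not run them through win32com, so treat the annotation part in particular as an assumption.

```python
import re


def matching_word_indices(words, pattern):
    """Return the indices of the words that fully match the regular expression."""
    regex = re.compile(pattern)
    return [i for i, word in enumerate(words) if regex.fullmatch(word)]


def highlight_matches(pdf_path, out_path, pattern):
    """Highlight every word matching `pattern` (needs Windows + Acrobat)."""
    # Imported lazily so the regex helper above works without pywin32.
    import win32com.client

    pd_doc = win32com.client.Dispatch("AcroExch.PDDoc")
    if not pd_doc.Open(pdf_path):
        raise OSError("Acrobat could not open %s" % pdf_path)
    try:
        # Bridge to Acrobat's JavaScript document object, as in the SDK examples.
        jso = pd_doc.GetJSObject()
        for page in range(pd_doc.GetNumPages()):
            count = int(jso.getPageNumWords(page))
            # True asks Acrobat to strip punctuation and whitespace.
            words = [jso.getPageNthWord(page, n, True) for n in range(count)]
            for n in matching_word_indices(words, pattern):
                # Same shape as the Visual Basic annotation example:
                # create the annotation, then fill in its properties.
                annot = jso.addAnnot()
                props = annot.getProps()
                props.type = "Highlight"
                props.page = page
                # Assumption: the quads returned by Acrobat can be handed
                # back unchanged through COM. This is untested.
                props.quads = jso.getPageNthWordQuads(page, n)
                annot.setProps(props)
        pd_doc.Save(1, out_path)  # 1 is PDSaveFull
    finally:
        pd_doc.Close()
```

The regex part is kept in its own function so it can be tried without Acrobat, for example matching_word_indices(["foo", "bar42"], r"[a-z]+\d+"). Note that it matches one word at a time, so a pattern spanning several words would need extra work.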