Re: highlight words by regex in pdf files using python

TP Thu, 18 Mar 2010 12:43:03 -0700

On Wed, Mar 17, 2010 at 7:53 AM, Peng Yu <[email protected]> wrote:
> On Tue, Mar 16, 2010 at 11:12 PM, Patrick Maupin <[email protected]> wrote:
>> On Mar 4, 6:57 pm, Peng Yu <[email protected]> wrote:
>>> I don't find a general pdf library in python that can do any
>>> operations on pdfs.
>>>
>>> I want to automatically highlight certain words (using regex) in a
>>> pdf. Could somebody let me know if there is a tool to do so in python?
>>
>> The problem with PDFs is that they can be quite complicated.  There is
>> the outer container structure, which isn't too bad (unless the
>> document author applied encryption or fancy multi-object compression),
>> but then inside the graphics elements, things could be stored as
>> regular ASCII, or as fancy indexes into font-specific tables.  Not
>> rocket science, but the only industrial-strength solution for this is
>> probably reportlab's pagecatcher.
>>
>> I have a library which works (primarily with the outer container) for
>> reading and writing, called pdfrw.  I also maintain a list of other
>> PDF tools at http://code.google.com/p/pdfrw/wiki/OtherLibraries  It
>> may be that pdfminer (link on that page) will do what you want -- it
>> is certainly trying to be complete as a PDF reader.  But I've never
>> personally used pdfminer.
>>
>> One of my pdfrw examples at http://code.google.com/p/pdfrw/wiki/ExampleTools
>> will read in preexisting PDFs and write them out to a reportlab
>> canvas.  This works quite well on a few very simple ASCII PDFs, but
>> the font handling needs a lot of work and probably won't work at all
>> right now on unicode.  (But if you wanted to improve it, I certainly
>> would accept patches or give you commit rights!)
>>
>> That pdfrw example does graphics reasonably well.  I was actually
>> going down that path for getting better vector graphics into rst2pdf
>> (both uniconvertor and svglib were broken for my purposes), but then I
>> realized that the PDF spec allows you to include a page from another
>> PDF quite easily (the spec calls it a form xObject), so you don't
>> actually need to parse down into the graphics stream for that.  So,
>> right now, the best way to do vector graphics with rst2pdf is either
>> to give it a preexisting PDF (which it passes off to pdfrw for
>> conversion into a form xObject), or to give it a .svg file and invoke
>> it with -e inkscape, and then it will use inkscape to convert the svg
>> to a pdf and then go through the same path.
>
> Thank you for your long reply! But I'm not sure if you get my question or not.
>
> Acrobat can highlight certain words in pdfs. I could add notes to the
> highlighted words as well. However, I find that I frequently end up
> with highlighting some words that can be expressed by a regular
> expression.
>
> To improve my productivity, I don't want to do this manually in Acrobat
> but rather do it in an automatic way, if there is such a tool
> available. People on the reportlab mailing list said this is not possible
> with reportlab. And I don't see that PyPDF can do this. If you know of
> an API for this purpose, please let me know. Thank you!
>
> Regards,
> Peng
> --
> http://mail.python.org/mailman/listinfo/python-list
>


Take a look at the Acrobat SDK
(http://www.adobe.com/devnet/acrobat/?view=downloads). In particular
see the Acrobat Interapplication Communication information at
http://www.adobe.com/devnet/acrobat/interapplication_communication.html.

"Spell-checking a document" shows how to spell-check a PDF using
Visual Basic at
http://livedocs.adobe.com/acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat9_HTMLHelp&file=IAC_DevApp_OLE_Support.100.17.html

"Working with annotations" shows how to add an annotation with Visual
Basic at
http://livedocs.adobe.com/acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat9_HTMLHelp&file=IAC_DevApp_OLE_Support.100.16.html.

Presumably combining the two examples with Python's win32com should
allow you to do what you want.
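
As a rough, untested sketch of the word-scanning half of that idea:
the function below walks every word on every page and keeps the ones
that match a regex. It only relies on two methods,
getPageNumWords(page) and getPageNthWord(page, n), which are the names
the Acrobat JavaScript Doc object uses. The helper names
find_matching_words and open_acrobat_doc are made up for illustration,
and the COM part assumes Windows, pywin32 and a full Acrobat install
(not just Reader). Creating the highlight annotation for each hit is
left out; it would follow the "Working with annotations" example.

```python
import re


def find_matching_words(doc, pattern, num_pages):
    """Return (page, word_index, word) for every word matching pattern.

    `doc` is any object exposing getPageNumWords(page) and
    getPageNthWord(page, n); with Acrobat this would be the JavaScript
    object obtained through COM (see open_acrobat_doc below).
    """
    regex = re.compile(pattern)
    hits = []
    for page in range(num_pages):
        # COM may hand back the count as a float, so coerce to int.
        for n in range(int(doc.getPageNumWords(page))):
            word = str(doc.getPageNthWord(page, n))
            if regex.search(word):
                hits.append((page, n, word))
    return hits


def open_acrobat_doc(path):
    """Hypothetical helper: open a PDF through Acrobat's COM interface.

    Assumption: Windows, pywin32 and Acrobat are installed. Not tested
    here; imported lazily so the function above works without pywin32.
    """
    import win32com.client

    pd_doc = win32com.client.Dispatch("AcroExch.PDDoc")
    if not pd_doc.Open(path):
        raise IOError("Acrobat could not open %r" % path)
    return pd_doc, pd_doc.GetJSObject(), pd_doc.GetNumPages()
```

With Acrobat available, usage would presumably be along the lines of
pd_doc, jso, pages = open_acrobat_doc(r"C:\some\file.pdf") followed by
find_matching_words(jso, r"colou?r", pages). Each hit carries the page
and word index that the annotation step needs in order to look up the
word's position on the page.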
