Re: Wrapping PDFBox with JCC

Christian Heimes Mon, 09 Mar 2009 14:49:39 -0700

Andi Vajda wrote:
> > After both these fixes, I was able to build wrappers for pdfbox:
> >
> >   >>> from pdfbox import *
> >   >>> initVM(CLASSPATH, vmargs='-Djava.awt.headless=true')
> >   <jcc.JCCEnv object at 0x295c0>
> >   >>>
> >
> > This is all checked into rev 751772.
> >
> > Please let me know if this works for you, I'd like to get a PyLucene
> > 2.4.1 release started now that Java Lucene 2.4.1 has been released. If I
> > broke something while doing these non-trivial fixes, now is the time to
> > find out.


Thanks Andi!

I was able to build a pdfbox wrapper with your changes, too. The changes
to setup.py make it much easier to get the script working. Good work!

As a JCC and Java newbie I didn't understand the difference between
--jar, --include and --classpath at first. Could you please extend the
README in order to explain the three options?
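
To make the question concrete, here is a small sketch of how I currently
read the three options, written as a Python helper that only assembles a
JCC command line. The helper name jcc_args and all jar file names are made
up for illustration, and the comments reflect my understanding of the
options rather than anything authoritative:

```python
import sys


def jcc_args(wrap_jars, include_jars=(), classpath=(), module="pdfbox"):
    """Assemble an argument list for `python -m jcc` (hypothetical helper)."""
    args = [sys.executable, "-m", "jcc"]
    for jar in wrap_jars:
        # --jar: generate wrappers for the public classes in this jar
        args += ["--jar", jar]
    for jar in include_jars:
        # --include: ship this jar with the extension, without wrapping it
        args += ["--include", jar]
    for entry in classpath:
        # --classpath: extra entries JCC needs to load classes while building
        args += ["--classpath", entry]
    # --python names the generated extension module; --build compiles it
    args += ["--python", module, "--build"]
    return args


cmd = jcc_args(
    wrap_jars=["pdfbox.jar", "pypdfstripper.jar"],  # placeholder names
    include_jars=["fontbox.jar"],                   # placeholder name
    classpath=["bcprov.jar"],                       # placeholder name
)
print(" ".join(cmd[1:]))
```

If that reading is wrong, a corrected version of these comments is more or
less what I would hope to find in the README.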

Today I started playing with subclassable Python wrappers, but I couldn't
get the example below to work. I ran into several issues, such as
"SystemError: NULL result without error in PyObject_Call". Could you
have a look, please? The jar containing PyPDFTextStripper was wrapped
together with the pdfbox jar.

public class PyPDFTextStripper extends PDFTextStripper {

        private PDDocument document;
        private long pythonObject;
        
        public PyPDFTextStripper(String filename) throws IOException
        {
                System.out.println( "Loading: " + filename );
                document = PDDocument.load(filename);
                List allPages = document.getDocumentCatalog().getAllPages();
                startDocument(document);
                for( int i=0; i<allPages.size(); i++ )
                {
                    PDPage page = (PDPage)allPages.get( i );
                    System.out.println( "Processing page: " + i );
                    PDStream contents = page.getContents();
                    if( contents != null )
                    {
                        processStream(page, page.findResources(),
                                      page.getContents().getStream());
                    }
                }
        }

        public void pythonExtension(long pythonObject)
        {
                this.pythonObject = pythonObject;
        }

        public long pythonExtension()
        {
                return this.pythonObject;
        }

        public void finalize()
                throws Throwable
        {
                pythonDecRef();
        }

        public native void pythonDecRef();

        public native void processTextPosition( TextPosition text );
        public native void startDocument(PDDocument pdf);
        public native void startPage(PDPage page);
}


import pdfbox

pdfbox.initVM(classpath=pdfbox.CLASSPATH)

class Stripper(pdfbox.PyPDFTextStripper):
    """
    """
    def processTextPosition(self, text):
        print text

    def startDocument(self, doc):
        print doc

    def startArticle(self, isltr):
        print isltr
