On Tue, 10 Mar 2009, Christian Heimes wrote:

I've attached an isolated testcase for you. You'll surely recognize the
make file. It's based on your make file from PyLucene. I hope you don't
mind ;)

Thank you, that is very helpful in debugging this.

But first, please do not contact me off list. Use the pylucene-dev@lucene.apache.org mailing list. Your issue is of interest to others.

The reason for the error is that you're calling one of your native extension methods, startDocument, from the PyPDFTextStripper constructor.

While this is valid Java, it violates an unstated constraint of the code generated by JCC: only after the Java constructor returns does the JCC-generated code finish initializing the object, by calling the pythonExtension(pythonObject) method.

The problem with this sequence of events is that if you call a native extension method from the constructor, the Python object that the native method needs to dispatch to is not yet set on the Java instance. In other words, inside the constructor, native extension methods such as startDocument() depend on instance state that has not been set yet. To set that state, the object has to be constructed first, so we're in a bit of a catch-22 here.
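
To make the ordering concrete, here is a minimal pure-Java sketch of it. It involves no JCC or JNI at all: the class names SimulatedExtension and OrderingDemo, the create() helper, and the handle value 42 are made up for illustration, and an ordinary method stands in for the native one. It only assumes, as described above, that the handle is attached after the constructor has returned.

```java
// Pure-Java simulation of the ordering problem; no JCC or JNI involved.
// All names here are hypothetical and exist only for this illustration.
class SimulatedExtension {
    // Stays at the Java default of 0 until pythonExtension(long) is called.
    private long pythonObject;

    // Records what the constructor observed, so it can be inspected later.
    long handleSeenInConstructor;

    SimulatedExtension() {
        // Stands in for calling a native extension method from the constructor.
        handleSeenInConstructor = startDocument();
    }

    // Same shape as the setter in PyPDFTextStripper.
    public void pythonExtension(long pythonObject) {
        this.pythonObject = pythonObject;
    }

    // Stand-in for a native method: it needs the handle in order to find
    // the Python object to dispatch to. Here it just returns the handle.
    long startDocument() {
        return pythonObject;
    }
}

class OrderingDemo {
    // Mimics the sequence described above: construct first, attach the handle second.
    static SimulatedExtension create(long handle) {
        SimulatedExtension obj = new SimulatedExtension();
        obj.pythonExtension(handle);
        return obj;
    }

    public static void main(String[] args) {
        SimulatedExtension obj = create(42L);  // 42 is an arbitrary stand-in value
        System.out.println("seen in constructor: " + obj.handleSeenInConstructor);
        System.out.println("seen afterwards:     " + obj.startDocument());
    }
}
```

In this sketch the constructor sees the unset default handle, while the same method called after create() returns sees the real one.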

It is possible to remove this constraint by changing the extension protocol such that _all_ extension class constructors take a first parameter, the 'pythonObject' long (in fact the Python instance pointer, the Python self), and assign it to the pythonObject instance variable. This is ugly though, so it needs more thought. At the very least, some code should be added to check for this condition.
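
As a rough sketch of those two ideas, again in plain Java without JCC or JNI: HandleFirstExtension shows the handle-first constructor, and GuardedExtension shows a check that fails loudly when the handle is missing. Both class names are hypothetical, treating 0 as "not set" is an assumption of the sketch, and none of this is what JCC currently generates.

```java
// Hypothetical protocol variant: the handle is the first constructor parameter.
// Plain Java stand-ins only; this is not code that JCC generates.
class HandleFirstExtension {
    private final long pythonObject;
    long handleSeenInConstructor;

    HandleFirstExtension(long pythonObject, String filename) {
        // Assign the handle before any extension method is called.
        this.pythonObject = pythonObject;
        handleSeenInConstructor = startDocument();
    }

    // Stand-in for a native extension method; returns the handle it would use.
    long startDocument() {
        return pythonObject;
    }
}

// Current protocol (handle attached after construction) plus a guard.
// Assumption of this sketch: a handle value of 0 means "not set yet".
class GuardedExtension {
    private long pythonObject;

    public void pythonExtension(long pythonObject) {
        this.pythonObject = pythonObject;
    }

    long startDocument() {
        if (pythonObject == 0L) {
            throw new IllegalStateException(
                "extension method called before pythonExtension(long)");
        }
        return pythonObject;
    }
}

class ProtocolDemo {
    public static void main(String[] args) {
        HandleFirstExtension a = new HandleFirstExtension(7L, "warp.pdf");
        System.out.println("handle-first constructor saw: " + a.handleSeenInConstructor);

        GuardedExtension b = new GuardedExtension();
        try {
            b.startDocument();  // handle not attached yet
        } catch (IllegalStateException e) {
            System.out.println("guard triggered: " + e.getMessage());
        }
    }
}
```

Even with the handle-first constructor, a superclass constructor runs before the assignment, so an extension method invoked from there would still see an unset handle.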

In the meantime, the workaround is simple: move the offending code to its own method and call it after the constructor returns. I attached the modified PyPDFTextStripper.java class and test case that now work.

Andi..


$ python2.5 tests/test_textstripper.py
Loading: /home/heimes/software/misc/pdfbox/pdfbox-0.8.0/test/input/warp.pdf
E
======================================================================
ERROR: test_subclass (__main__.TestTextStripper)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "tests/test_textstripper.py", line 24, in test_subclass
   Stripper(PDF)
SystemError: NULL result without error in PyObject_Call

----------------------------------------------------------------------
Ran 1 test in 0.264s

FAILED (errors=1)

package de.semantics.pdfbox;

import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class PyPDFTextStripper extends PDFTextStripper {

    private PDDocument document;
    private long pythonObject;
    
    public PyPDFTextStripper(String filename)
        throws IOException
    {
        System.out.println( "Loading: " + filename );
        document = PDDocument.load(filename);
        System.out.println( "PDDocument.load(filename);");
    }

    public void process()
        throws IOException
    {
        List allPages = document.getDocumentCatalog().getAllPages();
        System.out.println( "document.getDocumentCatalog().getAllPages();");
        startDocument(document);
        System.out.println( "startDocument(document);");
        for( int i=0; i<allPages.size(); i++ )
        {
            System.out.println( "page " + i);
            PDPage page = (PDPage)allPages.get( i );
            System.out.println( "Processing page: " + i );
            PDStream contents = page.getContents();
            if( contents != null )
            {
                processStream(page, page.findResources(),
                              page.getContents().getStream());
            }
        }        
    }

    public void pythonExtension(long pythonObject)
    {
        this.pythonObject = pythonObject;
    }
    
    public long pythonExtension()
    {
        return this.pythonObject;
    }

    public void finalize()
        throws Throwable
    {
        pythonDecRef();
    }

    public native void pythonDecRef();
    
    //public native void endArticle();
    
    /*public native void startDocument(PDDocument pdf);
    public native void endDocument(PDDocument pdf );
    public native void startArticle(boolean isltr);
    public native void endArticle();
    public native void startPage(PDPage page);
    public native void endPage(PDPage page);
    public native void writePageSeperator();
    public native void writeLineSeparator();
    public native void writeWordSeparator();
    public native void writeCharacters(TextPosition text);
    public native void writeString(String text);
    */
    public native void processTextPosition( TextPosition text );
    
    public native void startDocument(PDDocument pdf);
    public native void startPage(PDPage page);
    
}

import os
import unittest

import pdfbox

HERE = os.path.dirname(os.path.abspath(__file__))
PDF = os.path.abspath(os.path.join(HERE, os.pardir, 
                                   "pdfbox-0.8.0/test/input/warp.pdf"))


class Stripper(pdfbox.PyPDFTextStripper):

    def processTextPosition(self, text):
        print text
    
    def startDocument(self, pdf):
        print pdf
        
    def startPage(self, page):
        print page


class TestTextStripper(unittest.TestCase):
    def test_subclass(self):
        stripper = Stripper(PDF)
        stripper.process()


if __name__ == "__main__":
    pdfbox.initVM(classpath=pdfbox.CLASSPATH, vmargs='-Djava.awt.headless=true')
    unittest.main()
