On Tue, 10 Mar 2009, Christian Heimes wrote:
> I've attached an isolated testcase for you. You'll surely recognize the
> make file. It's based on your make file from PyLucene. I hope you don't
> mind ;)
Thank you, that is very helpful in debugging this.
But first, please do not contact me off-list. Use the
pylucene-dev@lucene.apache.org mailing list instead; your issue is of
interest to others.
The reason for the error is that you're calling one of your native extension
methods, startDocument, from the PyPDFTextStripper constructor.
While this is valid Java, it violates an unstated constraint of the code
generated by JCC: after the Java constructor returns, JCC-generated code
finishes initializing the object by calling the
pythonExtension(pythonObject) method.
The problem with this sequence of events is that if you call a native
extension method from the constructor, the Python object on which that
native method needs to call a method is not yet set on the Java instance.
In other words, inside the constructor, native extension methods such as
startDocument() depend on instance state that has not been set yet.
In order to set that state, the object has to be constructed first, so we're
in a bit of a catch-22 here.
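
The ordering problem can be sketched in plain Python. This is only an analogy, not JCC code: JavaSideObject, PythonSide, python_extension and construct are hypothetical names standing in for the Java extension class, the Python subclass, pythonExtension(long) and the JCC-generated wrapper. In the real case the failure surfaces as the SystemError in the test output rather than an AttributeError.

```python
class PythonSide:
    """Stand-in for the Python subclass that receives the callbacks."""

    def __init__(self):
        self.seen = []

    def start_document(self, doc):
        self.seen.append(doc)
        return doc


class JavaSideObject:
    """Stand-in for the Java extension class (hypothetical, not JCC code)."""

    def __init__(self, call_hook_in_constructor):
        # Plays the role of the 'pythonObject' field: unset during construction.
        self.python_object = None
        if call_hook_in_constructor:
            # Too early: nothing has been attached yet.
            self.start_document("doc")

    def python_extension(self, python_object):
        # Plays the role of pythonExtension(long): attaches the Python instance.
        self.python_object = python_object

    def start_document(self, doc):
        # Plays the role of a native extension method dispatching to Python.
        return self.python_object.start_document(doc)


def construct(call_hook_in_constructor):
    """Mimic the order described above: constructor first, then attach."""
    py = PythonSide()
    java = JavaSideObject(call_hook_in_constructor)  # step 1: constructor runs
    java.python_extension(py)                        # step 2: state is set
    return java, py
```

With construct(False) the object is usable once construction has finished; construct(True) fails inside step 1, before step 2 has had a chance to run.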
It is possible to remove this constraint by changing the extension protocol
such that _all_ extension class constructors take the 'pythonObject' long
(in fact, the Python instance pointer, the Python self) as their first
parameter and assign it to the pythonObject instance variable. This is ugly
though, so it needs more thought. At the very least, some code should be
added to check for this condition.
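
A rough sketch of both ideas, again as a plain-Python analogy rather than anything JCC does today: the hypothetical CheckedExtension takes the Python instance as its first constructor argument, and its extension method checks that the instance is set before dispatching. Recorder is likewise a made-up stand-in for the Python subclass.

```python
class Recorder:
    """Stand-in for the Python subclass that receives the callbacks."""

    def __init__(self):
        self.docs = []

    def start_document(self, doc):
        self.docs.append(doc)


class CheckedExtension:
    """Hypothetical extension class using the 'first parameter' protocol."""

    def __init__(self, python_object, filename):
        # The Python instance is stored before any other constructor code runs,
        # so extension methods may be called during construction.
        self.python_object = python_object
        self.start_document(filename)

    def start_document(self, doc):
        # The suggested check: fail with a clear message if the state is missing.
        if self.python_object is None:
            raise RuntimeError(
                "start_document called before the Python object was set")
        return self.python_object.start_document(doc)
```

The cost is visible in the signature: every constructor gains an extra leading argument that has nothing to do with the class itself.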
In the meantime, the workaround is simple: move the offending code to its
own method and call it after the constructor returns.
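
In outline, the workaround is a two-phase pattern. The sketch below is a plain-Python analogy with hypothetical names (TwoPhaseStripper, Callbacks, make_stripper); it only illustrates the order of the calls, not the actual PDFBox processing.

```python
class Callbacks:
    """Stand-in for the Python subclass that receives the callbacks."""

    def __init__(self):
        self.started = []

    def start_document(self, doc):
        self.started.append(doc)


class TwoPhaseStripper:
    """Hypothetical extension class: cheap constructor, work done in process()."""

    def __init__(self, filename):
        # Phase 1: only store state; no extension methods are called here.
        self.filename = filename
        self.python_object = None

    def python_extension(self, python_object):
        # Attached by the wrapper once the constructor has returned.
        self.python_object = python_object

    def process(self):
        # Phase 2: extension methods are safe to call now.
        self.python_object.start_document(self.filename)
        return "processed " + self.filename


def make_stripper(filename, callbacks):
    stripper = TwoPhaseStripper(filename)  # constructor returns first
    stripper.python_extension(callbacks)   # then the Python side is attached
    return stripper
```

The caller creates the object and then calls process() explicitly, which is the same shape as the modified test case.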
I attached the modified PyPDFTextStripper.java class and test case that now
work.
Andi..
$ python2.5 tests/test_textstripper.py
Loading: /home/heimes/software/misc/pdfbox/pdfbox-0.8.0/test/input/warp.pdf
E
======================================================================
ERROR: test_subclass (__main__.TestTextStripper)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_textstripper.py", line 24, in test_subclass
    Stripper(PDF)
SystemError: NULL result without error in PyObject_Call

----------------------------------------------------------------------
Ran 1 test in 0.264s

FAILED (errors=1)
package de.semantics.pdfbox;

import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class PyPDFTextStripper extends PDFTextStripper {

    private PDDocument document;
    private long pythonObject;

    // Only load the document here. Native extension methods must not be
    // called from the constructor: pythonObject is not set until
    // pythonExtension(long) is called after the constructor returns.
    public PyPDFTextStripper(String filename)
        throws IOException
    {
        System.out.println( "Loading: " + filename );
        document = PDDocument.load(filename);
        System.out.println( "PDDocument.load(filename);");
    }

    // Moved out of the constructor. Call this after construction, once
    // pythonObject has been set.
    public void process()
        throws IOException
    {
        List allPages = document.getDocumentCatalog().getAllPages();
        System.out.println( "document.getDocumentCatalog().getAllPages();");
        startDocument(document);
        System.out.println( "startDocument(document);");
        for( int i=0; i<allPages.size(); i++ )
        {
            System.out.println( "page " + i);
            PDPage page = (PDPage)allPages.get( i );
            System.out.println( "Processing page: " + i );
            PDStream contents = page.getContents();
            if( contents != null )
            {
                processStream(page, page.findResources(),
                              page.getContents().getStream());
            }
        }
    }

    // Called by JCC-generated code after the constructor returns.
    public void pythonExtension(long pythonObject)
    {
        this.pythonObject = pythonObject;
    }

    public long pythonExtension()
    {
        return this.pythonObject;
    }

    public void finalize()
        throws Throwable
    {
        pythonDecRef();
    }

    public native void pythonDecRef();

    //public native void endArticle();
    /*public native void startDocument(PDDocument pdf);
    public native void endDocument(PDDocument pdf );
    public native void startArticle(boolean isltr);
    public native void endArticle();
    public native void startPage(PDPage page);
    public native void endPage(PDPage page);
    public native void writePageSeperator();
    public native void writeLineSeparator();
    public native void writeWordSeparator();
    public native void writeCharacters(TextPosition text);
    public native void writeString(String text);
    */

    // Native extension methods, implemented by the Python subclass.
    public native void processTextPosition( TextPosition text );
    public native void startDocument(PDDocument pdf);
    public native void startPage(PDPage page);
}
import os
import unittest

import pdfbox

HERE = os.path.dirname(os.path.abspath(__file__))
PDF = os.path.abspath(os.path.join(HERE, os.pardir,
                                   "pdfbox-0.8.0/test/input/warp.pdf"))


class Stripper(pdfbox.PyPDFTextStripper):

    def processTextPosition(self, text):
        print text

    def startDocument(self, pdf):
        print pdf

    def startPage(self, page):
        print page


class TestTextStripper(unittest.TestCase):

    def test_subclass(self):
        stripper = Stripper(PDF)
        stripper.process()


if __name__ == "__main__":
    pdfbox.initVM(classpath=pdfbox.CLASSPATH, vmargs='-Djava.awt.headless=true')
    unittest.main()