On Fri, 9 Apr 2010, Alban Mouton wrote:
Hello,
I couldn't find anything specific on the web about doing this, except for
this mail:
http://www.mail-archive.com/pylucene-dev@lucene.apache.org/msg00577.html
JCC doc might be enough, but it won't hurt to add a few specifics.
Counting me, that makes at least two people who needed it, which is enough
to justify a small wiki page in my opinion:
http://redmine.djity.net/projects/pythontika/wiki
The wrapper seems to work fine, but it hasn't been tested much yet.
Thanks to the JCC developers, it's a very useful piece of software!
A guy at work asked me the same question a couple of days ago.
Not knowing Tika, I helped him with the JCC command line aspects. He used a
Tika example similar to the one you mention on your wiki page.
Lucene and Tika are a bit different in that Tika depends on a long list of
third-party libraries, which Maven helpfully downloads into the local Maven
repository in your home directory as you build Tika. Lucene is standalone.
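To locate those dependency jars without typing each path by hand, something
like the following might help. It is only a sketch, not something we ran at
the time: it assumes a modern Python 3 (not the 2.6 used below), the standard
Maven layout of .../<artifact>/<version>/<artifact>-<version>.jar, and
find_jars is a made-up helper name.

```python
from pathlib import Path


def find_jars(repo_root, artifact):
    """Return the main jars for `artifact` found under a Maven-style repository.

    Assumes the layout <repo_root>/.../<artifact>/<version>/<artifact>-<version>.jar.
    Extra jars such as -sources or -javadoc are skipped.
    """
    repo_root = Path(repo_root)
    hits = []
    for jar in repo_root.rglob(artifact + "-*.jar"):
        version = jar.parent.name
        in_artifact_dir = jar.parent.parent.name == artifact
        is_main_jar = jar.name == f"{artifact}-{version}.jar"
        if in_artifact_dir and is_main_jar:
            hits.append(jar)
    return sorted(hits)
```

For example, find_jars(Path.home() / ".m2" / "repository", "poi") would list
the poi jars of every version Maven has downloaded, without picking up
poi-ooxml or poi-scratchpad.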
The approach we took was to use --jar for the Tika core and Tika parsers
jars, which gives access to all public classes in these two jar files from
Python, and to use --include for all the other dependencies we found, so
that we avoid generating wrappers for them but still get them included in
the resulting tika egg. We also needed to allow wrapper generation for the
java.io and org.xml.sax packages by using --package, and to explicitly
request java.io.FileInputStream. I noticed you used class names with
--package there...
This is the JCC command we used for wrapping Tika 0.7 with Python 2.6.2:
python -m jcc.__main__ --shared --python tika --version 0.7 \
    --build --install \
    --jar tika-core/target/tika-core-0.7.jar \
    --jar tika-parsers/target/tika-parsers-0.7.jar \
    --package java.io java.io.FileInputStream \
    --package org.xml.sax \
    --include ~/.m2/repository/com/drewnoakes/metadata-extractor/2.4.0-beta-1/metadata-extractor-2.4.0-beta-1.jar \
    --include ~/.m2/repository/org/apache/poi/poi/3.6/poi-3.6.jar \
    --include ~/.m2/repository/asm/asm/3.1/asm-3.1.jar \
    --include ~/.m2/repository/org/apache/poi/poi-ooxml/3.6/poi-ooxml-3.6.jar \
    --include ~/.m2/repository/org/apache/xmlbeans/xmlbeans/2.3.0/xmlbeans-2.3.0.jar \
    --include ~/.m2/repository/org/apache/pdfbox/pdfbox/1.1.0/pdfbox-1.1.0.jar \
    --include ~/.m2/repository/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar \
    --include ~/.m2/repository/log4j/log4j/1.2.14/log4j-1.2.14.jar \
    --include ~/.m2/repository/org/apache/commons/commons-compress/1.0/commons-compress-1.0.jar \
    --include ~/.m2/repository/org/apache/poi/poi-scratchpad/3.6/poi-scratchpad-3.6.jar \
    --include ~/.m2/repository/org/apache/poi/poi-ooxml-schemas/3.6/poi-ooxml-schemas-3.6.jar \
    --include ~/.m2/repository/org/ccil/cowan/tagsoup/tagsoup/1.2/tagsoup-1.2.jar \
    --include ~/.m2/repository/org/apache/pdfbox/fontbox/1.1.0/fontbox-1.1.0.jar
We kept adding --include pairs until we were able to run the example code we
had in mind, which looked like:
>>> from tika import *
>>> initVM()
>>> metadata = Metadata()
>>> handler = MetadataHandler(metadata, "foo")
>>> parser = AutoDetectParser()
>>> parser.parse(FileInputStream("image.jpg"), handler, metadata)
>>> metadata
<Metadata: Number of Components=3 Model=HP psc1300 Image Height=728 pixels
Data Precision=8 bits YCbCr Positioning=Datum point Reference
Black/White=[0,128,128] [255,255,255] Component 1=Y component: Quantization
table 0, Sampling factors 2 horiz/2 vert Component 2=Cb component:
Quantization table 1, Sampling factors 1 horiz/1 vert Component 3=Cr
component: Quantization table 1, Sampling factors 1 horiz/1 vert X
Resolution=200 dots per inch Resolution Unit=Inch Image Width=1114 pixels
Content-Type=image/jpeg Y Resolution=200 dots per inch Make=HP >
Very cool, Tika !!
I'm sure more --include pairs are necessary for supporting other formats we
haven't tested, but you get the idea...
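Since the list of --include pairs tends to grow, it may be easier to keep the
jars in a Python list and assemble the command from it. This is only a
sketch, assuming Python 3; jcc_args is a made-up helper name, and it only
uses the flags from the command above. It puts the explicitly requested class
names after all the --package pairs, which I'd expect JCC to treat the same
way as the ordering above, but that is an assumption worth checking.

```python
def jcc_args(module, version, jars, includes, packages=(), classes=()):
    """Build the argument list for a JCC run like the one shown above.

    jars     -> wrapped with --jar (wrappers generated for their public classes)
    includes -> bundled with --include (no wrappers generated)
    packages -> extra packages allowed for wrapper generation (--package)
    classes  -> extra class names requested explicitly (positional arguments)
    """
    args = ["python", "-m", "jcc.__main__", "--shared",
            "--python", module, "--version", version,
            "--build", "--install"]
    for jar in jars:
        args += ["--jar", jar]
    for package in packages:
        args += ["--package", package]
    args += list(classes)
    for jar in includes:
        args += ["--include", jar]
    return args
```

The resulting list can be handed to subprocess.check_call(). Note that "~" is
not expanded when no shell is involved, so the jar paths would need
os.path.expanduser() first.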
I hope this helps !
Andi..