On Fri, 9 Apr 2010, Alban Mouton wrote:

Hello,

I couldn't find much on the web about how to do this, except for this mail:
http://www.mail-archive.com/pylucene-dev@lucene.apache.org/msg00577.html
The JCC documentation might be enough, but it won't hurt to add a few specifics.

Counting me, that makes at least two people who needed it. Enough to set up a
small wiki page, in my opinion:
http://redmine.djity.net/projects/pythontika/wiki

The wrapper seems to work fine, but it hasn't been tested much yet.

Thanks to the JCC developers, it's a very useful piece of software!

A guy at work asked me the same question a couple of days ago.
Not knowing Tika, I helped him with the JCC command line aspects. He used a
Tika example similar to the one on your wiki page.

Lucene and Tika are a bit different in that Tika depends on a long list of third-party libraries, all of which Maven helpfully downloads into the local Maven repository in your home directory as you build Tika. Lucene is standalone.

The approach we took was to use --jar for the tika core and tika parsers jars, which gives access from Python to all public classes in these two jar files, and to use --include for all the other dependencies we found, so that we avoid generating wrappers for them but still get them included in the resulting tika egg. We also needed to allow wrapper generation for the java.io and org.xml.sax packages by using --package, and to explicitly request java.io.FileInputStream. I noticed you used class names with --package there...

This is the JCC command we used for wrapping Tika 0.7 with Python 2.6.2:

python -m jcc.__main__ --shared --python tika --version 0.7 \
  --build --install \
  --jar tika-core/target/tika-core-0.7.jar \
  --jar tika-parsers/target/tika-parsers-0.7.jar \
  --package java.io java.io.FileInputStream \
  --package org.xml.sax \
  --include ~/.m2/repository/com/drewnoakes/metadata-extractor/2.4.0-beta-1/metadata-extractor-2.4.0-beta-1.jar \
  --include ~/.m2/repository/org/apache/poi/poi/3.6/poi-3.6.jar \
  --include ~/.m2/repository/asm/asm/3.1/asm-3.1.jar \
  --include ~/.m2/repository/org/apache/poi/poi-ooxml/3.6/poi-ooxml-3.6.jar \
  --include ~/.m2/repository/org/apache/xmlbeans/xmlbeans/2.3.0/xmlbeans-2.3.0.jar \
  --include ~/.m2/repository/org/apache/pdfbox/pdfbox/1.1.0/pdfbox-1.1.0.jar \
  --include ~/.m2/repository/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar \
  --include ~/.m2/repository/log4j/log4j/1.2.14/log4j-1.2.14.jar \
  --include ~/.m2/repository/org/apache/commons/commons-compress/1.0/commons-compress-1.0.jar \
  --include ~/.m2/repository/org/apache/poi/poi-scratchpad/3.6/poi-scratchpad-3.6.jar \
  --include ~/.m2/repository/org/apache/poi/poi-ooxml-schemas/3.6/poi-ooxml-schemas-3.6.jar \
  --include ~/.m2/repository/org/ccil/cowan/tagsoup/tagsoup/1.2/tagsoup-1.2.jar \
  --include ~/.m2/repository/org/apache/pdfbox/fontbox/1.1.0/fontbox-1.1.0.jar
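
If the list of --include jars keeps growing, it may be easier to assemble the command from Python than to maintain it by hand. The following is only a sketch: maven_jar_path and jcc_args are hypothetical helper names, not part of JCC, and the path layout is assumed to be the usual local Maven repository one (group id with dots turned into directories, then artifact id, then version). The JCC flags are the ones used in the command above.

```python
import os
import posixpath


def maven_jar_path(group_id, artifact_id, version, repo=None):
    """Return the jar path under a local Maven repository (assumed layout)."""
    if repo is None:
        # Assumption: the repository lives in the default ~/.m2/repository.
        repo = os.path.expanduser("~/.m2/repository")
    parts = group_id.split(".") + [
        artifact_id,
        version,
        "%s-%s.jar" % (artifact_id, version),
    ]
    return posixpath.join(repo, *parts)


def jcc_args(wrapped_jars, included_jars, packages, classes):
    """Build the JCC argument list: --jar, --package, class names, --include."""
    args = ["python", "-m", "jcc.__main__", "--shared",
            "--python", "tika", "--version", "0.7",
            "--build", "--install"]
    for jar in wrapped_jars:
        args += ["--jar", jar]        # wrappers generated for public classes
    for package in packages:
        args += ["--package", package]
    args += list(classes)             # classes requested explicitly by name
    for jar in included_jars:
        args += ["--include", jar]    # bundled in the egg, not wrapped
    return args


if __name__ == "__main__":
    includes = [
        maven_jar_path("org.apache.poi", "poi", "3.6"),
        maven_jar_path("org.apache.pdfbox", "pdfbox", "1.1.0"),
    ]
    command = jcc_args(
        ["tika-core/target/tika-core-0.7.jar",
         "tika-parsers/target/tika-parsers-0.7.jar"],
        includes,
        ["java.io", "org.xml.sax"],
        ["java.io.FileInputStream"],
    )
    print(" ".join(command))
```

The resulting list could be handed to subprocess.call(command) from the Tika source directory; the sketch only prints it so that it can be checked first.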

We kept adding --include pairs until we were able to run the example code we had in mind, which looked like this:
  >>> from tika import *
  >>> initVM()
  >>> metadata = Metadata()
  >>> handler = MetadataHandler(metadata, "foo")
  >>> parser = AutoDetectParser()
  >>> parser.parse(FileInputStream("image.jpg"), handler, metadata)
  >>> metadata
<Metadata: Number of Components=3 Model=HP psc1300 Image Height=728 pixels Data Precision=8 bits YCbCr Positioning=Datum point Reference Black/White=[0,128,128] [255,255,255] Component 1=Y component: Quantization table 0, Sampling factors 2 horiz/2 vert Component 2=Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert Component 3=Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert X Resolution=200 dots per inch Resolution Unit=Inch Image Width=1114 pixels Content-Type=image/jpeg Y Resolution=200 dots per inch Make=HP >

Very cool, Tika !!

I'm sure more --include pairs are necessary for supporting other formats we haven't tested but you get the idea...
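
To find out which jar to add next, something along these lines might help. It is a rough sketch, assuming the failure you hit names a missing Java class and that the jar providing it is already somewhere under your local Maven repository; jars_containing is a hypothetical helper, not part of JCC or Tika.

```python
import os
import zipfile


def jars_containing(class_name, repo_root):
    """Return the sorted paths of jars under repo_root that hold class_name."""
    # A class such as org.example.Foo is stored as org/example/Foo.class.
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        for filename in filenames:
            if not filename.endswith(".jar"):
                continue
            path = os.path.join(dirpath, filename)
            try:
                with zipfile.ZipFile(path) as jar:
                    if entry in jar.namelist():
                        matches.append(path)
            except zipfile.BadZipFile:
                # Skip truncated or partially downloaded jars.
                continue
    return sorted(matches)
```

Calling jars_containing with the missing class name and os.path.expanduser("~/.m2/repository") lists the candidate jars to pass to --include.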

I hope this helps !

Andi..
