Re: Python wrapper for Tika using JCC

Andi Vajda Fri, 09 Apr 2010 10:32:20 -0700


On Fri, 9 Apr 2010, Alban Mouton wrote:

Hello,

I didn't find specific information on the web about how to do this, except for this mail:
http://www.mail-archive.com/pylucene-dev@lucene.apache.org/msg00577.html
JCC doc might be enough, but it won't hurt to add a few specifics.

Counting me, that makes at least two people who needed it. Enough to set up a
small wiki page in my opinion:
http://redmine.djity.net/projects/pythontika/wiki

The wrapper seems to work fine, but it hasn't been tested much yet.

Thanks to the JCC developers, it's a very useful piece of software!


A guy at work asked me the same question a couple of days ago.
Not knowing Tika, I helped him with the JCC command line aspects. He used a
Tika example similar to the one on your wiki page.

Lucene and Tika are a bit different in that Tika depends on a long list of third-party libraries, all helpfully downloaded by Maven into the local Maven repository in your home directory as you build Tika. Lucene is standalone.

The approach we took was to --jar the tika core and tika parsers jars, getting access to all public classes in these two jar files from Python, and to use --include for all the other dependencies we found, so that we avoid generating wrappers for them but still get them included in the resulting tika egg. We also needed to allow wrapper generation for the java.io and org.xml.sax packages by using --package, and to explicitly request java.io.FileInputStream. I noticed you used class names with --package there...


This is the JCC command we used for wrapping Tika 0.7 with Python 2.6.2:

python -m jcc.__main__ --shared --python tika --version 0.7 \
  --build --install \
  --jar tika-core/target/tika-core-0.7.jar \
  --jar tika-parsers/target/tika-parsers-0.7.jar \
  --package java.io java.io.FileInputStream \
  --package org.xml.sax \
  --include ~/.m2/repository/com/drewnoakes/metadata-extractor/2.4.0-beta-1/metadata-extractor-2.4.0-beta-1.jar \
  --include ~/.m2/repository/org/apache/poi/poi/3.6/poi-3.6.jar \
  --include ~/.m2/repository/asm/asm/3.1/asm-3.1.jar \
  --include ~/.m2/repository/org/apache/poi/poi-ooxml/3.6/poi-ooxml-3.6.jar \
  --include ~/.m2/repository/org/apache/xmlbeans/xmlbeans/2.3.0/xmlbeans-2.3.0.jar \
  --include ~/.m2/repository/org/apache/pdfbox/pdfbox/1.1.0/pdfbox-1.1.0.jar \
  --include ~/.m2/repository/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar \
  --include ~/.m2/repository/log4j/log4j/1.2.14/log4j-1.2.14.jar \
  --include ~/.m2/repository/org/apache/commons/commons-compress/1.0/commons-compress-1.0.jar \
  --include ~/.m2/repository/org/apache/poi/poi-scratchpad/3.6/poi-scratchpad-3.6.jar \
  --include ~/.m2/repository/org/apache/poi/poi-ooxml-schemas/3.6/poi-ooxml-schemas-3.6.jar \
  --include ~/.m2/repository/org/ccil/cowan/tagsoup/tagsoup/1.2/tagsoup-1.2.jar \
  --include ~/.m2/repository/org/apache/pdfbox/fontbox/1.1.0/fontbox-1.1.0.jar
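
Since the command grows by one --include pair per dependency, it may be easier to assemble it from lists than to edit it by hand. The following is only a sketch of that idea, not something we actually used: the function name build_jcc_args and its parameters are made up for illustration, it is written for a current Python 3 rather than the Python 2.6.2 mentioned above, and it only uses the JCC flags that appear in the command above. It builds the argument list without running anything.

```python
import os


def build_jcc_args(name, version, jars, packages, includes):
    """Assemble a JCC command line as a list of strings (hypothetical helper).

    packages maps a package name to the class names explicitly requested
    right after it, mirroring "--package java.io java.io.FileInputStream".
    """
    args = ["python", "-m", "jcc.__main__", "--shared",
            "--python", name, "--version", version,
            "--build", "--install"]
    # Jars whose public classes get Python wrappers.
    for jar in jars:
        args += ["--jar", jar]
    # Extra packages to allow wrapper generation for, each followed by
    # any explicitly requested classes.
    for package, classes in packages.items():
        args += ["--package", package]
        args += list(classes)
    # Dependencies bundled into the egg without wrappers. "~" is expanded
    # here because no shell is involved when the list is passed to subprocess.
    for jar in includes:
        args += ["--include", os.path.expanduser(jar)]
    return args


if __name__ == "__main__":
    example = build_jcc_args(
        "tika", "0.7",
        jars=["tika-core/target/tika-core-0.7.jar",
              "tika-parsers/target/tika-parsers-0.7.jar"],
        packages={"java.io": ["java.io.FileInputStream"], "org.xml.sax": []},
        includes=["~/.m2/repository/asm/asm/3.1/asm-3.1.jar"],
    )
    print(" ".join(example))
```

The resulting list could be handed to subprocess.check_call() once it looks right; adding a dependency then means appending one path to the includes list.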

We kept adding --include pairs until we were able to run the example code we had in mind, which looked like:

  >>> from tika import *
  >>> initVM()
  >>> metadata = Metadata()
  >>> handler = MetadataHandler(metadata, "foo")
  >>> parser = AutoDetectParser()
  >>> parser.parse(FileInputStream("image.jpg"), handler, metadata)
  >>> metadata

<Metadata: Number of Components=3 Model=HP psc1300 Image Height=728 pixels Data Precision=8 bits YCbCr Positioning=Datum point Reference Black/White=[0,128,128] [255,255,255] Component 1=Y component: Quantization table 0, Sampling factors 2 horiz/2 vert Component 2=Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert Component 3=Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert X Resolution=200 dots per inch Resolution Unit=Inch Image Width=1114 pixels Content-Type=image/jpeg Y Resolution=200 dots per inch Make=HP >


Very cool, Tika !!

I'm sure more --include pairs are necessary for supporting other formats we haven't tested, but you get the idea...
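
One way to hunt for those extra jars might be to look them up in the local Maven repository by artifact name. This is just a rough sketch under assumptions: find_jars and as_include_args are hypothetical helper names, and it relies on the repository layout visible in the paths above (.../<artifact>/<version>/<artifact>-<version>.jar). Which artifacts are actually needed still has to be worked out by trial and error.

```python
from pathlib import Path


def find_jars(repo_root, artifact_ids):
    """Return jar paths under repo_root that belong to the given artifacts.

    A jar is matched when the directory two levels above it carries the
    artifact name, e.g. .../poi/3.6/poi-3.6.jar matches "poi" but
    .../poi-ooxml/3.6/poi-ooxml-3.6.jar does not.
    """
    root = Path(repo_root).expanduser()
    wanted = set(artifact_ids)
    found = []
    for jar in sorted(root.rglob("*.jar")):
        if jar.parent.parent.name in wanted:
            found.append(str(jar))
    return found


def as_include_args(jar_paths):
    """Turn a list of jar paths into --include pairs for the JCC command."""
    args = []
    for path in jar_paths:
        args += ["--include", path]
    return args


if __name__ == "__main__":
    # Artifact names taken from the command above; extend as needed.
    jars = find_jars("~/.m2/repository", ["tagsoup", "fontbox"])
    print(" ".join(as_include_args(jars)))
```

Note that if several versions of an artifact are present in the repository, all of them are returned, so the list would need pruning by hand.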


I hope this helps !

Andi..
