On Fri, 9 Apr 2010, Alban Mouton wrote:

Actually I chose the lazy solution and included the tika-app jar, which, if I
am correct, already contains all the dependencies (it's a standalone version
of Tika) that you included one by one. I didn't test all formats with my
version of the wrapper, but I think it should be fine as it is (maybe a few
more classes or --package, but probably no --include).

--include is actually neat because every jar included this way is added to
the classpath at runtime and is bundled into the tika egg. In other words,
you can then take that egg and install it elsewhere without having to worry
about carrying your Maven repository around or manually setting your
classpath during initVM() or via the environment.
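
For comparison, here is a minimal sketch of the manual route that --include
avoids: joining the dependency jars into a classpath string by hand. The
helper name classpath_from_jars and the two-jar list are only illustrative
assumptions, and the way the result is passed on (shown in the final
comment) would need checking against the JCC documentation.

```python
import os

def classpath_from_jars(jar_paths):
    """Join jar paths into a single classpath string for the Java VM."""
    # Expand "~" so paths such as ~/.m2/repository/... resolve to the home directory.
    expanded = [os.path.expanduser(p) for p in jar_paths]
    # Classpath entries are separated by os.pathsep (":" on Unix, ";" on Windows).
    return os.pathsep.join(expanded)

# Illustrative list only: without --include, every dependency would be listed here.
jars = [
    "~/.m2/repository/org/apache/poi/poi/3.6/poi-3.6.jar",
    "~/.m2/repository/asm/asm/3.1/asm-3.1.jar",
]
classpath = classpath_from_jars(jars)
print(classpath)
# Assumption: this string would then be appended to the wrapper's own classpath
# and handed to the VM, e.g. initVM(classpath=...).
```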

Andi..

Very cool, Tika indeed! And it covers functionality that is missing in
the Python world (to my knowledge anyway), so this wrapper might be
useful.

Alban


On Fri, 9 Apr 2010, Alban Mouton wrote:

Hello,

I didn't find specific information on the web on how to do this, except for
this mail:
http://www.mail-archive.com/pylucene-dev@lucene.apache.org/msg00577.html
The JCC documentation might be enough, but it won't hurt to add a few
specifics.

With me, that makes at least two people who needed it. Enough to set up a
small wiki page, in my opinion:
http://redmine.djity.net/projects/pythontika/wiki

The wrapper seems to work fine, but it hasn't been tested much yet.

Thanks to the JCC developers, it's a very useful piece of software!

A guy at work asked me the same question a couple of days ago.
Not knowing Tika, I helped him with the JCC command line aspects. He used
a Tika example similar to the one you mention on your wiki page.

Lucene and Tika are a bit different in that Tika depends on a long list of
third-party libraries, all helpfully downloaded by Maven into the local
Maven repository in your home directory as you build Tika. Lucene is
standalone.

The approach we took was to --jar the tika core and tika parsers, getting
access to all public classes in these two jar files from Python, and to use
--include for all the other dependencies we found, so that we avoid
generating wrappers for them but still get them included in the resulting
tika egg. We also needed to allow wrapper generation for the java.io and
org.xml.sax packages by using --package, and to explicitly request
java.io.FileInputStream. I noticed you used class names with --package
there...

This is the JCC command we used for wrapping Tika 0.7 with Python 2.6.2:

python -m jcc.__main__ --shared --python tika --version 0.7 \
   --build --install \
   --jar tika-core/target/tika-core-0.7.jar \
   --jar tika-parsers/target/tika-parsers-0.7.jar \
   --package java.io java.io.FileInputStream \
   --package org.xml.sax \
   --include ~/.m2/repository/com/drewnoakes/metadata-extractor/2.4.0-beta-1/metadata-extractor-2.4.0-beta-1.jar \
   --include ~/.m2/repository/org/apache/poi/poi/3.6/poi-3.6.jar \
   --include ~/.m2/repository/asm/asm/3.1/asm-3.1.jar \
   --include ~/.m2/repository/org/apache/poi/poi-ooxml/3.6/poi-ooxml-3.6.jar \
   --include ~/.m2/repository/org/apache/xmlbeans/xmlbeans/2.3.0/xmlbeans-2.3.0.jar \
   --include ~/.m2/repository/org/apache/pdfbox/pdfbox/1.1.0/pdfbox-1.1.0.jar \
   --include ~/.m2/repository/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar \
   --include ~/.m2/repository/log4j/log4j/1.2.14/log4j-1.2.14.jar \
   --include ~/.m2/repository/org/apache/commons/commons-compress/1.0/commons-compress-1.0.jar \
   --include ~/.m2/repository/org/apache/poi/poi-scratchpad/3.6/poi-scratchpad-3.6.jar \
   --include ~/.m2/repository/org/apache/poi/poi-ooxml-schemas/3.6/poi-ooxml-schemas-3.6.jar \
   --include ~/.m2/repository/org/ccil/cowan/tagsoup/tagsoup/1.2/tagsoup-1.2.jar \
   --include ~/.m2/repository/org/apache/pdfbox/fontbox/1.1.0/fontbox-1.1.0.jar

We kept adding --include pairs until we were able to run the example code
we had in mind, which looked like:
  >>> from tika import *
  >>> initVM()
  >>> metadata = Metadata()
  >>> handler = MetadataHandler(metadata, "foo")
  >>> parser = AutoDetectParser()
  >>> parser.parse(FileInputStream("image.jpg"), handler, metadata)
  >>> metadata
  <Metadata: Number of Components=3 Model=HP psc1300 Image Height=728 pixels Data Precision=8 bits YCbCr Positioning=Datum point Reference Black/White=[0,128,128] [255,255,255] Component 1=Y component: Quantization table 0, Sampling factors 2 horiz/2 vert Component 2=Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert Component 3=Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert X Resolution=200 dots per inch Resolution Unit=Inch Image Width=1114 pixels Content-Type=image/jpeg Y Resolution=200 dots per inch Make=HP >

Very cool, Tika !!

I'm sure more --include pairs are necessary for supporting other formats
we haven't tested, but you get the idea...
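
Since the list of --include pairs tends to grow, one option is to assemble
the command from plain lists. The following is only a sketch: the function
name build_jcc_command and its parameters are made up for illustration, and
it uses nothing beyond the flags that appear in the command above.

```python
import os

def build_jcc_command(name, version, jars, packages, classes, includes):
    """Assemble the argument list for a JCC run like the one above."""
    cmd = ["python", "-m", "jcc.__main__", "--shared",
           "--python", name, "--version", version,
           "--build", "--install"]
    # Jars whose public classes get Python wrappers.
    for jar in jars:
        cmd += ["--jar", jar]
    # Packages for which wrapper generation is allowed.
    for package in packages:
        cmd += ["--package", package]
    # Explicitly requested classes are passed as bare arguments.
    cmd += list(classes)
    # Dependencies bundled into the egg without generating wrappers.
    for jar in includes:
        cmd += ["--include", os.path.expanduser(jar)]
    return cmd

cmd = build_jcc_command(
    "tika", "0.7",
    jars=["tika-core/target/tika-core-0.7.jar",
          "tika-parsers/target/tika-parsers-0.7.jar"],
    packages=["java.io", "org.xml.sax"],
    classes=["java.io.FileInputStream"],
    includes=["~/.m2/repository/org/apache/poi/poi/3.6/poi-3.6.jar"],
)
print(" ".join(cmd))
# To actually run it: subprocess.check_call(cmd)
```

Supporting another format then means appending one more path to the
includes list.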

I hope this helps !

Andi..


