Hey Jukka,

For projects like POI and PDFBox I think this could definitely work. For cases where we have Parsers but aren't ready to push them upstream yet (two examples relevant to me are NetCDF/HDF and GDAL), I think we can just leave the Parser in tika-parsers.

In this manner, what you're really suggesting is that it would be great for our mature Parsers to be "promoted" upstream to the communities that really understand the underlying Parser implementation toolkit. This makes sense to me, so long as there is a Champion or someone in that community willing to spend the small amount of time to learn Tika and its interfaces (if they haven't done so already).

The net effect to the casual Tika user is nil, since we have Parser loading via service factories, and the only thing that'll change there is the package name (and potentially the class name), but that's all behind the scenes.

The net effect to the Tika developer is that the class and package name changes may force folks to recompile code, etc., and the code, unit tests, and maintenance of some of the parsers would no longer be readily available in Tika's tika-parsers artifact, but would live in the tika-parsers dependency library upstream.

Cheers,
Chris

On Dec 13, 2011, at 1:42 AM, Jukka Zitting wrote:

> Hi,
>
> As you know, we see a lot of questions about version mismatches (which
> POI or PDFBox version should go with this Tika version) and there's a
> long queue of patches that are waiting for new official releases of
> our upstream dependencies to become available.
>
> To avoid this issue I propose that we start moving some of our parser
> implementations to upstream projects. Now with Tika 1.0 out we have
> stable Parser and Detector interfaces and related APIs that upstream
> libraries could implement directly without us having to worry about
> changing Tika code whenever a new version of a parser library becomes
> available.
>
> This would allow our users to for example directly upgrade to a new
> POI version without waiting for a related Tika release first.
> Similarly, a new PDF parsing option or improvement could be
> implemented directly in PDFBox and be usable without any code changes
> in Tika.
>
> The classloading and OSGi service mechanisms we've added should make
> such upstream Parser implementations trivially easy to use, and we
> could still keep the dependencies in tika-parsers as a way to pull in
> the libraries even if the relevant implementation classes would no
> longer reside in org.apache.tika.parsers.*.
>
> In addition to some of the GPL libraries for which we've already done
> this, I recently took the liberty of trying this out also with PDFBox.
> See PDFBOX-1132 [1] for the issue where I copied the
> org.apache.tika.pdf implementation to org.apache.pdfbox.tika. It works
> without problems, so now I'd like to propose that we copy any more
> recent PDF parser changes to PDFBox and prepare to drop the parser
> implementation in tika-parsers. Any further PDF parser work should
> then be done directly in PDFBox. I haven't yet talked about this with
> the PDFBox PMC (of which I'm a member), but I suppose we should be
> able to come up with an arrangement where Tika committers can commit
> directly to the Tika parser implementation in PDFBox.
>
> It would be cool if we could do the same thing also with POI.
>
> WDYT?
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-1132
>
> BR,
>
> Jukka Zitting

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++