Re: Pushing parsers upstream

2011-12-22 Thread Nick Burch
On 16/12/11 15:12, Jukka Zitting wrote: As mentioned by Antoni, in the end the metadata keys are just strings, so with a little coordination we don't need to delay the introduction of new keys over multiple releases. Hmm, they're not quite just strings - with the new Property stuff they can al

Re: Pushing parsers upstream

2011-12-16 Thread Antoni Mylka
On 2011-12-16 20:32, Jukka Zitting wrote: Hi, On Fri, Dec 16, 2011 at 7:45 PM, Antoni Mylka wrote: The moment upstream libraries start depending on tika-core, they stop being upstream libraries and become "side-stream" libraries. Putting POI between core and parsers in the dependency chain

Re: Pushing parsers upstream

2011-12-16 Thread Jukka Zitting
Hi, On Fri, Dec 16, 2011 at 8:04 PM, Antoni Mylka wrote: > I don't want to start new flames and understand that the current status quo > is probably the best possible, given all requirements, yet let's not get > carried away about creating yet another ultimate solution. I was just thinking of st

Re: Pushing parsers upstream

2011-12-16 Thread Jukka Zitting
Hi, On Fri, Dec 16, 2011 at 7:45 PM, Antoni Mylka wrote: > The moment upstream libraries start depending on tika-core, they stop being > upstream libraries and become "side-stream" libraries. Putting POI between > core and parsers in the dependency chain will bring all sorts of issues due > to in

Re: Pushing parsers upstream

2011-12-16 Thread Antoni Mylka
On 2011-12-16 16:12, Jukka Zitting wrote: * Consistency - both of markup and metadata keys will be harder to ensure when it isn't in the same codebase Yep, that can be a problem. I guess the ultimate solution to this would be to come up with a well documented definition of what a parser s

Re: Pushing parsers upstream

2011-12-16 Thread Antoni Mylka
scope "import". In general pushing parsers "upstream" brings: - graceful degradation with missing dependencies - ability to use a later pdfbox without updating tika - "social" benefits of putting that code closer to people who'll know most about how to make

Re: Pushing parsers upstream

2011-12-16 Thread Jukka Zitting
Hi, On Tue, Dec 13, 2011 at 6:05 PM, Michael McCandless wrote: > It's true users could directly upgrade their PDFBox w/o waiting for a > Tika release but I suspect most users don't do that... Currently people don't do that because it's so easy to break things by upgrading a parser library in sync

Re: Pushing parsers upstream

2011-12-16 Thread Jukka Zitting
Hi, On Tue, Dec 13, 2011 at 12:23 PM, Nick Burch wrote: > A couple of issues do spring to mind with this plan: Good points. > * Metadata keys - if a parser enhancement or new feature needs a new >  metadata key, then you end up having to wait for a new tika release to >  get it (so you can add

Re: Pushing parsers upstream

2011-12-13 Thread Antoni Mylka
On 2011-12-13 18:05, Michael McCandless wrote: Would it somehow be possible for Tika to ship an unreleased PDFBox? Or does Maven fully tie our hands here? That's the issue. Would it? AFAIU it's impossible. Tika can only depend on jars in maven central. Is it possible to push a snapshot jar

Re: Pushing parsers upstream

2011-12-13 Thread Michael McCandless
+0 I agree, logically, parsers "belong" with their upstream project, since as that project improves how the document format is cracked, they can also make the matching fixes to Tika's parser. As long as there's enough love / advocate / testing for the Tika parser in that project... My only concern is

Re: Pushing parsers upstream

2011-12-13 Thread Mattmann, Chris A (388J)
Hey Jukka, For places like POI and PDFBox I think this could definitely work. And then for places where we have Parsers, but aren't ready to push upstream yet (I can think of two examples of this relevant to me, NetCDF/HDF and GDAL), we can just leave the Parser in tika-parsers I think. In this

Re: Pushing parsers upstream

2011-12-13 Thread Antoni Mylka
" trunks, sometimes trunks with my patches. See for instance http://aperture.sourceforge.net/maven/org/apache/poi/poi/ This would clearly work for an "internal" project, but didn't work too well for an open source project. It also takes lots of work. With Tika such a solutio

Re: Pushing parsers upstream

2011-12-13 Thread Nick Burch
On Tue, 13 Dec 2011, Jukka Zitting wrote: To avoid this issue I propose that we start moving some of our parser implementations to upstream projects. Now with Tika 1.0 out we have stable Parser and Detector interfaces and related APIs that upstream libraries could implement directly without u

Pushing parsers upstream

2011-12-13 Thread Jukka Zitting
Hi, As you know, we see a lot of questions about version mismatches (which POI or PDFBox version should go with this Tika version) and there's a long queue of patches that are waiting for new official releases of our upstream dependencies to become available. To avoid this issue I propose that we