Hi,

On Tue, Dec 13, 2011 at 12:23 PM, Nick Burch <nick.bu...@alfresco.com> wrote:
> A couple of issues do spring to mind with this plan:
Good points.

> * Metadata keys - if a parser enhancement or new feature needs a new
>   metadata key, then you end up having to wait for a new tika release
>   to get it (so you can add the code that uses it to a release)

As mentioned by Antoni, in the end the metadata keys are just strings,
so with a little coordination we don't need to delay the introduction of
new keys over multiple releases.

More generally though, I think it would make sense over time to have
tika-core maintain a shared set of metadata keys (Dublin Core, xmpDM,
etc.) that aren't directly tied to any specific parser or file format.
Format-specific keys like the ones we now have in the MSOffice interface
would be better kept next to the actual parser implementation. That way,
as long as the generic metadata keys in tika-core are more or less
complete (i.e. cover all of the key metadata standards), a parser
implementation should rarely need changes in the rest of Tika to
introduce a new custom metadata key (see the sketch further below).

> * Consistency - both of markup and metadata keys will be harder to
>   ensure when it isn't all in the same codebase

Yep, that can be a problem. I guess the ultimate solution to this would
be to come up with a well-documented definition of what a parser should
ideally output for specific kinds of content, but that's quite a bit of
work. A partial solution could be the kind of shared committership model
I was proposing. Then a single committer who wants to increase the level
of consistency should be able to do so without worrying about karma
boundaries.

> For detectors, there's an extra issue here. At the moment, both the
> Zip and OLE2 detectors handle more than just the POI formats, and in
> the Zip case rely on code shared between the parsers (poi+keynote) and
> detector. How would this work if the container detectors were handed
> to POI?

I guess this would require some level of code duplication, i.e. having a
Zip detector in POI that knows about OOXML types, and another in
tika-parsers that knows about other types of Zips.

> And whose job would it be to test it? That's a general thing actually,
> how much testing would need to remain on the Tika side?

I'd still have the upstream libraries as dependencies of tika-parsers,
and we definitely should continue maintaining a good set of integration
tests there. On the other hand, we already have many tests that actually
test against issues in the upstream parser libraries rather than any
code in Tika, and I think those tests would be better located in the
upstream projects. Ultimately test cases should go with the issues where
the particular problems or wishes were expressed.

> Oh, but I guess this counts as your answer on what I should be doing
> with my Ogg Vorbis parser :)

:-) Yep, in a way.

From the beginning the idea behind Tika has been that we should focus on
being a thin integration layer on top of existing parser libraries. The
fact that we're now implementing quite a few parsers ourselves, and the
large amount of code we use to wrap especially POI and to a lesser
degree PDFBox, is a bit of a concern to me. We could and should be
pushing more of this work to places where it would also be useful to
people who aren't using Tika. There are many people who'd likely benefit
from, for example, a good RTF or Ogg Vorbis parser but who don't really
need Tika. Being able to get such people to use and contribute to the
code we've written would indirectly help Tika as well. Attracting such
users and contributions is hard if the code lives only inside Tika.
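To make the metadata key point above a bit more concrete, here's a rough
sketch of what I have in mind. The OggVorbisParser class and the
oggvorbis:* key names are made up just for illustration; the point is
that the plain string-based Metadata.set() calls need nothing new from
tika-core:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Collections;
    import java.util.Set;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.parser.AbstractParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.XHTMLContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public class OggVorbisParser extends AbstractParser {

        // Format-specific keys live next to the parser implementation,
        // so introducing a new one doesn't need a tika-core release.
        public static final String BITRATE = "oggvorbis:bitrate";
        public static final String CHANNELS = "oggvorbis:channels";

        public Set<MediaType> getSupportedTypes(ParseContext context) {
            return Collections.singleton(MediaType.audio("vorbis"));
        }

        public void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException {
            // Generic keys come from the shared set in tika-core...
            metadata.set(Metadata.CONTENT_TYPE, "audio/vorbis");
            // ...while format-specific keys are just strings set here
            // (dummy values; a real parser would read them from the stream)
            metadata.set(BITRATE, "192000");
            metadata.set(CHANNELS, "2");

            XHTMLContentHandler xhtml =
                    new XHTMLContentHandler(handler, metadata);
            xhtml.startDocument();
            // the actual Ogg Vorbis stream parsing would happen here
            xhtml.endDocument();
        }
    }

The same pattern would work for the format-specific keys we now keep in
interfaces like MSOffice: move them next to the parser, wherever that
parser ends up living, and keep only the generic standards in tika-core.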
Similarly, many bits and pieces in our bigger parser classes, especially
those for POI and PDFBox, would also be useful within the upstream
libraries themselves. For example, I could easily see the character run
handling code in WordExtractor, the sparse sheet capturing and rendering
code in ExcelExtractor, or the annotation handling code in PDF2XHTML
becoming a more generally applicable part of the upstream libraries.

So while having all this code in Tika makes it easy for us to maintain
consistency and evolve things rapidly, it introduces a barrier to making
the work we do useful to a wider audience, and thus ultimately reduces
the rate of useful contributions we can expect. During Tika 0.x I think
the tradeoff favored focusing our work on Tika itself, but now that the
1.0 APIs are stable I think the time may be ripe to start reducing the
size of tika-parsers (which has been growing quite a bit, see [1]).

[1] https://www.ohloh.net/p/tika/analyses/latest

BR,

Jukka Zitting