Hi,

On Tue, Dec 13, 2011 at 12:23 PM, Nick Burch <nick.bu...@alfresco.com> wrote:
> A couple of issues do spring to mind with this plan:
Good points.

> * Metadata keys - if a parser enhancement or new feature needs a new
>   metadata key, then you end up having to wait for a new tika release
>   to get it (so you can add the code that uses it to a release)

As mentioned by Antoni, in the end the metadata keys are just strings,
so with a little coordination we don't need to delay the introduction of
new keys over multiple releases.

More generally though, I think it would make sense over time to have
tika-core maintain a shared set of metadata keys (Dublin Core, xmpDM,
etc.) that aren't directly tied to any specific parser or file format.
Format-specific keys like the ones we now have in the MSOffice interface
would be better kept next to the actual parser implementation. That way,
as long as the generic metadata keys in tika-core are more or less
complete (i.e. cover all of the key metadata standards), a parser
implementation should rarely need changes in the rest of Tika to
introduce a new custom metadata key (see the sketch further below).

> * Consistency - both of markup and metadata keys will be harder to
>   ensure when it isn't all in the same codebase

Yep, that can be a problem. I guess the ultimate solution to this would
be to come up with a well-documented definition of what a parser should
ideally output for specific kinds of content, but that's quite a bit of
work. A partial solution could be the kind of shared committership model
I was proposing. Then a single committer who wants to increase the level
of consistency should be able to do so without worrying about karma
boundaries.

> For detectors, there's an extra issue here. At the moment, both the
> Zip and OLE2 detectors handle more than just the POI formats, and in
> the Zip case rely on code shared between the parsers (poi+keynote) and
> detector. How would this work if the container detectors were handed
> to POI?

I guess this would require some level of code duplication, i.e. having a
Zip detector in POI that knows about OOXML types, and another in
tika-parsers that knows about other types of Zips.

> And whose job would it be to test it? That's a general thing actually,
> how much testing would need to remain on the Tika side?

I'd still have the upstream libraries as dependencies of tika-parsers,
and we definitely should continue maintaining a good set of integration
tests there. On the other hand, we already have many tests that actually
test against issues in the upstream parser libraries rather than any
code in Tika, and I think those tests would be better located in the
upstream projects. Ultimately test cases should go with the issues where
the particular problems or wishes were expressed.

> Oh, but I guess this counts as your answer on what I should be doing
> with my Ogg Vorbis parser :)

:-) Yep, in a way.

From the beginning the idea behind Tika has been that we should focus on
being a thin integration layer on top of existing parser libraries. The
fact that we're now implementing quite a few parsers ourselves, and the
large amount of code we use to wrap especially POI and to a lesser
degree PDFBox, is a bit of a concern to me. We could and should be
pushing more of this work to places where it would also be useful to
people who aren't using Tika. There are many people who'd likely benefit
from, for example, a good RTF or Ogg Vorbis parser but who don't really
need Tika. Being able to get such people to use and contribute to the
code we've written would indirectly help Tika as well. Attracting such
users and contributions is hard if the code lives only inside Tika.
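To make the metadata key point above a bit more concrete, here's a rough
sketch of what I have in mind. The OggVorbisParser class and the
oggvorbis:* key names are made up just for illustration; the point is
that the plain string-based Metadata.set() calls need nothing new from
tika-core:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Collections;
    import java.util.Set;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.parser.AbstractParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.XHTMLContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public class OggVorbisParser extends AbstractParser {

        // Format-specific keys live next to the parser implementation,
        // so introducing a new one doesn't need a tika-core release.
        public static final String BITRATE = "oggvorbis:bitrate";
        public static final String CHANNELS = "oggvorbis:channels";

        public Set<MediaType> getSupportedTypes(ParseContext context) {
            return Collections.singleton(MediaType.audio("vorbis"));
        }

        public void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException {
            // Generic keys come from the shared set in tika-core...
            metadata.set(Metadata.CONTENT_TYPE, "audio/vorbis");
            // ...while format-specific keys are just strings set here
            // (dummy values; a real parser would read them from the stream)
            metadata.set(BITRATE, "192000");
            metadata.set(CHANNELS, "2");

            XHTMLContentHandler xhtml =
                    new XHTMLContentHandler(handler, metadata);
            xhtml.startDocument();
            // the actual Ogg Vorbis stream parsing would happen here
            xhtml.endDocument();
        }
    }

The same pattern would work for the format-specific keys we now keep in
interfaces like MSOffice: move them next to the parser, wherever that
parser ends up living, and keep only the generic standards in tika-core.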
Similarly, many bits and pieces in our bigger parser classes, especially
those for POI and PDFBox, would also be useful within the upstream
libraries themselves. For example, I could easily see the character run
handling code in WordExtractor, the sparse sheet capturing and rendering
code in ExcelExtractor, or the annotation handling code in PDF2XHTML
becoming a more generally applicable part of the upstream libraries.

So while having all this code in Tika makes it easy for us to maintain
consistency and evolve things rapidly, it introduces a barrier to making
the work we do useful to a wider audience, and thus ultimately reduces
the rate of useful contributions we can expect. During Tika 0.x I think
the tradeoff favored focusing our work on Tika itself, but now that the
1.0 APIs are stable I think the time may be ripe to start reducing the
size of tika-parsers (which has been growing quite a bit, see [1]).

[1] https://www.ohloh.net/p/tika/analyses/latest

BR,

Jukka Zitting