Re: [DISCUSS] Making storage-api a separately released artifact

Owen O'Malley Wed, 17 Aug 2016 14:05:48 -0700

On Wed, Aug 17, 2016 at 10:46 AM, Alan Gates <alanfga...@gmail.com> wrote:

> +1 for making the API clean and easy for other projects to work with.  A
> few questions:
>
> 1) Would this also make it easier for Parquet and others to implement
> Hive’s ACID interfaces?
>

Currently the ACID interfaces haven't been moved over to storage-api,
although it would make sense to do so at some point.

>
> 2) Would we make any attempt to coordinate version numbers between Hive
> and the storage module, or would a given version of Hive just depend on a
> given version of the storage module?
>

The two options that I see are:

* Let the numbers run separately starting from 2.2.0.
* Tie the numbers together with an additional level of versioning (e.g.,
2.2.0.0).

I think that letting the two version numbers diverge is better in the long
term. For example, if you need to make an incompatible change, it is pretty
ugly to express it in a fourth-level version number (e.g., an incompatible
change from 2.2.0.0 to 2.2.0.1). At the beginning, I expect that storage-api
will move faster than Hive, but as it stabilizes I expect it might start
moving slower than Hive.
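
To make the concern concrete, here is a rough sketch. It assumes a
convention in which only a change in the leading version component signals an
incompatible release; that convention, and the VersionSchemes class and its
method names, are hypothetical and not part of Hive or storage-api.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Hive or storage-api.
public class VersionSchemes {

    // Parse a dotted version such as "2.2.0" or "2.2.0.1" into numeric parts.
    static int[] parse(String version) {
        return Arrays.stream(version.split("\\."))
                     .mapToInt(Integer::parseInt)
                     .toArray();
    }

    // Assumed convention: a consumer treats a release as incompatible only
    // when the leading (major) component changes.
    static boolean signalsIncompatibleChange(String from, String to) {
        return parse(from)[0] != parse(to)[0];
    }

    public static void main(String[] args) {
        // Independent numbering: the break can be expressed as a major bump.
        System.out.println(signalsIncompatibleChange("2.2.0", "3.0.0"));     // true
        // Numbering tied to Hive: the same break hides in the fourth level.
        System.out.println(signalsIncompatibleChange("2.2.0.0", "2.2.0.1")); // false
    }
}
```

With independent numbers, storage-api can bump its own major version when it
breaks compatibility; with a fourth level tied to Hive's version, the first
three components are fixed by Hive, so the break cannot be signaled there.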

I'd propose that we have Hive's build use a released version of storage-api
rather than a snapshot.

Thoughts?

   Owen

> Alan.
>
> > On Aug 15, 2016, at 17:01, Owen O'Malley <omal...@apache.org> wrote:
> >
> > All,
> >
> > As part of moving ORC out of Hive, we pulled all of the vectorization
> > storage and sarg classes into a separate module, which is named
> > storage-api.  Although it is currently only used by ORC, it could be used
> > by Parquet or Avro if they wanted to make a fast vectorized reader that
> > reads directly into Hive's VectorizedRowBatch without needing a shim or
> > data copy. Note that this is in many ways similar to pulling the Arrow
> > project out of Drill.
> >
> > This unfortunately still leaves us with a circular dependency between
> > Hive and ORC. I'd hoped that storage-api wouldn't change that much, but
> > that doesn't seem to be happening. As a result, ORC ends up shipping its
> > own fork of storage-api.
> >
> > Although we could make a new project for just the storage-api, I think it
> > would be better to make it a subproject of Hive that is released
> > independently.
> >
> > What do others think?
> >
> >   Owen
>
>
