Re: [DISCUSS] Making storage-api a separately released artifact

Matthew McCline Fri, 26 Aug 2016 13:16:08 -0700

For good performance, the VectorizedRowBatch doesn't follow "traditional" good
object-oriented design rules, for better or worse.  We made a number of member
variables public so they can be accessed directly (e.g. for LongColumnVector
the long[] vector is public) and avoided using an interface, for faster direct
object access to the ColumnVector family.
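
A minimal sketch of the access pattern described above, for illustration only.
The class names (LongColumnSketch, RowBatchSketch, DirectAccessDemo) are
hypothetical, simplified stand-ins and are not the real storage-api classes;
the only detail taken from this message is that the long[] vector is a public
field read without going through an interface.

```java
// Hypothetical, simplified stand-ins; these are NOT the real storage-api classes.
class LongColumnSketch {
    // Public on purpose: readers and operators index the array directly.
    public final long[] vector;

    LongColumnSketch(int capacity) {
        this.vector = new long[capacity];
    }
}

class RowBatchSketch {
    public final LongColumnSketch[] cols; // public array of columns
    public int size;                      // number of valid rows in the batch

    RowBatchSketch(int numCols, int capacity) {
        cols = new LongColumnSketch[numCols];
        for (int i = 0; i < numCols; i++) {
            cols[i] = new LongColumnSketch(capacity);
        }
    }
}

class DirectAccessDemo {
    // Sums one column with a tight loop over the public array:
    // no getter call and no interface dispatch per row.
    static long sumColumn(RowBatchSketch batch, int colIndex) {
        long[] values = batch.cols[colIndex].vector;
        long total = 0;
        for (int row = 0; row < batch.size; row++) {
            total += values[row];
        }
        return total;
    }

    public static void main(String[] args) {
        RowBatchSketch batch = new RowBatchSketch(1, 1024);
        for (int row = 0; row < 4; row++) {
            batch.cols[0].vector[row] = row + 1; // a writer fills the array in place
        }
        batch.size = 4;
        System.out.println(sumColumn(batch, 0)); // sums 1 + 2 + 3 + 4
    }
}
```

The trade-off in this sketch is the one named above: any code holding the batch
can modify the arrays, in exchange for a per-row cost of a plain array load.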


________________________________________
From: Sergio Pena <sergio.p...@cloudera.com>
Sent: Friday, August 26, 2016 12:58 PM
To: dev
Subject: Re: [DISCUSS] Making storage-api a separately released artifact

Question:

Wouldn't it be better to move part of the implementations to Orc, Parquet and
Avro, and just have some interfaces and basic implementations in Hive? This
way we could avoid Orc, Parquet and/or Avro depending on Hive. I saw this
in Parquet, where they created a RowBatch class internally and return that
to Hive; then in Hive we would just bind it to the Hive vectorized interface
to support vectorization. It's just an idea, and I am not sure I am
explaining it clearly :)
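
A rough sketch of what such a binding could look like, assuming a layout where
Hive owns only a small interface and the format library owns its own batch
class. Every name here (LongColumnView, FormatRowBatch, FormatColumnBinding,
BindingDemo) is hypothetical; none of them is an existing Hive or Parquet type.

```java
// Hypothetical names throughout; none of these are real Hive or Parquet types.

// What Hive would own under this idea: a small read-only interface.
interface LongColumnView {
    int size();
    long get(int row);
}

// What a file-format library would own: its own batch class, with no Hive imports.
class FormatRowBatch {
    final long[][] columns; // columns[col][row]
    final int rowCount;

    FormatRowBatch(long[][] columns, int rowCount) {
        this.columns = columns;
        this.rowCount = rowCount;
    }
}

// Glue on the Hive side that binds the library's batch to Hive's interface.
class FormatColumnBinding implements LongColumnView {
    private final FormatRowBatch batch;
    private final int col;

    FormatColumnBinding(FormatRowBatch batch, int col) {
        this.batch = batch;
        this.col = col;
    }

    @Override
    public int size() {
        return batch.rowCount;
    }

    @Override
    public long get(int row) {
        return batch.columns[col][row];
    }
}

class BindingDemo {
    // Hive-side code that sees only the interface, never the library class.
    static long sum(LongColumnView view) {
        long total = 0;
        for (int row = 0; row < view.size(); row++) {
            total += view.get(row); // one interface call per row
        }
        return total;
    }

    public static void main(String[] args) {
        FormatRowBatch batch = new FormatRowBatch(new long[][] {{10, 20, 30}}, 3);
        System.out.println(sum(new FormatColumnBinding(batch, 0)));
    }
}
```

In this sketch the dependency points from the Hive-side glue to the format
library, not the other way around; the price is an interface call per row.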


On Fri, Aug 19, 2016 at 11:01 PM, Lefty Leverenz <leftylever...@gmail.com>
wrote:

> Sergey's idea is creative, although it leads to confusion about JIRA fix
> versions.  Issues would be given fix versions based on assumptions about
> whether SA or Hive will be released first.  (That's hard to predict when
> it's months away.)
>
> Keeping the version numbers tied together is very appealing.  Would it be
> possible to have incompatible changes in SA force a bump in the Hive
> release number?  Hm, I guess that means Hive would need a release at the
> same time as SA, but only for incompatible changes.
>
> What's the likelihood of another subproject getting spun off eventually?
> If that happened, the 4th minor version wouldn't make sense.  A 5th minor
> version wouldn't work either.
>
> -- Lefty
>
>
> On Fri, Aug 19, 2016 at 9:46 PM, Sergey Shelukhin <ser...@hortonworks.com>
> wrote:
>
> > I am suggesting we always skip the number. So only one component gets the
> > next one :) In your example Hive trunk would be 2.3, and if SA is released
> > again it would become 2.4. Otherwise we’d need a compat table cause
> > versions will be totally out of sync.
> >
> > On 16/8/19, 16:31, "Owen O'Malley" <omal...@apache.org> wrote:
> >
> > >That won't necessarily work, especially in the beginning. If we release SA
> > >2.2.0 and use it for Hive trunk with the assumption that the next Hive
> > >release will be 2.2, what do we do when we need to make an incompatible
> > >change in SA? I guess we could release SA as 2.3.0 and, when Hive makes its
> > >next release, skip over Hive 2.2 and go straight to Hive 2.3.0. In general
> > >I think that we'd be better off with the release numbers not tied together.
> > >
> > >.. Owen
> > >
> > >On Fri, Aug 19, 2016 at 4:14 PM, Sergey Shelukhin <ser...@hortonworks.com>
> > >wrote:
> > >
> > >> Can we just run the versions thru? I.e. increment it every time but
> > >> release only one component (or both if they happen to align I guess).
> > >> E.g. storage-api will be released at 2.2, and say 2.3 if it moves fast,
> > >> then Hive 2.4, then storage-api 2.5, etc.
> > >> That might make it easier to reason about compatibility because the order
> > >> is obvious.
> > >>
> > >> On 16/8/19, 09:04, "Sergio Pena" <sergio.p...@cloudera.com> wrote:
> > >>
> > >> >I see Parquet is currently using the SearchArgument class for predicate
> > >> >pushdown.
> > >> >Will this class be part of the new sub-module or project?
> > >> >
> > >> >Following Sushanth's idea, can we have other API interfaces in the new
> > >> >project that other components can use?
> > >> >Perhaps having this may be a good reason to create a project.
> > >> >
> > >> >I'm -1 on the 4th minor version. As Owen mentioned, changing the 4th
> > >> >version number for incompatible changes is ugly and confusing.
> > >> >I like the new project idea more, +1, but the storage-api may be too
> > >> >small for a new project.
> > >> >
> > >> >- Sergio
> > >> >
> > >> >On Wed, Aug 17, 2016 at 2:05 PM, Owen O'Malley <omal...@apache.org>
> > >> >wrote:
> > >> >
> > >> >> On Wed, Aug 17, 2016 at 10:46 AM, Alan Gates <alanfga...@gmail.com>
> > >> >> wrote:
> > >> >>
> > >> >> > +1 for making the API clean and easy for other projects to work with.
> > >> >> > A few questions:
> > >> >> >
> > >> >> > 1) Would this also make it easier for Parquet and others to implement
> > >> >> > Hive’s ACID interfaces?
> > >> >> >
> > >> >>
> > >> >> Currently the ACID interfaces haven't been moved over to storage-api,
> > >> >> although it would make sense to do so at some point.
> > >> >>
> > >> >>
> > >> >> >
> > >> >> > 2) Would we make any attempt to coordinate version numbers between
> > >> >> > Hive and the storage module, or would a given version of Hive just
> > >> >> > depend on a given version of the storage module?
> > >> >> >
> > >> >>
> > >> >> The two options that I see are:
> > >> >>
> > >> >> * Let the numbers run separately starting from 2.2.0.
> > >> >> * Tie the numbers together with an additional level of versioning
> > >> >>   (e.g. 2.2.0.0).
> > >> >>
> > >> >> I think that letting the two version numbers diverge is better in the
> > >> >> long term. For example, if you need to make an incompatible change, it
> > >> >> is pretty ugly to do it as a fourth level version number (e.g. an
> > >> >> incompatible change from 2.2.0.0 to 2.2.0.1). At the beginning, I
> > >> >> expect that storage-api would move faster than Hive, but as it
> > >> >> stabilizes I expect it might start moving slower than Hive.
> > >> >>
> > >> >> I'd propose that we have Hive's build use a released version of
> > >> >> storage-api rather than a snapshot.
> > >> >>
> > >> >> Thoughts?
> > >> >>
> > >> >>    Owen
> > >> >>
> > >> >>
> > >> >> > Alan.
> > >> >> >
> > >> >> > > On Aug 15, 2016, at 17:01, Owen O'Malley <omal...@apache.org> wrote:
> > >> >> > >
> > >> >> > > All,
> > >> >> > >
> > >> >> > > As part of moving ORC out of Hive, we pulled all of the
> > >> >> > > vectorization storage and sarg classes into a separate module,
> > >> >> > > which is named storage-api.  Although it is currently only used by
> > >> >> > > ORC, it could be used by Parquet or Avro if they wanted to make a
> > >> >> > > fast vectorized reader that reads directly into Hive's
> > >> >> > > VectorizedRowBatch without needing a shim or data copy. Note that
> > >> >> > > this is in many ways similar to pulling the Arrow project out of
> > >> >> > > Drill.
> > >> >> > >
> > >> >> > > This unfortunately still leaves us with a circular dependency
> > >> >> > > between Hive and ORC. I'd hoped that storage-api wouldn't change
> > >> >> > > that much, but that doesn't seem to be happening. As a result, ORC
> > >> >> > > ends up shipping its own fork of storage-api.
> > >> >> > >
> > >> >> > > Although we could make a new project for just the storage-api, I
> > >> >> > > think it would be better to make it a subproject of Hive that is
> > >> >> > > released independently.
> > >> >> > >
> > >> >> > > What do others think?
> > >> >> > >
> > >> >> > >   Owen
> > >> >> >
> > >> >> >
> > >> >>
> > >>
> > >>
> >
> >
>
