For performance reasons, VectorizedRowBatch doesn't follow "traditional" good object-oriented design rules, for better or worse. We made a number of member variables public so they can be accessed directly (e.g. in LongColumnVector the long[] vector field is public), and we avoided putting an interface in front of the ColumnVector family so that callers get faster direct object access.
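
To make the trade-off concrete, here is a minimal Java sketch of the public-field style. LongColumn, RowBatch, and sumColumn are hypothetical stand-ins written for illustration, not the real storage-api classes; the only detail taken from the description above is that the long[] vector is a public field read directly rather than through an interface.

```java
// Illustrative stand-ins only; these are NOT the real Hive storage-api classes.
final class DirectAccessSketch {

    /** Stand-in for a long column: the backing array is a public field. */
    static final class LongColumn {
        public final long[] vector;

        LongColumn(int capacity) {
            this.vector = new long[capacity];
        }
    }

    /** Stand-in for a row batch: the columns and the row count are public fields. */
    static final class RowBatch {
        public final LongColumn[] cols;
        public int size; // number of valid rows currently in the batch

        RowBatch(int numCols, int capacity) {
            this.cols = new LongColumn[numCols];
            for (int c = 0; c < numCols; c++) {
                this.cols[c] = new LongColumn(capacity);
            }
        }
    }

    /** Sums one column by reading the public array directly, with no getter per row. */
    static long sumColumn(RowBatch batch, int col) {
        long[] values = batch.cols[col].vector; // fetch the array reference once
        long total = 0;
        for (int i = 0; i < batch.size; i++) {
            total += values[i]; // plain array read inside the hot loop
        }
        return total;
    }

    public static void main(String[] args) {
        RowBatch batch = new RowBatch(1, 1024);
        for (int i = 0; i < 4; i++) {
            batch.cols[0].vector[i] = i + 1; // a writer fills the array in place
        }
        batch.size = 4;
        System.out.println(sumColumn(batch, 0)); // prints 10
    }
}
```

The cost of this style is that nothing stops a caller from leaving size and the arrays out of sync, which is the "for better or worse" part.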
________________________________________
From: Sergio Pena <sergio.p...@cloudera.com>
Sent: Friday, August 26, 2016 12:58 PM
To: dev
Subject: Re: [DISCUSS] Making storage-api a separately released artifact

Question:
Wouldn't it be better to move part of the implementations to ORC, Parquet and Avro, and just have some interfaces and basic implementations in Hive? This way we could avoid having ORC, Parquet and/or Avro depend on Hive.

I saw this in Parquet, where they created a RowBatch class internally and return it to Hive; in Hive we would then just bind it to the Hive vectorized interface to support vectorization.

It's just an idea; I am not clear exactly what I am trying to say :)

On Fri, Aug 19, 2016 at 11:01 PM, Lefty Leverenz <leftylever...@gmail.com> wrote:

> Sergey's idea is creative, although it leads to confusion about JIRA fix
> versions. Issues would be given fix versions based on assumptions about
> whether SA or Hive will be released first. (That's hard to predict when
> it's months away.)
>
> Keeping the version numbers tied together is very appealing. Would it be
> possible to have incompatible changes in SA force a bump in the Hive
> release number? Hm, I guess that means Hive would need a release at the
> same time as SA, but only for incompatible changes.
>
> What's the likelihood of another subproject getting spun off eventually?
> If that happened, the 4th minor version wouldn't make sense. A 5th minor
> version wouldn't work either.
>
> -- Lefty
>
>
> On Fri, Aug 19, 2016 at 9:46 PM, Sergey Shelukhin <ser...@hortonworks.com>
> wrote:
>
> > I am suggesting we always skip the number. So only one component gets the
> > next one :) In your example Hive trunk would be 2.3, and if SA is released
> > again it would become 2.4. Otherwise we’d need a compat table cause
> > versions will be totally out of sync.
> >
> > On 16/8/19, 16:31, "Owen O'Malley" <omal...@apache.org> wrote:
> >
> > >That won't necessarily work, especially in the beginning. If we release SA
> > >2.2.0 and use it for Hive trunk with the assumption that the next Hive
> > >release will be 2.2. What do we do when we need to make an incompatible
> > >change in SA? I guess we could release SA as 2.3.0 and when hive makes its
> > >next release skip over Hive 2.2 and go straight to Hive 2.3.0. In general I
> > >think that we'd be better off with the release numbers not tied together.
> > >
> > >.. Owen
> > >
> > >On Fri, Aug 19, 2016 at 4:14 PM, Sergey Shelukhin <ser...@hortonworks.com>
> > >wrote:
> > >
> > >> Can we just run the versions thru? I.e. increment it every time but
> > >> release only one component (or both if they happen to align I guess).
> > >> E.g. storage-api will be released at 2.2, and say 2.3 if it moves fast,
> > >> then Hive 2.4, then storage-api 2.5, etc.
> > >> That might make it easier to reason about compatibility because the
> > >> order is obvious.
> > >>
> > >> On 16/8/19, 09:04, "Sergio Pena" <sergio.p...@cloudera.com> wrote:
> > >>
> > >> >I see Parquet is currently using the SearchArgument class for
> > >> >predicates push down.
> > >> >Will this class be part of the new sub-module or project?
> > >> >
> > >> >Following Sushanth idea, can we have other API interfaces in the new
> > >> >project that other components can use?
> > >> >Perhaps having this may be a good reason to create a project.
> > >> >
> > >> >I'm -1 with the 4th minor version. As Owen mentioned, changing the 4th
> > >> >version number for incompatible changes is ugly and confusing.
> > >> >I like the new project idea more, +1, but the storage-api may be too
> > >> >small for a new project.
> > >> >
> > >> >- Sergio
> > >> >
> > >> >On Wed, Aug 17, 2016 at 2:05 PM, Owen O'Malley <omal...@apache.org>
> > >> >wrote:
> > >> >
> > >> >> On Wed, Aug 17, 2016 at 10:46 AM, Alan Gates <alanfga...@gmail.com>
> > >> >> wrote:
> > >> >>
> > >> >> > +1 for making the API clean and easy for other projects to work
> > >> >> > with. A few questions:
> > >> >> >
> > >> >> > 1) Would this also make it easier for Parquet and others to
> > >> >> > implement Hive’s ACID interfaces?
> > >> >> >
> > >> >>
> > >> >> Currently the ACID interfaces haven't been moved over to storage-api,
> > >> >> although it would make sense to do so at some point.
> > >> >>
> > >> >> >
> > >> >> > 2) Would we make any attempt to coordinate version numbers between
> > >> >> > Hive and the storage module, or would a given version of Hive just
> > >> >> > depend on a given version of the storage module?
> > >> >> >
> > >> >>
> > >> >> The two options that I see are:
> > >> >>
> > >> >> * Let the numbers run separately starting from 2.2.0.
> > >> >> * Tie the numbers together with an additional level of versioning
> > >> >>   (eg. 2.2.0.0).
> > >> >>
> > >> >> I think that letting the two version numbers diverge is better in the
> > >> >> long term. For example, if you need to make an incompatible change, it
> > >> >> is pretty ugly to do it as a fourth level version number (eg. an
> > >> >> incompatible change from 2.2.0.0 to 2.2.0.1). At the beginning, I
> > >> >> expect that storage-api would move faster than Hive, but as it
> > >> >> stabilizes I expect it might start moving slower than Hive.
> > >> >>
> > >> >> I'd propose that we have Hive's build use a released version of
> > >> >> storage-api rather than a snapshot.
> > >> >>
> > >> >> Thoughts?
> > >> >>
> > >> >> Owen
> > >> >>
> > >> >>
> > >> >> > Alan.
> > >> >> >
> > >> >> > > On Aug 15, 2016, at 17:01, Owen O'Malley <omal...@apache.org>
> > >> >> > > wrote:
> > >> >> > >
> > >> >> > > All,
> > >> >> > >
> > >> >> > > As part of moving ORC out of Hive, we pulled all of the
> > >> >> > > vectorization storage and sarg classes into a separate module,
> > >> >> > > which is named storage-api. Although it is currently only used by
> > >> >> > > ORC, it could be used by Parquet or Avro if they wanted to make a
> > >> >> > > fast vectorized reader that read directly in to Hive's
> > >> >> > > VectorizedRowBatch without needing a shim or data copy. Note that
> > >> >> > > this is in many ways similar to pulling the Arrow project out of
> > >> >> > > Drill.
> > >> >> > >
> > >> >> > > This unfortunately still leaves us with a circular dependency
> > >> >> > > between Hive and ORC. I'd hoped that storage-api wouldn't change
> > >> >> > > that much, but that doesn't seem to be happening. As a result, ORC
> > >> >> > > ends up shipping its own fork of storage-api.
> > >> >> > >
> > >> >> > > Although we could make a new project for just the storage-api, I
> > >> >> > > think it would be better to make it a subproject of Hive that is
> > >> >> > > released independently.
> > >> >> > >
> > >> >> > > What do others think?
> > >> >> > >
> > >> >> > > Owen