The patch that I have put up https://github.com/apache/arrow/pull/1198
seems to be in a reasonable state. We are now working off a different
branch "java vector refactor".

Now that we have the basic structure,  in order to make quick forward
progress, I would like to go ahead and do for other types (FLOAT, BIGINT
etc), list, map and create their legacy
counter parts -- doing them in subsequent patches is requiring me to write
some duplicate code and redundant if conditions in code that expects all
the vectors to have mutator/accessor.

Is that fine? Just wanted to check with people and ensure there aren't any
major concerns.

The feedback on the PR (original one for master
https://github.com/apache/arrow/pull/1164) has been really good -- some of
the comments are yet to be addressed and we jointly decided to address few
things (like Minor Type etc) after the refactoring has been done.

On the testing front, as far as the correctness is concerned, I have two
failures in TestArrowFile and TestValueVector. I have added some more tests
too.





On Thu, Oct 12, 2017 at 2:18 PM, Siddharth Teotia <siddha...@dremio.com>
wrote:

> Yes, that is the intention. Good that we all are on the same page. I will
> move the PR https://github.com/apache/arrow/pull/1164 to new branch.
>
> On Thu, Oct 12, 2017 at 11:20 AM, Li Jin <ice.xell...@gmail.com> wrote:
>
>> To make clear, I think it's fine to have Legacy Vectors in 0.8 as a
>> deprecated API.
>>
>> On Thu, Oct 12, 2017 at 2:19 PM, Li Jin <ice.xell...@gmail.com> wrote:
>>
>> > Siddharth,
>> >
>> > For working off a branch, Wes has created https://github.com/apache/
>> > arrow/tree/java-vector-refactor that we can submit PR to.
>> >
>> > For Legacy vectors, I think it's fine because it's really just a
>> migration
>> > path to help Dremio to migrate to the new vectors. I don't think other
>> > users, i.e., Spark will use the Legacy vector class. Bryan and I will
>> just
>> > migrate Spark to new vectors directly because Spark's use of Arrow is
>> very
>> > simple.
>> >
>> >
>> >
>> > On Thu, Oct 12, 2017 at 2:08 PM, Siddharth Teotia <siddha...@dremio.com
>> >
>> > wrote:
>> >
>> >> Thanks Bryan and Li.
>> >>
>> >> Yes, the goal is to get this (and the subsequent patches) merged to the
>> >> new
>> >> branch. Once it is stabilized from different aspects, we can move to
>> >> master. I am not sure of the exact mechanics when we work off a
>> different
>> >> project branch and not master.
>> >>
>> >> Does that sound good?
>> >>
>> >> Regarding compatibility, are we suggesting that let's not create Legacy
>> >> Nullable vectors at all? The initial thoughts were to generate Legacy
>> >> vectors from NullableValueVectors template and these vectors are
>> >> mutator/accessor based (in today's world). Internally each operation
>> will
>> >> be delegated to new vectors (non code generated).
>> >>
>> >> On Thu, Oct 12, 2017 at 10:58 AM, Bryan Cutler <cutl...@gmail.com>
>> wrote:
>> >>
>> >> > Thanks for the update Siddharth.  From the Spark side of this, I
>> >> definitely
>> >> > want to try to upgrade to the latest Arrow before the Spark 2.3
>> release
>> >> but
>> >> > if it the refactor is too disruptive then others might get squeamish
>> >> about
>> >> > upgrading.  On the other hand, I don't think we should hold back on
>> >> > refactoring for compatibility sake and the way it's looking now
>> trying
>> >> to
>> >> > be backwards-compatible will be too much of a pain.  I will try to
>> >> figure
>> >> > out the timeline for Spark 2.3 and what the feeling is for upgrading
>> >> > Arrow.  Can we hold off on merging this to master for now and just
>> work
>> >> out
>> >> > of the separate branch until we can get a better feeling for the
>> impact?
>> >> >
>> >> > Bryan
>> >> >
>> >> > On Wed, Oct 11, 2017 at 8:19 AM, Li Jin <ice.xell...@gmail.com>
>> wrote:
>> >> >
>> >> > > Hi Siddharth,
>> >> > >
>> >> > > Thanks for the update. This looks good.
>> >> > >
>> >> > > A few thoughts:
>> >> > >
>> >> > > *Compatibility:*
>> >> > > It sounds like we will introduce some back-compatibility with the
>> new
>> >> > > Vector class. At this point I think our main Java users should be
>> >> Spark
>> >> > and
>> >> > > Dremio, is this right?
>> >> > >
>> >> > >
>> >> > >    - For Spark:
>> >> > >
>> >> > > It seems fine since Spark uses just the basic functionality of
>> Vector
>> >> > > classes and the existing code should work with the new Vector
>> classes,
>> >> > > maybe even without any code change on the Spark side.
>> >> > >
>> >> > >
>> >> > >    - For Dremio:
>> >> > >
>> >> > > Sounds like you are already taking care of this by introducing the
>> >> > > LegacyVector classes.
>> >> > >
>> >> > >
>> >> > > *Testing:*
>> >> > >
>> >> > >    - Spark Integration Tests:
>> >> > >
>> >> > > Bryan and I can help with integration test with Spark. I think the
>> >> target
>> >> > > timeline for Spark 2.3 release is some time in mid Nov (Bryan
>> please
>> >> > > correct me if I am wrong).
>> >> > >
>> >> > > I will take a look at the PR today.
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > > On Tue, Oct 10, 2017 at 4:29 PM, Siddharth Teotia <
>> >> siddha...@dremio.com>
>> >> > > wrote:
>> >> > >
>> >> > > > Hi All,
>> >> > > >
>> >> > > > I wanted to update everyone on state of this mini-project:
>> >> > > >
>> >> > > >
>> >> > > >    - Requirements document and initial design proposal were sent
>> >> out to
>> >> > > the
>> >> > > >    community for review and we have received some good feedback.
>> All
>> >> > > > required
>> >> > > >    docs are attached with corresponding JIRAs.
>> >> > > >
>> >> > > >
>> >> > > >    - The initial prototype is in a reasonable state
>> (code-complete).
>> >> > You
>> >> > > >    can see the PR here - https://github.com/apache/arro
>> w/pull/1164
>> >> > > >
>> >> > > >
>> >> > > >    - The prototype has code changes for the new hierarchy,
>> abstract
>> >> > > >    interfaces for fixed width and variable width vectors and
>> >> concrete
>> >> > > >    implementation of NullableIntVector and NullableVarCharVector.
>> >> > > >
>> >> > > >
>> >> > > > Plan for testing and integrating into existing infrastructure:
>> >> > > >
>> >> > > >
>> >> > > >    - My initial thoughts are that this particular patch will
>> >> require a
>> >> > > lot
>> >> > > >    of testing, reviews etc since the foundation of rest of the
>> >> > > > implementation
>> >> > > >    more or less depends on how the APIs are flushed out here.
>> >> > > >
>> >> > > >
>> >> > > >    - So the goal is to get this properly tested and merged into
>> >> master
>> >> > > >    first.
>> >> > > >
>> >> > > >
>> >> > > >    - The idea is to slowly deprecate and remove the existing
>> >> vectors in
>> >> > > >    stages. In this patch itself, we change the existing
>> >> > > >    NullableValueVectors.java template to generate
>> >> > LegacyNullableIntVector
>> >> > > > and
>> >> > > >    LegacyNullableVarCharVector. Each operation on these vectors
>> will
>> >> > > > delegate
>> >> > > >    to the corresponding NullableIntVector and
>> NullableVarCharVector
>> >> > that
>> >> > > > are
>> >> > > >    newly implemented.
>> >> > > >
>> >> > > >
>> >> > > >    - This achieves two goals w.r.t testing:
>> >> > > >
>> >> > > >
>> >> > > >    - Firstly, our existing JAVA unit tests will automatically
>> >> exercise
>> >> > > the
>> >> > > >       newly written code and its APIs (API names have not
>> changed)
>> >> for
>> >> > > >       NullableInt and NullableVarChar vectors.
>> >> > > >
>> >> > > >
>> >> > > >    - Secondly, let's say we rebase Dremio on top of Arrow master
>> and
>> >> > > >       replace all references to NullableIntVector and
>> >> > > > NullableVarCharVector with
>> >> > > >       their Legacy counterparts, things should still work.
>> >> > > >
>> >> > > >
>> >> > > >    - After this patch gets merged, we can do the following work
>> in
>> >> > > multiple
>> >> > > >    patches:
>> >> > > >       - Write concrete implementations for rest of the nullable
>> >> types
>> >> > --
>> >> > > >       FLOAT4, FLOAT8, BIGINT, VARBINARY etc
>> >> > > >
>> >> > > >
>> >> > > >    - Write additional tests (definitely needed but the first goal
>> >> is to
>> >> > > >       make sure existing tests are not broken).
>> >> > > >
>> >> > > >
>> >> > > >    - Ensure NullableValueVectors template generates Legacy
>> vectors
>> >> and
>> >> > > each
>> >> > > >       operation is merely a delegation to the API in new
>> >> > implementation.
>> >> > > >
>> >> > > >
>> >> > > >    - In the next Arrow release, remove all Legacy vectors and
>> >> > > >       NullableValueVectors template since we will have the
>> >> > implementation
>> >> > > > for
>> >> > > >       each type that passes existing tests.
>> >> > > >
>> >> > > >
>> >> > > >    - I am currently inspecting the newly written code and making
>> >> > changes
>> >> > > to
>> >> > > >       the template to generate Legacy vector types for Nullable
>> Int
>> >> > > > and Nullable
>> >> > > >       VarChar and delegating the operations. The changes should
>> be
>> >> > > > available in
>> >> > > >       the PR in a couple of hours.
>> >> > > >
>> >> > > >
>> >> > > > I am wondering if there are any other ideas around testing,
>> merging
>> >> > etc.
>> >> > > > Please feel free to reply here or comment on the PR.
>> >> > > >
>> >> > > > I would appreciate if people can take time to review the code in
>> PR
>> >> --
>> >> > > > especially the abstract classes BaseNullableFixedWidth and
>> >> > > > BaseNullableVariableWidth. Writing concrete implementations for
>> >> other
>> >> > > types
>> >> > > > will be much less hassle if these abstract classes have proper
>> code.
>> >> > > >
>> >> > > > Thanks,
>> >> > > > Siddharth
>> >> > > >
>> >> > >
>> >> >
>> >>
>> >
>> >
>>
>
>

Reply via email to