Siddharth, Regarding rename: Yes this can be done later.
Tests: I agree having code like https://github.com/apache/ arrow/pull/1164/files#diff-0876c9a0005d1dbaea321ea8d39d79ae is hard to maintain even temporarily. I am not sure what's the best way to resolve test failure wrt removing of the accessor/mutator from the vectors. Maybe we can have change the template the create non-accessor/mutator getter/setters and also remove acessor/mutator in the test for it to pass? What do you think is the easiest? Reader/Writer: Yes we can address this later. Apologies if I seem to add more work for merging https://github.com/ apache/arrow/pull/1164, that's not my intention, I think the PR looks good - just want to bring up some major design decisions so people can comment and discuss. Li On Fri, Oct 13, 2017 at 2:37 PM, Siddharth Teotia <siddha...@dremio.com> wrote: > I am not quite sure of the need to rename the vectors. Why do we need to > rename? This would first require us to remove all the vectors generated by > FixedValueVectors.java as they are non-nullable scalar vectors. Removing > non-nullable vectors is one of the goals, but it can be done once the new > infrastructure is properly setup? > > In order to merge the existing patch, I first need to address some (10-15) > failures -- few of them are correctness issues w.r.t TestVectorUnloadLoad, > TestArrowFile and rest all are related to getMutator(), getAccessor() > throwing UnsupportedOperationException. This is why I was saying earlier > that I will end up doing a lot of rework by writing redundant code where > (if vector instanceof NullableInt or vector instanceof NullableVarChar) we > don't use the mutator/accessor and for other vectors we use it for the > current patch. These if conditions are getting complicated with ugly type > casting in some parts of the code -- > https://github.com/apache/arrow/pull/1164/files#diff- > 0876c9a0005d1dbaea321ea8d39d79ae > > So I thought we can probably implement other vectors (remaining scalars, > map and list) where no vector has mutator/accessor and then for every > ValueVector, we can remove all calls to getMutator(), getAccessor() as > opposed to doing them selectively --- > https://github.com/apache/arrow/pull/1164/files#diff- > e9273a7b3b35ff7f40f101dc2cf95242 > > I will try to address these failures by EOD and see if this patch can be > merged first. > > Regarding readers and writers, can we address them subsequently? > > On Fri, Oct 13, 2017 at 11:03 AM, Li Jin <ice.xell...@gmail.com> wrote: > > > Siddharth, > > > > Thanks for the update. I think it's fine to move forward with more > vectors, > > but in the mean time, I think we should also prioritize to merge > > https://github.com/apache/arrow/pull/1164, here are a few comments needs > > to > > be addressed. > > > > (1) Backward-compatibility: > > I think there is no way to maintain backward compability as the new > vector > > classes will be renamed, but want to confirm we are OK with this > decision. > > We also think the disruption on the Spark side are OK as Spark's use case > > is simple and Bryan and I can take care of the code change. > > > > (2) Reader/writer classes: > > How does the reader/writer classes interact with the new and legacy > vector > > classes: > > > > Discussion: https://github.com/apache/arrow/pull/1164#discussion_ > > r144074264 > > > > My thoughts are: > > (1) ArrowReader classes should only return new vector classes > > (2) ArrowWriter classes should only work with new vector classes > > (3) To read/write legacy vectors, we can use adapters to turn legacy > > vectors to new vectors (zero-copy, as the underlying buffers should be > > transferred directly) > > > > Jacques also has a few comments, I don't know if they have been > addressed. > > > > For other comments, I think we can add TODO and do it later. I think we > can > > merge this PR if we address (1) (2) above. > > > > Comments? > > > > > > > > > > > > > > On Fri, Oct 13, 2017 at 12:36 PM, Siddharth Teotia <siddha...@dremio.com > > > > wrote: > > > > > The patch that I have put up https://github.com/apache/arrow/pull/1198 > > > seems to be in a reasonable state. We are now working off a different > > > branch "java vector refactor". > > > > > > Now that we have the basic structure, in order to make quick forward > > > progress, I would like to go ahead and do for other types (FLOAT, > BIGINT > > > etc), list, map and create their legacy > > > counter parts -- doing them in subsequent patches is requiring me to > > write > > > some duplicate code and redundant if conditions in code that expects > all > > > the vectors to have mutator/accessor. > > > > > > Is that fine? Just wanted to check with people and ensure there aren't > > any > > > major concerns. > > > > > > The feedback on the PR (original one for master > > > https://github.com/apache/arrow/pull/1164) has been really good -- > some > > of > > > the comments are yet to be addressed and we jointly decided to address > > few > > > things (like Minor Type etc) after the refactoring has been done. > > > > > > On the testing front, as far as the correctness is concerned, I have > two > > > failures in TestArrowFile and TestValueVector. I have added some more > > tests > > > too. > > > > > > > > > > > > > > > > > > On Thu, Oct 12, 2017 at 2:18 PM, Siddharth Teotia < > siddha...@dremio.com> > > > wrote: > > > > > > > Yes, that is the intention. Good that we all are on the same page. I > > will > > > > move the PR https://github.com/apache/arrow/pull/1164 to new branch. > > > > > > > > On Thu, Oct 12, 2017 at 11:20 AM, Li Jin <ice.xell...@gmail.com> > > wrote: > > > > > > > >> To make clear, I think it's fine to have Legacy Vectors in 0.8 as a > > > >> deprecated API. > > > >> > > > >> On Thu, Oct 12, 2017 at 2:19 PM, Li Jin <ice.xell...@gmail.com> > > wrote: > > > >> > > > >> > Siddharth, > > > >> > > > > >> > For working off a branch, Wes has created > > https://github.com/apache/ > > > >> > arrow/tree/java-vector-refactor that we can submit PR to. > > > >> > > > > >> > For Legacy vectors, I think it's fine because it's really just a > > > >> migration > > > >> > path to help Dremio to migrate to the new vectors. I don't think > > other > > > >> > users, i.e., Spark will use the Legacy vector class. Bryan and I > > will > > > >> just > > > >> > migrate Spark to new vectors directly because Spark's use of Arrow > > is > > > >> very > > > >> > simple. > > > >> > > > > >> > > > > >> > > > > >> > On Thu, Oct 12, 2017 at 2:08 PM, Siddharth Teotia < > > > siddha...@dremio.com > > > >> > > > > >> > wrote: > > > >> > > > > >> >> Thanks Bryan and Li. > > > >> >> > > > >> >> Yes, the goal is to get this (and the subsequent patches) merged > to > > > the > > > >> >> new > > > >> >> branch. Once it is stabilized from different aspects, we can move > > to > > > >> >> master. I am not sure of the exact mechanics when we work off a > > > >> different > > > >> >> project branch and not master. > > > >> >> > > > >> >> Does that sound good? > > > >> >> > > > >> >> Regarding compatibility, are we suggesting that let's not create > > > Legacy > > > >> >> Nullable vectors at all? The initial thoughts were to generate > > Legacy > > > >> >> vectors from NullableValueVectors template and these vectors are > > > >> >> mutator/accessor based (in today's world). Internally each > > operation > > > >> will > > > >> >> be delegated to new vectors (non code generated). > > > >> >> > > > >> >> On Thu, Oct 12, 2017 at 10:58 AM, Bryan Cutler < > cutl...@gmail.com> > > > >> wrote: > > > >> >> > > > >> >> > Thanks for the update Siddharth. From the Spark side of this, > I > > > >> >> definitely > > > >> >> > want to try to upgrade to the latest Arrow before the Spark 2.3 > > > >> release > > > >> >> but > > > >> >> > if it the refactor is too disruptive then others might get > > > squeamish > > > >> >> about > > > >> >> > upgrading. On the other hand, I don't think we should hold > back > > on > > > >> >> > refactoring for compatibility sake and the way it's looking now > > > >> trying > > > >> >> to > > > >> >> > be backwards-compatible will be too much of a pain. I will try > > to > > > >> >> figure > > > >> >> > out the timeline for Spark 2.3 and what the feeling is for > > > upgrading > > > >> >> > Arrow. Can we hold off on merging this to master for now and > > just > > > >> work > > > >> >> out > > > >> >> > of the separate branch until we can get a better feeling for > the > > > >> impact? > > > >> >> > > > > >> >> > Bryan > > > >> >> > > > > >> >> > On Wed, Oct 11, 2017 at 8:19 AM, Li Jin <ice.xell...@gmail.com > > > > > >> wrote: > > > >> >> > > > > >> >> > > Hi Siddharth, > > > >> >> > > > > > >> >> > > Thanks for the update. This looks good. > > > >> >> > > > > > >> >> > > A few thoughts: > > > >> >> > > > > > >> >> > > *Compatibility:* > > > >> >> > > It sounds like we will introduce some back-compatibility with > > the > > > >> new > > > >> >> > > Vector class. At this point I think our main Java users > should > > be > > > >> >> Spark > > > >> >> > and > > > >> >> > > Dremio, is this right? > > > >> >> > > > > > >> >> > > > > > >> >> > > - For Spark: > > > >> >> > > > > > >> >> > > It seems fine since Spark uses just the basic functionality > of > > > >> Vector > > > >> >> > > classes and the existing code should work with the new Vector > > > >> classes, > > > >> >> > > maybe even without any code change on the Spark side. > > > >> >> > > > > > >> >> > > > > > >> >> > > - For Dremio: > > > >> >> > > > > > >> >> > > Sounds like you are already taking care of this by > introducing > > > the > > > >> >> > > LegacyVector classes. > > > >> >> > > > > > >> >> > > > > > >> >> > > *Testing:* > > > >> >> > > > > > >> >> > > - Spark Integration Tests: > > > >> >> > > > > > >> >> > > Bryan and I can help with integration test with Spark. I > think > > > the > > > >> >> target > > > >> >> > > timeline for Spark 2.3 release is some time in mid Nov (Bryan > > > >> please > > > >> >> > > correct me if I am wrong). > > > >> >> > > > > > >> >> > > I will take a look at the PR today. > > > >> >> > > > > > >> >> > > > > > >> >> > > > > > >> >> > > > > > >> >> > > > > > >> >> > > > > > >> >> > > On Tue, Oct 10, 2017 at 4:29 PM, Siddharth Teotia < > > > >> >> siddha...@dremio.com> > > > >> >> > > wrote: > > > >> >> > > > > > >> >> > > > Hi All, > > > >> >> > > > > > > >> >> > > > I wanted to update everyone on state of this mini-project: > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > - Requirements document and initial design proposal were > > > sent > > > >> >> out to > > > >> >> > > the > > > >> >> > > > community for review and we have received some good > > > feedback. > > > >> All > > > >> >> > > > required > > > >> >> > > > docs are attached with corresponding JIRAs. > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > - The initial prototype is in a reasonable state > > > >> (code-complete). > > > >> >> > You > > > >> >> > > > can see the PR here - https://github.com/apache/arro > > > >> w/pull/1164 > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > - The prototype has code changes for the new hierarchy, > > > >> abstract > > > >> >> > > > interfaces for fixed width and variable width vectors > and > > > >> >> concrete > > > >> >> > > > implementation of NullableIntVector and > > > NullableVarCharVector. > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > Plan for testing and integrating into existing > > infrastructure: > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > - My initial thoughts are that this particular patch > will > > > >> >> require a > > > >> >> > > lot > > > >> >> > > > of testing, reviews etc since the foundation of rest of > > the > > > >> >> > > > implementation > > > >> >> > > > more or less depends on how the APIs are flushed out > here. > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > - So the goal is to get this properly tested and merged > > into > > > >> >> master > > > >> >> > > > first. > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > - The idea is to slowly deprecate and remove the > existing > > > >> >> vectors in > > > >> >> > > > stages. In this patch itself, we change the existing > > > >> >> > > > NullableValueVectors.java template to generate > > > >> >> > LegacyNullableIntVector > > > >> >> > > > and > > > >> >> > > > LegacyNullableVarCharVector. Each operation on these > > vectors > > > >> will > > > >> >> > > > delegate > > > >> >> > > > to the corresponding NullableIntVector and > > > >> NullableVarCharVector > > > >> >> > that > > > >> >> > > > are > > > >> >> > > > newly implemented. > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > - This achieves two goals w.r.t testing: > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > - Firstly, our existing JAVA unit tests will > automatically > > > >> >> exercise > > > >> >> > > the > > > >> >> > > > newly written code and its APIs (API names have not > > > >> changed) > > > >> >> for > > > >> >> > > > NullableInt and NullableVarChar vectors. > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > - Secondly, let's say we rebase Dremio on top of Arrow > > > master > > > >> and > > > >> >> > > > replace all references to NullableIntVector and > > > >> >> > > > NullableVarCharVector with > > > >> >> > > > their Legacy counterparts, things should still work. > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > - After this patch gets merged, we can do the following > > work > > > >> in > > > >> >> > > multiple > > > >> >> > > > patches: > > > >> >> > > > - Write concrete implementations for rest of the > > nullable > > > >> >> types > > > >> >> > -- > > > >> >> > > > FLOAT4, FLOAT8, BIGINT, VARBINARY etc > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > - Write additional tests (definitely needed but the > first > > > goal > > > >> >> is to > > > >> >> > > > make sure existing tests are not broken). > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > - Ensure NullableValueVectors template generates Legacy > > > >> vectors > > > >> >> and > > > >> >> > > each > > > >> >> > > > operation is merely a delegation to the API in new > > > >> >> > implementation. > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > - In the next Arrow release, remove all Legacy vectors > and > > > >> >> > > > NullableValueVectors template since we will have the > > > >> >> > implementation > > > >> >> > > > for > > > >> >> > > > each type that passes existing tests. > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > - I am currently inspecting the newly written code and > > > making > > > >> >> > changes > > > >> >> > > to > > > >> >> > > > the template to generate Legacy vector types for > > Nullable > > > >> Int > > > >> >> > > > and Nullable > > > >> >> > > > VarChar and delegating the operations. The changes > > should > > > >> be > > > >> >> > > > available in > > > >> >> > > > the PR in a couple of hours. > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > I am wondering if there are any other ideas around testing, > > > >> merging > > > >> >> > etc. > > > >> >> > > > Please feel free to reply here or comment on the PR. > > > >> >> > > > > > > >> >> > > > I would appreciate if people can take time to review the > code > > > in > > > >> PR > > > >> >> -- > > > >> >> > > > especially the abstract classes BaseNullableFixedWidth and > > > >> >> > > > BaseNullableVariableWidth. Writing concrete implementations > > for > > > >> >> other > > > >> >> > > types > > > >> >> > > > will be much less hassle if these abstract classes have > > proper > > > >> code. > > > >> >> > > > > > > >> >> > > > Thanks, > > > >> >> > > > Siddharth > > > >> >> > > > > > > >> >> > > > > > >> >> > > > > >> >> > > > >> > > > > >> > > > > >> > > > > > > > > > > > > > >