It is fine to have not-completely-working states in the refactor branch. I recommend do whatever is the most expedient thing to help with making progress.
- Wes On Fri, Oct 13, 2017 at 5:42 PM, Siddharth Teotia <siddha...@dremio.com> wrote: > Li, > > I think there is some confusion. Are you suggesting merging into "java > vector refactor" branch or the master? Is it fine to merge stuff on the > former branch even though few things are broken (around 10 tests) ? If this > is allowed, I can do some cleanup (some documentation, some TODOs suggested > by you and Brian) and we can merge the current patch by EOD or over the > weekend. > > Is this okay? Since we are going to iterate over this branch and not going > to push anything to master until new code is stable, we are probably good. > > Thanks, > Siddhath > > > > On Fri, Oct 13, 2017 at 12:17 PM, Li Jin <ice.xell...@gmail.com> wrote: > >> Siddharth, >> >> Regarding rename: >> Yes this can be done later. >> >> Tests: >> I agree having code like https://github.com/apache/ >> arrow/pull/1164/files#diff-0876c9a0005d1dbaea321ea8d39d79ae is hard to >> maintain even temporarily. I am not sure what's the best way to resolve >> test failure wrt removing of the accessor/mutator from the vectors. Maybe >> we can have change the template the create non-accessor/mutator >> getter/setters and also remove acessor/mutator in the test for it to pass? >> What do you think is the easiest? >> >> Reader/Writer: >> Yes we can address this later. >> >> Apologies if I seem to add more work for merging https://github.com/ >> apache/arrow/pull/1164, that's not my intention, I think the PR looks >> good - >> just want to bring up some major design decisions so people can comment and >> discuss. >> >> Li >> >> >> >> >> On Fri, Oct 13, 2017 at 2:37 PM, Siddharth Teotia <siddha...@dremio.com> >> wrote: >> >> > I am not quite sure of the need to rename the vectors. Why do we need to >> > rename? This would first require us to remove all the vectors generated >> by >> > FixedValueVectors.java as they are non-nullable scalar vectors. Removing >> > non-nullable vectors is one of the goals, but it can be done once the new >> > infrastructure is properly setup? >> > >> > In order to merge the existing patch, I first need to address some >> (10-15) >> > failures -- few of them are correctness issues w.r.t >> TestVectorUnloadLoad, >> > TestArrowFile and rest all are related to getMutator(), getAccessor() >> > throwing UnsupportedOperationException. This is why I was saying earlier >> > that I will end up doing a lot of rework by writing redundant code where >> > (if vector instanceof NullableInt or vector instanceof NullableVarChar) >> we >> > don't use the mutator/accessor and for other vectors we use it for the >> > current patch. These if conditions are getting complicated with ugly type >> > casting in some parts of the code -- >> > https://github.com/apache/arrow/pull/1164/files#diff- >> > 0876c9a0005d1dbaea321ea8d39d79ae >> > >> > So I thought we can probably implement other vectors (remaining scalars, >> > map and list) where no vector has mutator/accessor and then for every >> > ValueVector, we can remove all calls to getMutator(), getAccessor() as >> > opposed to doing them selectively --- >> > https://github.com/apache/arrow/pull/1164/files#diff- >> > e9273a7b3b35ff7f40f101dc2cf95242 >> > >> > I will try to address these failures by EOD and see if this patch can be >> > merged first. >> > >> > Regarding readers and writers, can we address them subsequently? >> > >> > On Fri, Oct 13, 2017 at 11:03 AM, Li Jin <ice.xell...@gmail.com> wrote: >> > >> > > Siddharth, >> > > >> > > Thanks for the update. I think it's fine to move forward with more >> > vectors, >> > > but in the mean time, I think we should also prioritize to merge >> > > https://github.com/apache/arrow/pull/1164, here are a few comments >> needs >> > > to >> > > be addressed. >> > > >> > > (1) Backward-compatibility: >> > > I think there is no way to maintain backward compability as the new >> > vector >> > > classes will be renamed, but want to confirm we are OK with this >> > decision. >> > > We also think the disruption on the Spark side are OK as Spark's use >> case >> > > is simple and Bryan and I can take care of the code change. >> > > >> > > (2) Reader/writer classes: >> > > How does the reader/writer classes interact with the new and legacy >> > vector >> > > classes: >> > > >> > > Discussion: https://github.com/apache/arrow/pull/1164#discussion_ >> > > r144074264 >> > > >> > > My thoughts are: >> > > (1) ArrowReader classes should only return new vector classes >> > > (2) ArrowWriter classes should only work with new vector classes >> > > (3) To read/write legacy vectors, we can use adapters to turn legacy >> > > vectors to new vectors (zero-copy, as the underlying buffers should be >> > > transferred directly) >> > > >> > > Jacques also has a few comments, I don't know if they have been >> > addressed. >> > > >> > > For other comments, I think we can add TODO and do it later. I think we >> > can >> > > merge this PR if we address (1) (2) above. >> > > >> > > Comments? >> > > >> > > >> > > >> > > >> > > >> > > >> > > On Fri, Oct 13, 2017 at 12:36 PM, Siddharth Teotia < >> siddha...@dremio.com >> > > >> > > wrote: >> > > >> > > > The patch that I have put up https://github.com/apache/ >> arrow/pull/1198 >> > > > seems to be in a reasonable state. We are now working off a different >> > > > branch "java vector refactor". >> > > > >> > > > Now that we have the basic structure, in order to make quick forward >> > > > progress, I would like to go ahead and do for other types (FLOAT, >> > BIGINT >> > > > etc), list, map and create their legacy >> > > > counter parts -- doing them in subsequent patches is requiring me to >> > > write >> > > > some duplicate code and redundant if conditions in code that expects >> > all >> > > > the vectors to have mutator/accessor. >> > > > >> > > > Is that fine? Just wanted to check with people and ensure there >> aren't >> > > any >> > > > major concerns. >> > > > >> > > > The feedback on the PR (original one for master >> > > > https://github.com/apache/arrow/pull/1164) has been really good -- >> > some >> > > of >> > > > the comments are yet to be addressed and we jointly decided to >> address >> > > few >> > > > things (like Minor Type etc) after the refactoring has been done. >> > > > >> > > > On the testing front, as far as the correctness is concerned, I have >> > two >> > > > failures in TestArrowFile and TestValueVector. I have added some more >> > > tests >> > > > too. >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > On Thu, Oct 12, 2017 at 2:18 PM, Siddharth Teotia < >> > siddha...@dremio.com> >> > > > wrote: >> > > > >> > > > > Yes, that is the intention. Good that we all are on the same page. >> I >> > > will >> > > > > move the PR https://github.com/apache/arrow/pull/1164 to new >> branch. >> > > > > >> > > > > On Thu, Oct 12, 2017 at 11:20 AM, Li Jin <ice.xell...@gmail.com> >> > > wrote: >> > > > > >> > > > >> To make clear, I think it's fine to have Legacy Vectors in 0.8 as >> a >> > > > >> deprecated API. >> > > > >> >> > > > >> On Thu, Oct 12, 2017 at 2:19 PM, Li Jin <ice.xell...@gmail.com> >> > > wrote: >> > > > >> >> > > > >> > Siddharth, >> > > > >> > >> > > > >> > For working off a branch, Wes has created >> > > https://github.com/apache/ >> > > > >> > arrow/tree/java-vector-refactor that we can submit PR to. >> > > > >> > >> > > > >> > For Legacy vectors, I think it's fine because it's really just a >> > > > >> migration >> > > > >> > path to help Dremio to migrate to the new vectors. I don't think >> > > other >> > > > >> > users, i.e., Spark will use the Legacy vector class. Bryan and I >> > > will >> > > > >> just >> > > > >> > migrate Spark to new vectors directly because Spark's use of >> Arrow >> > > is >> > > > >> very >> > > > >> > simple. >> > > > >> > >> > > > >> > >> > > > >> > >> > > > >> > On Thu, Oct 12, 2017 at 2:08 PM, Siddharth Teotia < >> > > > siddha...@dremio.com >> > > > >> > >> > > > >> > wrote: >> > > > >> > >> > > > >> >> Thanks Bryan and Li. >> > > > >> >> >> > > > >> >> Yes, the goal is to get this (and the subsequent patches) >> merged >> > to >> > > > the >> > > > >> >> new >> > > > >> >> branch. Once it is stabilized from different aspects, we can >> move >> > > to >> > > > >> >> master. I am not sure of the exact mechanics when we work off a >> > > > >> different >> > > > >> >> project branch and not master. >> > > > >> >> >> > > > >> >> Does that sound good? >> > > > >> >> >> > > > >> >> Regarding compatibility, are we suggesting that let's not >> create >> > > > Legacy >> > > > >> >> Nullable vectors at all? The initial thoughts were to generate >> > > Legacy >> > > > >> >> vectors from NullableValueVectors template and these vectors >> are >> > > > >> >> mutator/accessor based (in today's world). Internally each >> > > operation >> > > > >> will >> > > > >> >> be delegated to new vectors (non code generated). >> > > > >> >> >> > > > >> >> On Thu, Oct 12, 2017 at 10:58 AM, Bryan Cutler < >> > cutl...@gmail.com> >> > > > >> wrote: >> > > > >> >> >> > > > >> >> > Thanks for the update Siddharth. From the Spark side of >> this, >> > I >> > > > >> >> definitely >> > > > >> >> > want to try to upgrade to the latest Arrow before the Spark >> 2.3 >> > > > >> release >> > > > >> >> but >> > > > >> >> > if it the refactor is too disruptive then others might get >> > > > squeamish >> > > > >> >> about >> > > > >> >> > upgrading. On the other hand, I don't think we should hold >> > back >> > > on >> > > > >> >> > refactoring for compatibility sake and the way it's looking >> now >> > > > >> trying >> > > > >> >> to >> > > > >> >> > be backwards-compatible will be too much of a pain. I will >> try >> > > to >> > > > >> >> figure >> > > > >> >> > out the timeline for Spark 2.3 and what the feeling is for >> > > > upgrading >> > > > >> >> > Arrow. Can we hold off on merging this to master for now and >> > > just >> > > > >> work >> > > > >> >> out >> > > > >> >> > of the separate branch until we can get a better feeling for >> > the >> > > > >> impact? >> > > > >> >> > >> > > > >> >> > Bryan >> > > > >> >> > >> > > > >> >> > On Wed, Oct 11, 2017 at 8:19 AM, Li Jin < >> ice.xell...@gmail.com >> > > >> > > > >> wrote: >> > > > >> >> > >> > > > >> >> > > Hi Siddharth, >> > > > >> >> > > >> > > > >> >> > > Thanks for the update. This looks good. >> > > > >> >> > > >> > > > >> >> > > A few thoughts: >> > > > >> >> > > >> > > > >> >> > > *Compatibility:* >> > > > >> >> > > It sounds like we will introduce some back-compatibility >> with >> > > the >> > > > >> new >> > > > >> >> > > Vector class. At this point I think our main Java users >> > should >> > > be >> > > > >> >> Spark >> > > > >> >> > and >> > > > >> >> > > Dremio, is this right? >> > > > >> >> > > >> > > > >> >> > > >> > > > >> >> > > - For Spark: >> > > > >> >> > > >> > > > >> >> > > It seems fine since Spark uses just the basic functionality >> > of >> > > > >> Vector >> > > > >> >> > > classes and the existing code should work with the new >> Vector >> > > > >> classes, >> > > > >> >> > > maybe even without any code change on the Spark side. >> > > > >> >> > > >> > > > >> >> > > >> > > > >> >> > > - For Dremio: >> > > > >> >> > > >> > > > >> >> > > Sounds like you are already taking care of this by >> > introducing >> > > > the >> > > > >> >> > > LegacyVector classes. >> > > > >> >> > > >> > > > >> >> > > >> > > > >> >> > > *Testing:* >> > > > >> >> > > >> > > > >> >> > > - Spark Integration Tests: >> > > > >> >> > > >> > > > >> >> > > Bryan and I can help with integration test with Spark. I >> > think >> > > > the >> > > > >> >> target >> > > > >> >> > > timeline for Spark 2.3 release is some time in mid Nov >> (Bryan >> > > > >> please >> > > > >> >> > > correct me if I am wrong). >> > > > >> >> > > >> > > > >> >> > > I will take a look at the PR today. >> > > > >> >> > > >> > > > >> >> > > >> > > > >> >> > > >> > > > >> >> > > >> > > > >> >> > > >> > > > >> >> > > >> > > > >> >> > > On Tue, Oct 10, 2017 at 4:29 PM, Siddharth Teotia < >> > > > >> >> siddha...@dremio.com> >> > > > >> >> > > wrote: >> > > > >> >> > > >> > > > >> >> > > > Hi All, >> > > > >> >> > > > >> > > > >> >> > > > I wanted to update everyone on state of this >> mini-project: >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > - Requirements document and initial design proposal >> were >> > > > sent >> > > > >> >> out to >> > > > >> >> > > the >> > > > >> >> > > > community for review and we have received some good >> > > > feedback. >> > > > >> All >> > > > >> >> > > > required >> > > > >> >> > > > docs are attached with corresponding JIRAs. >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > - The initial prototype is in a reasonable state >> > > > >> (code-complete). >> > > > >> >> > You >> > > > >> >> > > > can see the PR here - https://github.com/apache/arro >> > > > >> w/pull/1164 >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > - The prototype has code changes for the new >> hierarchy, >> > > > >> abstract >> > > > >> >> > > > interfaces for fixed width and variable width vectors >> > and >> > > > >> >> concrete >> > > > >> >> > > > implementation of NullableIntVector and >> > > > NullableVarCharVector. >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > Plan for testing and integrating into existing >> > > infrastructure: >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > - My initial thoughts are that this particular patch >> > will >> > > > >> >> require a >> > > > >> >> > > lot >> > > > >> >> > > > of testing, reviews etc since the foundation of rest >> of >> > > the >> > > > >> >> > > > implementation >> > > > >> >> > > > more or less depends on how the APIs are flushed out >> > here. >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > - So the goal is to get this properly tested and >> merged >> > > into >> > > > >> >> master >> > > > >> >> > > > first. >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > - The idea is to slowly deprecate and remove the >> > existing >> > > > >> >> vectors in >> > > > >> >> > > > stages. In this patch itself, we change the existing >> > > > >> >> > > > NullableValueVectors.java template to generate >> > > > >> >> > LegacyNullableIntVector >> > > > >> >> > > > and >> > > > >> >> > > > LegacyNullableVarCharVector. Each operation on these >> > > vectors >> > > > >> will >> > > > >> >> > > > delegate >> > > > >> >> > > > to the corresponding NullableIntVector and >> > > > >> NullableVarCharVector >> > > > >> >> > that >> > > > >> >> > > > are >> > > > >> >> > > > newly implemented. >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > - This achieves two goals w.r.t testing: >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > - Firstly, our existing JAVA unit tests will >> > automatically >> > > > >> >> exercise >> > > > >> >> > > the >> > > > >> >> > > > newly written code and its APIs (API names have not >> > > > >> changed) >> > > > >> >> for >> > > > >> >> > > > NullableInt and NullableVarChar vectors. >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > - Secondly, let's say we rebase Dremio on top of Arrow >> > > > master >> > > > >> and >> > > > >> >> > > > replace all references to NullableIntVector and >> > > > >> >> > > > NullableVarCharVector with >> > > > >> >> > > > their Legacy counterparts, things should still >> work. >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > - After this patch gets merged, we can do the >> following >> > > work >> > > > >> in >> > > > >> >> > > multiple >> > > > >> >> > > > patches: >> > > > >> >> > > > - Write concrete implementations for rest of the >> > > nullable >> > > > >> >> types >> > > > >> >> > -- >> > > > >> >> > > > FLOAT4, FLOAT8, BIGINT, VARBINARY etc >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > - Write additional tests (definitely needed but the >> > first >> > > > goal >> > > > >> >> is to >> > > > >> >> > > > make sure existing tests are not broken). >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > - Ensure NullableValueVectors template generates >> Legacy >> > > > >> vectors >> > > > >> >> and >> > > > >> >> > > each >> > > > >> >> > > > operation is merely a delegation to the API in new >> > > > >> >> > implementation. >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > - In the next Arrow release, remove all Legacy vectors >> > and >> > > > >> >> > > > NullableValueVectors template since we will have >> the >> > > > >> >> > implementation >> > > > >> >> > > > for >> > > > >> >> > > > each type that passes existing tests. >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > - I am currently inspecting the newly written code and >> > > > making >> > > > >> >> > changes >> > > > >> >> > > to >> > > > >> >> > > > the template to generate Legacy vector types for >> > > Nullable >> > > > >> Int >> > > > >> >> > > > and Nullable >> > > > >> >> > > > VarChar and delegating the operations. The changes >> > > should >> > > > >> be >> > > > >> >> > > > available in >> > > > >> >> > > > the PR in a couple of hours. >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > I am wondering if there are any other ideas around >> testing, >> > > > >> merging >> > > > >> >> > etc. >> > > > >> >> > > > Please feel free to reply here or comment on the PR. >> > > > >> >> > > > >> > > > >> >> > > > I would appreciate if people can take time to review the >> > code >> > > > in >> > > > >> PR >> > > > >> >> -- >> > > > >> >> > > > especially the abstract classes BaseNullableFixedWidth >> and >> > > > >> >> > > > BaseNullableVariableWidth. Writing concrete >> implementations >> > > for >> > > > >> >> other >> > > > >> >> > > types >> > > > >> >> > > > will be much less hassle if these abstract classes have >> > > proper >> > > > >> code. >> > > > >> >> > > > >> > > > >> >> > > > Thanks, >> > > > >> >> > > > Siddharth >> > > > >> >> > > > >> > > > >> >> > > >> > > > >> >> > >> > > > >> >> >> > > > >> > >> > > > >> > >> > > > >> >> > > > > >> > > > > >> > > > >> > > >> > >>