I created https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone
some time ago to try to track the status of different implementations
and various in-flight discussions about columnar format evolution. Can
some others take a look at that and perhaps update some sections?

I agree with having at least 2 complete implementations and so we have
a good amount of implementation shortfall (e.g. delta dictionaries) to
address already.

On Mon, Mar 18, 2019 at 12:51 AM Paul Taylor <ptay...@apache.org> wrote:
>
> Hi Jacques,
>
> > I think we should have two complete implementations. I don't think having
> > one feature in C# and Go and another in JavaScript and Rust does justice to
> > the project goals.
>
> Agree 100%. We may already be in this situation with the DictionaryBatch
> "isDelta" flag. I haven't checked the C++ in a while so I may be
> mistaken, but I think JS is the only impl with support for interleaved
> Dictionary/RecordBatches. It'd be good to put a process in place that
> helps avoid this in the future.
>
> > I think Java and C++ should always be complete. They are
> > the first two implementations. I believe they are the most complete and
> > broadly used/popular (C++ given Python & Pandas integration and Java via
> > Spark & Dremio).
> No argument here either, though I should mention with the exception of
> Tensor messages the JS version is also feature-complete from the
> standpoint of the format.
>
> It's still early in terms of adoption, but we've seen some interest from
> the Vega, Jupyter, and Uber Deck.gl projects in either contributing to
> or integrating with ArrowJS.
>
> So while we're certainly not at the level of Spark or Pandas, we may be
> poised for wider adoption, and I'd request we take the JS implementation
> into account when making format changes. I'm happy to implement new
> features and update the integration tests as necessary.
>
> > Are there specific changes to format/ that have been merged that you
> > are concerned about that you feel need to be discussed separately?
> The thing that springs to mind is anything to do with 64-bit indexing,
> as recently discussed in the sparse matrix thread. IIRC none of the JS
> engines presently allow allocating buffers greater than 2GiB.
> Limitations in JS shouldn't block other implementations from moving
> ahead, but it would be good for the community to come to a consensus on
> guidance or workarounds for JS interop when we are in that sort of
> situation.
>
> Thanks,
>
> Paul
>
>
> On 3/17/19 6:07 PM, Jacques Nadeau wrote:
> >> How about "at least two native implementations" instead of
> >> "Java and C++"? Now, we have multiple native
> >> implementations:
> >>
> > I think we should have two complete implementations. I don't think having
> > one feature in C# and Go and another in JavaScript and Rust does justice to
> > the project goals. I think Java and C++ should always be complete. They are
> > the first two implementations. I believe they are the most complete and
> > broadly used/popular (C++ given Python & Pandas integration and Java via
> > Spark & Dremio). This is a compromise between setting a high barrier for
> > creation of new features and making sure that we have validated things
> > across impls.
> >
> > Are there specific changes to format/ that have been merged that you
> > are concerned about that you feel need to be discussed separately?
> > There have been some changes related to serializing tensor metadata
> > that are clearly marked as experimental, and they also do not interact
> > with the columnar format.
> >
> > There are several things we've introduced over time that suffered this
> > problem. Alignment changes, dictionary encoding, union behavior, interval
> > behavior, tensors, unsigned integrations, etc that we've failed to make
> > sure we have integration tests for. I've meant to send this email for
> > months but saw a couple of recent proposed changes which made me feel like
> > we should discuss further.
> >