In regards to your question about returning DataType&, one of the goals of the proposal is to pass shared pointers to concrete data type sub-classes in array constructors to increase type safety. If you return a reference in the type() function, you would need to make a copy of the data type to get a shared pointer to pass to the array sub-class constructor. That would be violating one of the design principles you stated earlier, that creating a situation where arrow::DataType objects have to be copied is undesirable.
On Sat, May 19, 2018 at 5:37 PM, Wes McKinney <wesmck...@gmail.com> wrote: > hi Josh, > > On Sat, May 19, 2018 at 1:53 PM, Joshua Storck <joshuasto...@gmail.com> > wrote: > > I've updated the document with an extra section to include code snippets > > for alternative designs to address concerns raised here on the mailing > list. > > Thanks, I will look in more detail and reply more later > > > > > MSVC 19 2017 supports covariant return types, and the code presented > above > > successfully compiled on gcc.godbolt.org using that compiler. > > > > I would agree that the covariant_{shared, derived}_ptr classes themselves > > are a bit complicated, and I wasn't particularly happy with the fact that > > there's an extra pointer floating around. I've presented two alternatives > > in the document that would eliminate the need for those classes. The > first > > solution makes the Array::type() function pure virtual and has > sub-classes > > implement without covariant return types. Instead, the sub-classes > provide > > a function with a different name that returns the concrete data type > > sub-class (e.g. - ListArray::list_type()). The second solution uses C++ > > name hiding. In that design, the Array::type() function is not virtual > and > > calls through to a pure virtual function called get_type(). The subclass > > would implement get_type() and simply return a shared pointer to a > DataType > > object, but would also name hide Array::type() with its own > implementation > > that returns a shared pointer the sub-class data type. While the name > > hiding keeps the API cleaner (i.e. - you always call type()), I find that > > it's one of the less used and more obscure features of C++ that tends to > > confuse more novice to intermediate C++ programmers. > > I might be missing something, but what would be wrong with having a new > method > > virtual const DataType& concrete_type() const = 0; > > and > > const T& concrete_type() const; > > in subclasses? Then you don't have to write down a static cast yourself > > > > > The design does not suggest removing the ArrayData class, but instead > > removing the type_ member. It sounds like for the ArrayData class you are > > trying to eschew the C++ type hierarchy and move to a more C oriented > > design with type punning. I think the compromise here would be to have > > The classes in array.h are strongly-typed accessor interfaces rather > than primary data structures, at least the way the code is written > right now. > > The objective of internal::ArrayData is to have a canonical POD data > structure that is the same for all arrays and for each concrete Array > subclass to have a standard constructor. It additionally allows > functions to be written or generated that accept and return ArrayData. > This lets us avoid dynamic dispatch already in a number of places. > > The need for ArrayData as it is now arose pretty much right away when > we started writing the Cast kernels in > https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute. > This makes it possible for functions to interact uniformly with an > Array's internal data without necessarily having to specialize on the > subtype. As an example, we have functions like > https://github.com/apache/arrow/blob/bed07e93b3b068ed5750946212ba2e > 0e6feaf61a/cpp/src/arrow/compute/kernels/util-internal.h#L99 > which are agnostic to the concrete type, which is handled elsewhere. > > My guess is removing the complete type object from ArrayData would > require significant refactoring of arrow/compute. I'm not sure where > the child metadata would be tracked, then (right now it's in > ArrayData::child_data). > > As an aside, I had considered instead of ArrayData having its fields > be on the base Array type. The problem with this is that in internal > library, if you need to manipulate the data in an array (like the > type, null count, or buffers), you have some problems > > * You must work in the domain of concrete subtypes, since constructing > the base Array should not be allowed > * You may end up defining an auxiliary data structure just like > ArrayData, and writing a function to copy the fields of each concrete > Array type into that auxiliary data structure > > I admit that the need for ArrayData in its current form is not obvious > at first glance. In libraries like TensorFlow, things are vastly > simpler because the multidimensional array memory model is so much > simpler -- you can write functions that take "const Tensor&" for any > value type, whereas in Arrow the data structure depends on the runtime > type. > > To summarize my view at the moment (and I will look in more detail to > understand the problems and solutions better), I would like to see > significantly more development progress in arrow/compute and see what > other kinds of problems we run into in that code, which will feature a > great deal of dynamic dispatch / kernel selection based on various > runtime type and hardware information. > > > ArrayData store the type id but not the shared pointer to the DataType. > You > > can take that approach one step further, and make the ArrayData a C > struct > > with its own API that would successfully compile with a C compiler. You > > could then add a C++ wrapper that deals with that class and is used by > the > > various C++ classes in the Array/DataType type hierarchy. The benefit of > > taking that approach is that you can easily inter-operate between C++ and > > C. In other words, a C only arrow API would be using the richer C++ API > > underneath, but the user of the API would only be able to obtain or pass > > ArrayData C structs from/to the API. > > This would be pretty tricky right now because of memory management > (i.e. buffer subclasses). One could effectively implement a virtual > buffer interface in C but then we're reinventing wheels that C++ > provides us. > > > > > This proposal was in no way trying to address the static / compile-time > > problem. It was specifically targeted at making it easier to obtain > > concrete data types in the dynamic schema setting when dealing with > > arbitrary nesting of types. I do like the idea of a compile-time > hierarchy > > (e.g. - StaticListArray<StaticStructArray<StaticInt32Array, > > StaticListArray<StaticInt32Array> >), but that will require a lot of > > template meta-programming. If the covariant_{shared,derived}_ptr classes > > above seem too complicated, I'm pretty sure that a compile-time hierarchy > > will seem immensely more complicated. Bottom line: I would want to see a > > concrete use case before implementing a compile-time type hierarchy with > > templates. > > > > Got it, I agree that the compile-time hierarchy is a separate matter. > > - Wes > > > On Sat, May 19, 2018 at 9:06 AM, Wes McKinney <wesmck...@gmail.com> > wrote: > > > >> hi Josh, > >> > >> Thank you for putting together this document. These changes are > >> interesting and worth giving some consideration. Some initial > >> reactions / questions to get a discussion going: > >> > >> * How is MSVC support for covariant return types? It doesn't look like > >> it's supported, which (I'm sorry to say) would be a deal breaker > >> https://msdn.microsoft.com/en-us/library/8z9feath.aspx > >> * I don't think creating a situation where arrow::DataType objects > >> have to be copied is a good idea. Schemas and types will be reused, > >> particular in a streaming data setting > >> * Adding a covariant shared_ptr seems very complicated; it seems the > >> primary benefit will be in code where we would currently have to > >> introduce a dynamic visitor pattern. Could the same benefit be reaped > >> by adding a _new_ virtual base method that returns a pointer or > >> const-ref to the concrete type? This is a less invasive change > >> * I think it is beneficial to have a simple internal data structure > >> (i.e. ArrayData) that describes any Arrow array object > >> > >> From reading this document, I am wondering if it would not be worth > >> thinking about creating a from-scratch set of array and type data > >> structures intended for use in a static / compile-time setting. Since > >> Arrow features "dynamic" schemas (in the sense that Avro is a more > >> dynamic version of Protocol Buffers or Thrift), we inevitably will be > >> dealing with the problem of dynamic dispatch from type information > >> known only at runtime. I agree that it would be useful to eliminate > >> usages of dynamic visitor pattern in the codebase where possible. > >> > >> - Wes > >> > >> On Sat, May 19, 2018 at 12:48 AM, joshuasto...@gmail.com > >> <joshuasto...@gmail.com> wrote: > >> > I've put together a proposal for using covariant return types in > >> Array::type() in the C++ library. I wanted to get some feedback before > >> putting together a PR in case it's too controversial or would require to > >> much re-factoring of the code: > >> > > >> > https://docs.google.com/document/d/14mLO9uNIcrb-yTj_ > >> byBxwJeYbGy88JS9fQ6aOa46XRc/edit?usp=sharing > >> > > >> > It would be nice to use Google Doc's comment feature, but I would > >> imagine it may be safer to memorialize the discussion here on the > mailing > >> list. > >> >