In C++ at least, I think making Nullable<T> a template (see here: https://github.com/danrobinson/arrow-demo/blob/master/types.h) would make it easier to both define arbitrary classes and to write parsers that take advantage of template specialization.
Re: 2 vs 3 code paths: handling the null-skipping case can be a single line of code at the start of the function (which could be reduced to a macro): if (arr.null_count() == 0) return ALGORITHM_NAME(arr.child_array()); And it seems like good practice anyway to separate the null_count=0 code into a separate function. On Thu, Feb 25, 2016 at 11:25 PM, Wes McKinney <w...@cloudera.com> wrote: > Inline > > On Thu, Feb 25, 2016 at 7:54 PM, Daniel Robinson > <danrobinson...@gmail.com> wrote: > > Thanks; specific responses inline below. > > > > I'm mostly persuaded, though—I never doubted the C++ code will end up > > perfectly optimized. My main interest in this is the type definitions at > > the metadata level, for which I really do think Nullable<T> should be > > considered a nested type like List<T> or Struct<T, U>. > > > > There is little difference in fundamental semantics and no difference > in memory layout, so it would probably be better in the main reference > implementations to choose the more pragmatic option. It's worth noting > that in Parquet metadata, variable-length collections are indeed > nested types. For example, ?List<?int32> could be represented in > Parquet metadata as (so-called 3-level array coding): > > optional group bag > repeated group list > optional int32 item > > I haven't been far enough down the path of using Arrow / ValueVectors > as the Apache Drill team has, so I'd be interested in hearing more of > their perspective on this. > > We have not yet started coming up with the canonical > (cross-implementation) metadata for Arrow IPC / RPC -- I think that > would be helpful to have alongside container classes to use for > transporting and manipulating the physical data. > > > > > On Thu, Feb 25, 2016 at 9:17 PM, Wes McKinney <w...@cloudera.com> wrote: > > > >> Inline responses > >> > >> On Thu, Feb 25, 2016 at 4:23 PM, Daniel Robinson > >> <danrobinson...@gmail.com> wrote: > >> > Hi Wes, > >> > > >> > Thanks for the response! > >> > > >> > I see the appeal of representing all arrays at the data structure > level > >> as > >> > "nullable," but I think I've convinced myself that it's better to > >> represent > >> > them all as non-nullable, and bolt on "nullability" as a wrapper > type. In > >> > addition to the reasons I mentioned yesterday (such as localizing > logic > >> for > >> > dealing with null values in its own class), I think there's a pretty > >> strong > >> > efficiency case for it. If you don't distinguish between nullable and > >> > non-nullable data structures at the data structure level, you're > >> > essentially hiding useful optimization information from the compiler. > >> > > >> > >> I question whether this optimization information is useful given many > >> of the target workloads (e.g. scan-based analytics). On one code > >> branch (if null_count > 0), a bitmap is examined either 1-bit at a > >> time or using popcount (1 word's worth of values at a time -- this > >> could improve CPU instruction pipelining), in the other branch the > >> bitmap is ignored. > >> > >> > > Makes sense. > > > >> For example, arrays should probably have a simple API for getting a > value > >> > from a specified slot number, something like get(int32_t offset). This > >> > makes it easier to write algorithms that don't have to worry about the > >> > underlying memory structure of the array. If all arrays at the data > >> > structure level are considered "nullable", this code would have to do > an > >> > "if (null_count == 0)" check every time get() is called, which is > >> > inefficient and somewhat inelegant. > >> > > >> > >> I have been expecting that any user of an Arrow C/C++ library would be > >> expected to understand the physical memory layout; with functions to > >> access values made available as a convenience. So I would not want you > >> to call get(i) unless you know that it is not null and not out of > >> bounds. > >> > > > > I included a safe_get function that assumes the value is within bounds, > as > > I discuss below. But I see the case for making the main get() API assume > > that the value is neither null nor out of bounds. > > > > > > > >> > >> > I made a demo showing how I'd envision implementing nullable arrays in > >> C++ > >> > with nested types, and how algorithms on them would work: > >> > https://github.com/danrobinson/arrow-demo > >> > > >> > Right now it's just nullable and primitive types and arrays, but I'll > try > >> > adding list types and builders to see how it compares to the > >> > string-handling code you linked to. > >> > > >> > One key advantage I see of making Nullable<T> its own nested type is > that > >> > the algorithm code can make full use of C++'s template specialization. > >> For > >> > example, here's how I define the versions of the Mean algorithm that > work > >> > on nullable arrays and those that work on non-nullable arrays (see > >> > https://github.com/danrobinson/arrow-demo/blob/master/algorithms.h): > >> > > >> > // happens when you call Mean() on a non-nullable array > >> > template<typename T> > >> > double Mean(Array<T>& arr) { > >> > return (double)Sum(arr) / (arr.length()); > >> > } > >> > > >> > // happens when you call Mean() on a nullable array > >> > template<typename T> > >> > double Mean(Array< Nullable<T> >& nullableArr) { > >> > return (double)Sum(nullableArr) / (nullableArr.length() - > >> > nullableArr.null_count()); > >> > } > >> > > >> > This is only possible if nullability is included in the > >> > data-structure-level type information. > >> > > >> > As you suggested, the code also uses a shortcut when Sum() is called > on a > >> > "nullable" array that has null_count of 0: it just calls Sum() on its > >> child > >> > array (see line 24). I think this, too, is simpler if we aren't > cramming > >> > nullability logic into every array class. > >> > > >> > There are a few other tricks that could be used whether or not you > treat > >> > Nullable as a separate type. For example, we can implement an > overloaded > >> > get() function that takes, in addition to a slot offset, a default > value > >> > that should be returned if that slot is null (see > >> > https://github.com/danrobinson/arrow-demo/blob/master/arrays.h#L114). > >> This > >> > let me write the main code for the Sum() algorithm only once, since > for > >> its > >> > purposes, 0 is equivalent to null. (It would also presumably work with > >> > Product() and more importantly sorting). > >> > > >> > >> Cool, thanks for making a prototype! Working code is worth 1000 words =) > >> > >> I'm not convinced by the performance argument. For example, this > >> snippet has multiple if-then-else branches (which could possibly throw > >> exceptions) and so would not be advisable to put in the inner loop of > >> algorithms: > >> > >> value_type get(int32_t slotNumber, value_type null_value) { > >> check_bounds(slotNumber); > >> return safe_get(slotNumber, null_value); > >> } > >> > >> "in time critical code, throwing an exception should be the exception, > >> not the rule." > >> http://www.boost.org/community/error_handling.html > >> > >> > > As I mentioned, I included a safe_get() function that doesn't do a bounds > > check, and that lets you pass a default value to be returned if it is > null. > > If inner loops use that, there is only one branch, the null check, and no > > exceptions. (My Sum() code stupidly didn't use it; I've fixed it now.) > If > > they need more complex null-handling logic they could call isNull() on > the > > nullable array then call get() on the child array. > > > > But I appreciate your point that the most optimized algorithms would need > > to understand the memory at a low level. Though I wonder how easy it will > > be to specify such low-level algorithms for recursive algorithms on > > arbitrarily nested data. > > > >> Again, would appreciate any thoughts. > >> > > >> > I definitely agree that the interface for users should be the priority > >> > (after the memory format). My own motivation for thinking about this > is > >> my > >> > interest in making easy APIs for creating or accessing these memory > >> > structures in higher-level dynamically typed languages. > >> > > >> > >> I agree that making Arrow data structures accessible and easy-to-use > >> in dynamic languages is desirable. I don't think that adding such > >> overhead to reference C++ algorithms (which should probably be as > >> close-to-the-metal and performance-oriented at possible, and expect > >> users to use them responsibly) is a good idea, since usability / > >> accessibility generally comes at the expense of CPU cycles and cache > >> efficiency. > >> > >> For example, it would be straightforward to build "airbags", so to > >> speak, in a Python wrapper C extension, since performance is less > >> critical in user APIs oriented at accessibility and safety / > >> error-handling. In dynamic languages this is especially important as > >> bugs in e.g. Python user code should never result in segfaults/core > >> dumps. > >> > >> > > I completely agree with this; didn't mean to suggest there should be > > runtime overhead in the C++ functions that are used in algorithms. (The > > exceptions in the functions were meant never to be thrown, but I see the > > case for leaving them out entirely.) > > > > My main concern was avoiding such unnecessary overhead (such as null > checks > > in a type that shouldn't be nullable). I think there's no downside to > > having *shortcuts *at *compile time*—and, almost as significantly, > simpler > > and safer code—by encoding some information about nullability in the C++ > > type information. But I appreciate that optimized code would be able to > > avoid that. > > If you have both Nullable<T> and T, since the code for Nullable<T> > (with null count == 0) and T will be identical algorithmically, you > have strictly more code to write and maintain (3 code paths instead of > 2) for every algorithm where skipping null checks improves > performance. > > I would also be very worried about code like this appearing: > > Array* CallFOnFoos(Array* arr) { > if (arr->nullable()) { > return F(static_cast<NullableFooArray*>(arr)); > } else { > return F(static_cast<FooArray*>(arr)); > } > } > > My view is that nullability is a metadata concern and mainly serves to > add complexity to algorithms / analytics (because of the computational > equivalence with nullable data without any actual nulls) and C++ > development overhead (a more complex type hierarchy to reason about). > > (In a FP language like Haskell, such type systems do afford more > compile-time safety which make them worthwhile, but this isn't as true > in C++) > > - Wes > > > > > > > > >> cheers, > >> Wes > >> > >> > On Wed, Feb 24, 2016 at 6:36 PM, Wes McKinney <w...@cloudera.com> > wrote: > >> > > >> >> hi Dan, > >> >> > >> >> Thanks for these thoughts! There's a few separate problems > >> >> > >> >> - Physical memory representation > >> >> - Metadata > >> >> - Implementation container type layering / user API > >> >> > >> >> We aren't expecting any system that uses Arrow to necessarily use one > >> >> of the reference Arrow implementations, but I expect an ecosystem of > >> >> tools will develop around the reference implementations (e.g. Parquet > >> >> or Avro adapters). > >> >> > >> >> In the thread I started last week, I proposed striking this > >> >> distinction and instead relegating REQUIRED (non-nullable) and > >> >> OPTIONAL (nullable) to the domain of metadata. In my opinion, having > >> >> nullability in the container types at all adds much code complexity, > >> >> and the two cases of > >> >> > >> >> - null count == 0 > >> >> - non-nullable > >> >> > >> >> Must be treated as distinct, but in general algorithmically > equivalent > >> >> (e.g. the bitmap need not be allocated, and would be ignored even if > >> >> so in most analytics code requiring knowledge of null-ness). > >> >> > >> >> In trying to come up with functional type systems for representing > the > >> >> semantics of the data, I find it is most useful to start from the > >> >> desired user API (which sufficiently captures the nuances of the > >> >> physical memory layout) and work backwards to an implementation. > >> >> > >> >> I would be interested to see some rough C++ / pseudocode to help > >> >> illustrate how this would play out in downstream user code using > >> >> libarrow and its headers, and internal algorithms in that operate on > >> >> arrow Array instances. I've seen many functional programmers work on > >> >> these data structure + type system problems and end up chasing their > >> >> tails for a very long time. On the C++ side, that code you see in > >> >> apache/arrow is a limited prototype that I haven't used to build any > >> >> applications yet, so it'd be great to iterate on some working > >> >> prototypes of other layerings. For example, we could work on: > >> >> > >> >> > >> >> > >> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/types/string-test.cc#L199 > >> >> > >> >> As pseudocode, suppose we had a function: > >> >> > >> >> double Sum(DoubleArray* arr) { > >> >> double total = 0; > >> >> double* values = arr->values(); > >> >> if (arr->null_count() == 0) { > >> >> for (int i = 0; i < arr->length(); ++i) { > >> >> total += values[i]; > >> >> } > >> >> } else { > >> >> for (int i = 0; i < arr->length(); ++i) { > >> >> if (arr->NotNull(i)) { > >> >> total += values[i]; > >> >> } > >> >> } > >> >> } > >> >> return total; > >> >> } > >> >> > >> >> If you had instead NullableDoubleArray and DoubleArray, you might > want > >> >> to generate different functions. This would likely be true for any > >> >> algorithm where nulls are semantically meaningful. > >> >> > >> >> For me at least, the most important aspect of Arrow is the memory > >> >> layout. The purpose of the implementations is to provide a reference > >> >> user API to make it as convenient and fast as possible to create > >> >> Arrow-layout data, manipulate data in-memory, and send/receive data > >> >> via IPC/shared-memory. > >> >> > >> >> - Wes > >> >> > >> >> On Wed, Feb 24, 2016 at 2:18 PM, Daniel Robinson > >> >> <danrobinson...@gmail.com> wrote: > >> >> > (The below mostly uses terminology and examples from the draft spec > >> >> > <https://github.com/apache/arrow/blob/master/format/Layout.md> and > >> C++ > >> >> > implementation <https://github.com/apache/arrow/tree/master/cpp>). > >> >> > > >> >> > Under the current spec, if I understand it correctly, there are two > >> >> > versions of every type: a nullable version and a non-nullable > version. > >> >> > Whether a type T is nullable is specified by an attribute on that > >> type, > >> >> so > >> >> > declarations look something like T [nullable]: > >> >> > > >> >> > Int32 [non-nullable] > >> >> > > >> >> > Int32 [nullable] > >> >> > > >> >> > > >> >> > In contrast, a variable-length version of type T is declared using > a > >> >> nested > >> >> > or parametric type, List<T>. > >> >> > > >> >> > > >> >> > List [non-nullable] <Int32 [non-nullable]> > >> >> > List [nullable] <Int32 [non-nullable]> > >> >> > List [non-nullable] <Int32 [nullable]> > >> >> > List [nullable] <Int32 [nullable] > > >> >> > > >> >> > Consistent with this, might it make sense to use a nested type, > >> >> > *Nullable<T>*, to declare a nullable version of that type? A slot > in > >> an > >> >> > array of type Nullable<T> would be able to contain either a null > >> value or > >> >> > any value of type T. Using this system, the above examples would > look > >> >> like: > >> >> > > >> >> > Int32 > >> >> > Nullable<Int32> > >> >> > List<Int32> > >> >> > > >> >> > Nullable<List<Int32>> > >> >> > List<Nullable<Int32>> > >> >> > Nullable<List<Nullable<Int32>>>> > >> >> > > >> >> > > >> >> > This could also be paralleled by a tweak to implementations (which > >> >> wouldn't > >> >> > change how the data is stored in physical memory). Right now, in > the > >> C++ > >> >> > implementation, any array might be nullable, so the base Array > class > >> >> > contains the logic necessary to deal with nullable arrays, > including > >> some > >> >> > potentially unnecessary branching. > >> >> > > >> >> > Instead, much like how an array of type List<T> is instantiated as > a > >> >> > ListArray > >> >> > < > >> >> > >> > https://github.com/apache/arrow/blob/23c4b08d154f8079806a1f0258d7e4af17bdf5fd/cpp/src/arrow/types/list.h > >> >> > > >> >> > that contains its own int32 array of offsets and a reference to a > >> child > >> >> > "values" array of type T > >> >> > <https://github.com/apache/arrow/blob/master/format/Layout.md>, > every > >> >> array > >> >> > of type Nullable<T> could be instantiated as a NullableArray > >> containing > >> >> its > >> >> > own bitmask and a reference to a child array of type T. (It could > also > >> >> keep > >> >> > track of the null_count, if that idea is implemented.) > >> >> > > >> >> > The child array would not have to know that it is wrapped by a > >> >> > NullableArray, and we could abstract almost all of the logic for > >> dealing > >> >> > with null values out of the Array and ArrayBuilder classes. > >> (ArrayBuilder > >> >> > would need a method for incrementing an array's length without > adding > >> a > >> >> > particular value, but it already needs that to build a child array > in > >> a > >> >> Sparse > >> >> > Union > >> >> > < > >> >> > >> > https://github.com/apache/arrow/blob/master/format/diagrams/layout-sparse-union.png > >> >> > > >> >> > ). > >> >> > > >> >> > I see a few potential additional advantages to this approach: > >> >> > > >> >> > - It would effectively halve the number of distinct types. > >> >> > - As mentioned above, it would be consistent with how the Arrow > >> spec > >> >> > treats List types and other nested types. Treating "nullability" > >> as a > >> >> > special attribute on each type seems more consistent with how > >> >> something > >> >> > like Dremel treats both "optional" and "repeated" fields. > >> >> > - It would also be consistent with how "option types > >> >> > <https://en.wikipedia.org/wiki/Option_type>" are conventionally > >> >> declared > >> >> > in other languages. For example, C# has Nullable<T> > >> >> > <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Julia > >> has > >> >> > Nullable{T} > >> >> > < > >> >> > >> > http://docs.julialang.org/en/release-0.4/manual/types/#nullable-types-representing-missing-values > >> >> >, > >> >> > Java > >> >> > has Optional<T> > >> >> > < > https://docs.oracle.com/javase/8/docs/api/java/util/Optional.html > >> >, > >> >> > and Haskell > >> >> > has Maybe a > >> >> > < > >> >> > https://hackage.haskell.org/package/base-4.8.2.0/docs/Data-Maybe.html>. > >> >> > - The implementation would be similar to a special case of > >> >> SparseUnion, > >> >> > which makes sense, since type Nullable<T> is equivalent to T ∪ > >> >> NullType, > >> >> > where NullType is a type that can only hold the null value. > >> >> > - If the spec ever adds a Masking<T> type that allows a > selection > >> of a > >> >> > vector using a bitmask, I imagine it, too, would be implemented > >> >> similarly. > >> >> > - Parsing a type declaration as a tree would be simpler. > >> >> > - Declarations of non-nullable types would be shorter, and > >> >> non-nullable > >> >> > types would be the default. I think this makes sense, since it > >> means > >> >> types > >> >> > such as Int32 correspond exactly to their C-type, and people > would > >> be > >> >> less > >> >> > likely to be surprised by unexpected null values at runtime. > >> >> Additionally, > >> >> > given the overhead (in speed, memory, complexity, and safety) of > >> >> nullable > >> >> > types, I think programmers should have to explicitly declare > them. > >> >> > > >> >> > I'd appreciate any thoughts, especially from those with more > >> involvement > >> >> in > >> >> > the project or experience with related projects. > >> >> > > >> >> > One more thought (which could apply even if nullability is kept as > a > >> >> > special attribute). To make the string representation of nullable > >> types > >> >> > more readable, it might be worth considering a convention from C# > >> >> > <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Swift > >> >> > < > >> http://lithium3141.com/blog/2014/06/19/learning-swift-optional-types/>, > >> >> > Dremel > >> >> > < > >> >> > >> > http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf > >> >> >, > >> >> > and Flow <http://flowtype.org/docs/nullable-types.html>: using > either > >> >> T? or > >> >> > ?T to mean Nullable<T>. > >> >> > > >> >> > Int32 > >> >> > ?Int32 > >> >> > List<Int32> > >> >> > ?List<Int32> > >> >> > List<?Int32> > >> >> > ?List<?Int32> > >> >> > > >> >> > Int32 > >> >> > Int32? > >> >> > List<Int32> > >> >> > List?<Int32> > >> >> > List<Int32?> > >> >> > List?<Int32?> > >> >> > >> >> My inclination is still to remove nullable from the C++ array > >> >> container types, but to keep track of the required (non-nullable) / > >> >> optional (nullable) bit in the type metadata so that it can be kept > >> >> track of in IPC / schema exchange with other systems as necessary. > >> >> > >> >