The layering of C++ container classes and algorithms will not have any
bearing on the physical memory layout of the arrays. If we were
programming in C, a primitive array could be represented something like:

typedef struct array_t {
  int type;
  int32_t length;
  int32_t null_count;
  const uint8_t* isnull_bitmap;
  const uint8_t* data;
} array_t;

If null_count == 0, then isnull_bitmap is permitted to be NULL.

What I am arguing is that instead of having nullability be a property of
the data structure, systems that require it can track it in the schema
metadata (but an Arrow implementation could largely ignore it if so
desired; note that many data processing systems -- all the ones I work
with actively -- deal only in nullable data):

typedef struct array_schema_t {
  const char* name;
  int type;
  bool nullable;
} array_schema_t;

At metadata / data interchange boundaries, applications can verify that
the physical data meets the requirements of the metadata (hypothetically,
if a system tells you it expects non-nullable int32, we would not send it
data containing nulls).
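Hypothetically, a check at such a boundary might look something like this
(just a sketch building on the structs above; the function name is made
up):

// Verify that physical data meets the schema's requirements: a
// non-nullable field must not actually contain any nulls.
bool check_conformance(const array_schema_t* schema, const array_t* array) {
  if (array->type != schema->type) return false;
  if (!schema->nullable && array->null_count > 0) return false;
  return true;
}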
From a C++ code factoring point of view, introducing a nullable template
"box" has many more cons than pros. Extensions in Python and R wrapping
the C++ library would be much more work to develop, too. Though that's a
separate issue from the question of how to handle non-nullable schema
metadata in general.

- Wes

On Fri, Feb 26, 2016 at 7:50 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
> In the abstract (since I haven't written any code), let me see if I can
> make an argument for considering "nullable int" and "int" to both be
> worthwhile "primitive" types, as opposed to "Nullable<int>" being a
> constructed type over the primitive type "int", in the C++ arena.
>
> Let's assume Arrow's use case is to manage arrays of numbers, i.e.
> Array<number_t>. We have two choices for nullability:
>
> Array<nullable_int_t>
> Array<Nullable<int_t>>
>
> I think what we want from the data structure is for an array of nullable
> ints (ignoring special cases like an array of nullable ints where none of
> the ints, at runtime, happens to be null) to be laid out in memory as
> std::pair<std::vector<bool>, std::vector<int_t>>. We probably don't want
> std::vector<std::pair<bool, int_t>>, because CPUs are a Thing (let me
> know if this shorthand doesn't make sense, I can elaborate).
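> In rough C++ terms, the layout I have in mind is something like this (a
> sketch only; the type names are invented):
>
>   #include <cstdint>
>   #include <vector>
>
>   // struct-of-arrays: null flags and values live in separate
>   // contiguous buffers, so scans over the values stay cache-friendly
>   struct NullableIntArray {
>     std::vector<bool> isnull;     // one flag per slot
>     std::vector<int32_t> values;  // contents undefined where isnull
>   };
>
>   // as opposed to the interleaved layout we want to avoid:
>   // std::vector<std::pair<bool, int32_t>>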
> If we define separate nullable primitive types and non-nullable primitive
> types, then we can template specialize for the nullable types and factor
> out the null bits into their own array. If we require that Nullable be its
> own template, it's also possible to specialize the Array template for
> Array<Nullable<T>>, but I think the template code becomes a lot more
> complex. I'd be happy to be proven wrong here, but for now I'll assume
> that.
>
> I don't think we have many other singleton typeclasses like Nullable that
> we want to apply to single primitive types. In fact, I can't think of any
> others that would be useful. Given that, we're only multiplying the number
> of primitive types by two, we're not at risk of exploding the number of
> primitive types, and we're probably greatly simplifying the template
> implementations of container templates like Array.
>
> If you can think of other useful primitive templates, or if you can
> demonstrate that Array<Nullable<T>> is simple in all languages, I would
> change my position on this.
>
> On Fri, Feb 26, 2016 at 11:06 AM Wes McKinney <w...@cloudera.com> wrote:
>
>> To paraphrase a great poet, "mo' templates, mo' problems". I agree
>> that some theoretical benefits may be reaped in exchange for
>> significantly higher code complexity / likely lower productivity for
>> both developers and users of the library. We would need to see a
>> pragmatic argument for why the whole library should be made much more
>> complex in exchange for compile-time benefits in a small portion of
>> the code.
>>
>> Probably the biggest issue I see with this is the combinatorial
>> explosion of generated code. For example, let's consider the array
>> function Take(T, Integer) -> T (for example, numpy.ndarray.take). If
>> you introduce nullable types, rather than generating one variant for
>> each type T and integer type, you need 4:
>>
>> Take(T, Int) -> T
>> Take(T, NullableInt) -> NullableT (indices have nulls) or T (indices
>> have no nulls)
>> Take(NullableT, Int) -> NullableT
>> Take(NullableT, NullableInt) -> NullableT
>>
>> If you add to this the fact that any nullable index type may not have
>> any nulls, you actually have more than 4 branches of logic to
>> consider.
>>
>> In Java, this would be less of a concern, because all functions are
>> effectively virtual (dynamic dispatch overhead is something the JIT
>> largely takes care of), but in C++ using virtual functions to make the
>> arrays more "dynamic" (i.e. using NullableT or T in the same code
>> path) would not yield acceptable performance.
>>
>> thanks,
>> Wes
>>
>> On Fri, Feb 26, 2016 at 6:01 AM, Daniel Robinson
>> <danrobinson...@gmail.com> wrote:
>> > In C++ at least, I think making Nullable<T> a template (see here:
>> > https://github.com/danrobinson/arrow-demo/blob/master/types.h) would
>> > make it easier both to define arbitrary classes and to write parsers
>> > that take advantage of template specialization.
>> >
>> > Re: 2 vs 3 code paths: handling the null-skipping case can be a
>> > single line of code at the start of the function (which could be
>> > reduced to a macro):
>> >
>> > if (arr.null_count() == 0) return ALGORITHM_NAME(arr.child_array());
>> >
>> > And it seems like good practice anyway to separate the null_count == 0
>> > code into a separate function.
>> >
>> > --
> --
> Cheers,
> Leif
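P.S. To make the null-skipping dispatch Daniel describes concrete, it
might look roughly like this (a sketch only; the array types and the Sum
function here are hypothetical, not part of any existing API):

#include <cstddef>
#include <cstdint>
#include <vector>

struct Int32Array {
  std::vector<int32_t> values;
};

struct NullableInt32Array {
  std::vector<bool> isnull;  // true where a slot is null
  Int32Array child;          // the underlying values
  int64_t null_count = 0;
};

// Fast path: no per-element null checks.
int64_t Sum(const Int32Array& arr) {
  int64_t total = 0;
  for (int32_t v : arr.values) total += v;
  return total;
}

// Nullable path: one branch up front dispatches to the fast path when
// the array happens to contain no nulls at runtime.
int64_t Sum(const NullableInt32Array& arr) {
  if (arr.null_count == 0) return Sum(arr.child);
  int64_t total = 0;
  for (std::size_t i = 0; i < arr.child.values.size(); ++i) {
    if (!arr.isnull[i]) total += arr.child.values[i];
  }
  return total;
}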