I meant "We probably don't want std::vector<std::pair<boolean, int_t>>"
On Fri, Feb 26, 2016 at 10:50 PM Leif Walsh <leif.wa...@gmail.com> wrote: > In the abstract (since I haven't written any code), let me see if I can > make an argument for considering "nullable int" and "int" to both be > worthwhile "primitive" types, as opposed to "Nullable<int>" being a > constructed type over the primitive type "int", in the C++ arena. > > Let's assume Arrow's use case is to manage arrays of numbers, i.e. > Array<number_t>. We have two choices for nullability: > > Array<nullable_int_t> > Array<Nullable<int_t>> > > I think what we want from the data structure is for an array of nullable > ints (ignoring special cases like an array of nullable ints where none of > the ints, at runtime, happens to be null) to be laid out in memory as > std::pair<std::vector<bool>, std::vector<int_t>>. We probably don't want > std::vector<std::pair<boolean, int_t>, because cpus are a Thing (let me > know if this shorthand doesn't make sense, I can elaborate). > > If we define separate nullable primitive types and non-nullable primitive > types, then we can template specialize for the nullable types and factor > out the null bits into their own array. If we require than Nullable be its > own template, it's also possible to specialize the Array template for > Array<Nullable<T>> but I think the template code becomes a lot more > complex. I'd be happy to be proven wrong here, but for now I'll assume > that. > > I don't think we have many other singleton typeclasses like Nullable that > we want to apply to single primitive types. In fact, I can't think of any > others that would be useful. Given that, we're only multiplying the number > of primitive types by two, we're not at risk of exploding the number of > primitive types, and we're probably greatly simplifying the template > implementations of container templates like Array. > > If you can think of other useful primitive templates, or if you can > demonstrate that Array<Nullable<T>> is simple in all languages, I would > change my position on this. > > On Fri, Feb 26, 2016 at 11:06 AM Wes McKinney <w...@cloudera.com> wrote: > >> To paraphrase a great poet, "mo' templates, 'mo problems". I agree >> that some theoretical benefits may be reaped in exchange for >> significantly higher code complexity / likely lower productivity for >> both developers and users of the library. We would need to see >> pragmatic argument why the whole library should be made much more >> complex in exchange compile-time benefits in a small portion of the >> code. >> >> Probably the biggest issue I see with this is the combinatorial >> explosion of generated code. For example, let's consider the array >> function Take(T, Integer) -> T (for example, numpy.ndarray.take). If >> you introduce nullable types, rather than generating one variant for >> each type T and integer type, you need 4: >> >> Take(T, Int) -> T >> Take(T, NullableInt) -> NullableT (indices have nulls) or T (indices >> have no nulls) >> Take(NullableT, Int) -> NullableT >> Take(NullableT, NullableInt) -> NullableT >> >> If you add to this the fact that any nullable index type may not have >> any nulls, you actually have more than 4 branches of logic to >> consider. >> >> In Java, this would be less of a concern, because all functions are >> effectively virtual (dynamic dispatch overhead is something the JIT >> largely takes care of), but in C++ using virtual functions to make the >> arrays more "dynamic" (i.e. using NullableT or T in the same code >> path) would not yield acceptable performance. >> >> thanks, >> Wes >> >> On Fri, Feb 26, 2016 at 6:01 AM, Daniel Robinson >> <danrobinson...@gmail.com> wrote: >> > In C++ at least, I think making Nullable<T> a template (see here: >> > https://github.com/danrobinson/arrow-demo/blob/master/types.h) would >> make >> > it easier to both define arbitrary classes and to write parsers that >> take >> > advantage of template specialization. >> > >> > Re: 2 vs 3 code paths: handling the null-skipping case can be a single >> line >> > of code at the start of the function (which could be reduced to a >> macro): >> > >> > if (arr.null_count() == 0) return ALGORITHM_NAME(arr.child_array()); >> > >> > And it seems like good practice anyway to separate the null_count=0 code >> > into a separate function. >> > >> > -- > -- > Cheers, > Leif -- -- Cheers, Leif