I meant "We probably don't want std::vector<std::pair<boolean, int_t>>"

On Fri, Feb 26, 2016 at 10:50 PM Leif Walsh <leif.wa...@gmail.com> wrote:

> In the abstract (since I haven't written any code), let me see if I can
> make an argument for considering "nullable int" and "int" to both be
> worthwhile "primitive" types, as opposed to "Nullable<int>" being a
> constructed type over the primitive type "int", in the C++ arena.
>
> Let's assume Arrow's use case is to manage arrays of numbers, i.e.
> Array<number_t>.  We have two choices for nullability:
>
> Array<nullable_int_t>
> Array<Nullable<int_t>>
>
> I think what we want from the data structure is for an array of nullable
> ints (ignoring special cases like an array of nullable ints where none of
> the ints, at runtime, happens to be null) to be laid out in memory as
> std::pair<std::vector<bool>, std::vector<int_t>>.  We probably don't want
> std::vector<std::pair<boolean, int_t>, because cpus are a Thing (let me
> know if this shorthand doesn't make sense, I can elaborate).
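
The two layouts contrasted above can be sketched as follows. This is only an illustration, not code from Arrow: `int_t` is assumed to be a 32-bit integer, and `ColumnarLayout`, `push`, `push_null`, and `sum_valid` are hypothetical names introduced for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumption: the thread's int_t is taken to be a 32-bit integer here.
using int_t = std::int32_t;

// Interleaved layout: every value carries its own flag, so each element
// is larger than the value alone and the values are not contiguous.
using InterleavedLayout = std::vector<std::pair<bool, int_t>>;
static_assert(sizeof(std::pair<bool, int_t>) > sizeof(int_t),
              "the flag (plus any padding) is stored next to every value");

// Split layout: validity flags and values live in separate buffers.
// (Hypothetical type, named only for this sketch.)
struct ColumnarLayout {
  std::vector<bool> valid;    // one flag per slot, typically bit-packed
  std::vector<int_t> values;  // contiguous values; null slots hold a placeholder

  void push(int_t v) {
    valid.push_back(true);
    values.push_back(v);
  }
  void push_null() {
    valid.push_back(false);
    values.push_back(0);  // placeholder, never read as data
  }
};

// Sums only the slots whose validity flag is set.
inline std::int64_t sum_valid(const ColumnarLayout& a) {
  std::int64_t total = 0;
  for (std::size_t i = 0; i < a.values.size(); ++i) {
    if (a.valid[i]) total += a.values[i];
  }
  return total;
}
```

With the split layout, a loop that only needs the values reads one dense buffer, which is presumably what the "cpus are a Thing" shorthand refers to (cache lines and vectorization).
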
>
> If we define separate nullable primitive types and non-nullable primitive
> types, then we can template specialize for the nullable types and factor
> out the null bits into their own array.  If we require that Nullable be its
> own template, it's also possible to specialize the Array template for
> Array<Nullable<T>> but I think the template code becomes a lot more
> complex.  I'd be happy to be proven wrong here, but for now I'll assume
> that.
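
For reference, a minimal sketch of what specializing `Array` for `Array<Nullable<T>>` could look like. Everything here is hypothetical: `Nullable`, `Array`, and their members are illustrative names, not the Arrow API or the code in the linked demo. The sketch does not settle whether this approach stays simple for a full library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tag type marking an element type as nullable.
template <typename T>
struct Nullable {
  using value_type = T;
};

// Primary template: a plain array with no null tracking.
template <typename T>
class Array {
 public:
  void push(T v) { values_.push_back(v); }
  std::size_t size() const { return values_.size(); }
  std::size_t null_count() const { return 0; }
  bool is_null(std::size_t) const { return false; }
  T value(std::size_t i) const { return values_[i]; }

 private:
  std::vector<T> values_;
};

// Partial specialization: the null bits are factored out into their own
// buffer, separate from the values.
template <typename T>
class Array<Nullable<T>> {
 public:
  void push(T v) {
    valid_.push_back(true);
    values_.push_back(v);
  }
  void push_null() {
    valid_.push_back(false);
    values_.push_back(T{});  // placeholder for the null slot
    ++null_count_;
  }
  std::size_t size() const { return values_.size(); }
  std::size_t null_count() const { return null_count_; }
  bool is_null(std::size_t i) const { return !valid_[i]; }
  T value(std::size_t i) const { return values_[i]; }

 private:
  std::vector<bool> valid_;
  std::vector<T> values_;
  std::size_t null_count_ = 0;
};
```

The alternative discussed above, separate `nullable_int_t` and `int_t` primitives, would use a full specialization per nullable type instead of one partial specialization over `Nullable<T>`.
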
>
> I don't think we have many other singleton typeclasses like Nullable that
> we want to apply to single primitive types.  In fact, I can't think of any
> others that would be useful.  Given that, we're only multiplying the number
> of primitive types by two, we're not at risk of exploding the number of
> primitive types, and we're probably greatly simplifying the template
> implementations of container templates like Array.
>
> If you can think of other useful primitive templates, or if you can
> demonstrate that Array<Nullable<T>> is simple in all languages, I would
> change my position on this.
>
> On Fri, Feb 26, 2016 at 11:06 AM Wes McKinney <w...@cloudera.com> wrote:
>
>> To paraphrase a great poet, "mo' templates, mo' problems". I agree
>> that some theoretical benefits may be reaped in exchange for
>> significantly higher code complexity / likely lower productivity for
>> both developers and users of the library. We would need to see
>> a pragmatic argument why the whole library should be made much more
>> complex in exchange for compile-time benefits in a small portion of the
>> code.
>>
>> Probably the biggest issue I see with this is the combinatorial
>> explosion of generated code. For example, let's consider the array
>> function Take(T, Integer) -> T (for example, numpy.ndarray.take). If
>> you introduce nullable types, rather than generating one variant for
>> each type T and integer type, you need 4:
>>
>> Take(T, Int) -> T
>> Take(T, NullableInt) -> NullableT (indices have nulls) or T (indices
>> have no nulls)
>> Take(NullableT, Int) -> NullableT
>> Take(NullableT, NullableInt) -> NullableT
>>
>> If you add to this the fact that any nullable index type may not have
>> any nulls, you actually have more than 4 branches of logic to
>> consider.
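
The four variants listed above can be written out as overloads to make the duplication concrete. This is a sketch under simplifying assumptions: `IntArray` and `NullableIntArray` are hypothetical stand-ins for T/Int and NullableT/NullableInt, both fixed to `int`, and indices are assumed to be in range.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the non-nullable and nullable array types.
struct IntArray {
  std::vector<int> values;
};
struct NullableIntArray {
  std::vector<bool> valid;
  std::vector<int> values;  // null slots hold a placeholder
};

// Take(T, Int) -> T
inline IntArray Take(const IntArray& data, const IntArray& idx) {
  IntArray out;
  for (int i : idx.values) {
    out.values.push_back(data.values[static_cast<std::size_t>(i)]);
  }
  return out;
}

// Take(T, NullableInt) -> NullableT: a null index yields a null result.
inline NullableIntArray Take(const IntArray& data, const NullableIntArray& idx) {
  NullableIntArray out;
  for (std::size_t k = 0; k < idx.values.size(); ++k) {
    const bool ok = idx.valid[k];
    out.valid.push_back(ok);
    out.values.push_back(
        ok ? data.values[static_cast<std::size_t>(idx.values[k])] : 0);
  }
  return out;
}

// Take(NullableT, Int) -> NullableT: nulls in the data carry through.
inline NullableIntArray Take(const NullableIntArray& data, const IntArray& idx) {
  NullableIntArray out;
  for (int i : idx.values) {
    const std::size_t j = static_cast<std::size_t>(i);
    out.valid.push_back(data.valid[j]);
    out.values.push_back(data.values[j]);
  }
  return out;
}

// Take(NullableT, NullableInt) -> NullableT: null if either side is null.
inline NullableIntArray Take(const NullableIntArray& data,
                             const NullableIntArray& idx) {
  NullableIntArray out;
  for (std::size_t k = 0; k < idx.values.size(); ++k) {
    const bool ok =
        idx.valid[k] && data.valid[static_cast<std::size_t>(idx.values[k])];
    out.valid.push_back(ok);
    out.values.push_back(
        ok ? data.values[static_cast<std::size_t>(idx.values[k])] : 0);
  }
  return out;
}
```

Generalizing over element types and index widths multiplies each of these bodies again, and a "nullable but contains no nulls" fast path would add further branches inside the three nullable overloads.
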
>>
>> In Java, this would be less of a concern, because all functions are
>> effectively virtual (dynamic dispatch overhead is something the JIT
>> largely takes care of), but in C++ using virtual functions to make the
>> arrays more "dynamic" (i.e. using NullableT or T in the same code
>> path) would not yield acceptable performance.
>>
>> thanks,
>> Wes
>>
>> On Fri, Feb 26, 2016 at 6:01 AM, Daniel Robinson
>> <danrobinson...@gmail.com> wrote:
>> > In C++ at least, I think making Nullable<T> a template (see here:
>> > https://github.com/danrobinson/arrow-demo/blob/master/types.h) would make
>> > it easier to both define arbitrary classes and to write parsers that take
>> > advantage of template specialization.
>> >
>> > Re: 2 vs 3 code paths: handling the null-skipping case can be a single line
>> > of code at the start of the function (which could be reduced to a macro):
>> >
>> > if (arr.null_count() == 0) return ALGORITHM_NAME(arr.child_array());
>> >
>> > And it seems like good practice anyway to separate the null_count=0 code
>> > into a separate function.
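
A minimal sketch of that one-line dispatch, using a sum as the example algorithm. `null_count()` and `child_array()` are the names used in the line above; the classes and the `Sum` functions are hypothetical and are not taken from the linked demo.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical non-nullable array.
struct IntArray {
  std::vector<int> values;
};

// Hypothetical nullable array wrapping a child array plus validity flags.
class NullableIntArray {
 public:
  NullableIntArray(std::vector<bool> valid, std::vector<int> values)
      : valid_(std::move(valid)),
        child_{std::move(values)},
        null_count_(static_cast<std::size_t>(
            std::count(valid_.begin(), valid_.end(), false))) {}

  std::size_t null_count() const { return null_count_; }
  const IntArray& child_array() const { return child_; }
  bool is_null(std::size_t i) const { return !valid_[i]; }
  std::size_t size() const { return child_.values.size(); }

 private:
  std::vector<bool> valid_;
  IntArray child_;
  std::size_t null_count_;
};

// Null-free code path: no per-element validity checks.
inline long long Sum(const IntArray& arr) {
  return std::accumulate(arr.values.begin(), arr.values.end(), 0LL);
}

// Nullable code path: dispatch to the null-free version when possible.
inline long long Sum(const NullableIntArray& arr) {
  if (arr.null_count() == 0) return Sum(arr.child_array());
  long long total = 0;
  for (std::size_t i = 0; i < arr.size(); ++i) {
    if (!arr.is_null(i)) total += arr.child_array().values[i];
  }
  return total;
}
```

The check costs one comparison per call, so the null-free loop is reused without duplicating its body.
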
>> >
>>
> --
> --
> Cheers,
> Leif

-- 
-- 
Cheers,
Leif
