In C++ at least, I think making Nullable<T> a template (see here:
https://github.com/danrobinson/arrow-demo/blob/master/types.h) would make
it easier to both define arbitrary classes and to write parsers that take
advantage of template specialization.

Re: 2 vs 3 code paths: handling the null-skipping case can be a single line
of code at the start of the function (which could be reduced to a macro):

if (arr.null_count() == 0) return ALGORITHM_NAME(arr.child_array());

And it seems like good practice anyway to separate the null_count=0 code
into a separate function.


On Thu, Feb 25, 2016 at 11:25 PM, Wes McKinney <w...@cloudera.com> wrote:

> Inline
>
> On Thu, Feb 25, 2016 at 7:54 PM, Daniel Robinson
> <danrobinson...@gmail.com> wrote:
> > Thanks; specific responses inline below.
> >
> > I'm mostly persuaded, though—I never doubted the C++ code will end up
> > perfectly optimized. My main interest in this is the type definitions at
> > the metadata level, for which I really do think Nullable<T> should be
> > considered a nested type like List<T> or Struct<T, U>.
> >
>
> There is little difference in fundamental semantics and no difference
> in memory layout, so it would probably be better in the main reference
> implementations to choose the more pragmatic option. It's worth noting
> that in Parquet metadata, variable-length collections are indeed
> nested types. For example, ?List<?int32> could be represented in
> Parquet metadata as (so-called 3-level array coding):
>
> optional group bag
>   repeated group list
>     optional int32 item
>
> I haven't been far enough down the path of using Arrow / ValueVectors
> as the Apache Drill team has, so I'd be interested in hearing more of
> their perspective on this.
>
> We have not yet started coming up with the canonical
> (cross-implementation) metadata for Arrow IPC / RPC -- I think that
> would be helpful to have alongside container classes to use for
> transporting and manipulating the physical data.
>
> >
> > On Thu, Feb 25, 2016 at 9:17 PM, Wes McKinney <w...@cloudera.com> wrote:
> >
> >> Inline responses
> >>
> >> On Thu, Feb 25, 2016 at 4:23 PM, Daniel Robinson
> >> <danrobinson...@gmail.com> wrote:
> >> > Hi Wes,
> >> >
> >> > Thanks for the response!
> >> >
> >> > I see the appeal of representing all arrays at the data structure
> level
> >> as
> >> > "nullable," but I think I've convinced myself that it's better to
> >> represent
> >> > them all as non-nullable, and bolt on "nullability" as a wrapper
> type. In
> >> > addition to the reasons I mentioned yesterday (such as localizing
> logic
> >> for
> >> > dealing with null values in its own class), I think there's a pretty
> >> strong
> >> > efficiency case for it.  If you don't distinguish between nullable and
> >> > non-nullable data structures at the data structure level, you're
> >> > essentially hiding useful optimization information from the compiler.
> >> >
> >>
> >> I question whether this optimization information is useful given many
> >> of the target workloads (e.g. scan-based analytics). On one code
> >> branch (if null_count > 0), a bitmap is examined either 1-bit at a
> >> time or using popcount (1 word's worth of values at a time -- this
> >> could improve CPU instruction pipelining), in the other branch the
> >> bitmap is ignored.
> >>
> >>
> > Makes sense.
> >
> >> For example, arrays should probably have a simple API for getting a
> value
> >> > from a specified slot number, something like get(int32_t offset). This
> >> > makes it easier to write algorithms that don't have to worry about the
> >> > underlying memory structure of the array. If all arrays at the data
> >> > structure level are considered "nullable", this code would have to do
> an
> >> > "if (null_count == 0)" check every time get() is called, which is
> >> > inefficient and somewhat inelegant.
> >> >
> >>
> >> I have been expecting that any user of an Arrow C/C++ library would be
> >> expected to understand the physical memory layout; with functions to
> >> access values made available as a convenience. So I would not want you
> >> to call get(i) unless you know that it is not null and not out of
> >> bounds.
> >>
> >
> > I included a safe_get function that assumes the value is within bounds,
> as
> > I discuss below. But I see the case for making the main get() API assume
> > that the value is neither null nor out of bounds.
> >
> >
> >
> >>
> >> > I made a demo showing how I'd envision implementing nullable arrays in
> >> C++
> >> > with nested types, and how algorithms on them would work:
> >> > https://github.com/danrobinson/arrow-demo
> >> >
> >> > Right now it's just nullable and primitive types and arrays, but I'll
> try
> >> > adding list types and builders to see how it compares to the
> >> > string-handling code you linked to.
> >> >
> >> > One key advantage I see of making Nullable<T> its own nested type is
> that
> >> > the algorithm code can make full use of C++'s template specialization.
> >> For
> >> > example, here's how I define the versions of the Mean algorithm that
> work
> >> > on nullable arrays and those that work on non-nullable arrays (see
> >> > https://github.com/danrobinson/arrow-demo/blob/master/algorithms.h):
> >> >
> >> > // happens when you call Mean() on a non-nullable array
> >> > template<typename T>
> >> > double Mean(Array<T>& arr) {
> >> >   return (double)Sum(arr) / (arr.length());
> >> > }
> >> >
> >> > // happens when you call Mean() on a nullable array
> >> > template<typename T>
> >> > double Mean(Array< Nullable<T> >& nullableArr) {
> >> >   return (double)Sum(nullableArr) / (nullableArr.length() -
> >> > nullableArr.null_count());
> >> > }
> >> >
> >> > This is only possible if nullability is included in the
> >> > data-structure-level type information.
> >> >
> >> > As you suggested, the code also uses a shortcut when Sum() is called
> on a
> >> > "nullable" array that has null_count of 0: it just calls Sum() on its
> >> child
> >> > array (see line 24). I think this, too, is simpler if we aren't
> cramming
> >> > nullability logic into every array class.
> >> >
> >> > There are a few other tricks that could be used whether or not you
> treat
> >> > Nullable as a separate type. For example, we can implement an
> overloaded
> >> > get() function that takes, in addition to a slot offset, a default
> value
> >> > that should be returned if that slot is null (see
> >> > https://github.com/danrobinson/arrow-demo/blob/master/arrays.h#L114).
> >> This
> >> > let me write the main code for the Sum() algorithm only once, since
> for
> >> its
> >> > purposes, 0 is equivalent to null. (It would also presumably work with
> >> > Product() and more importantly sorting).
> >> >
> >>
> >> Cool, thanks for making a prototype! Working code is worth 1000 words =)
> >>
> >> I'm not convinced by the performance argument. For example, this
> >> snippet has multiple if-then-else branches (which could possibly throw
> >> exceptions) and so would not be advisable to put in the inner loop of
> >> algorithms:
> >>
> >> value_type get(int32_t slotNumber, value_type null_value) {
> >>   check_bounds(slotNumber);
> >>   return safe_get(slotNumber, null_value);
> >> }
> >>
> >> "in time critical code, throwing an exception should be the exception,
> >> not the rule."
> >> http://www.boost.org/community/error_handling.html
> >>
> >>
> > As I mentioned, I included a safe_get() function that doesn't do a bounds
> > check, and that lets you pass a default value to be returned if it is
> null.
> > If inner loops use that, there is only one branch, the null check, and no
> > exceptions. (My Sum() code stupidly didn't use it; I've fixed it now.)
> If
> > they need more complex null-handling logic they could call isNull() on
> the
> > nullable array then call get() on the child array.
> >
> > But I appreciate your point that the most optimized algorithms would need
> > to understand the memory at a low level. Though I wonder how easy it will
> > be to specify such low-level algorithms for recursive algorithms on
> > arbitrarily nested data.
> >
> >> Again, would appreciate any thoughts.
> >> >
> >> > I definitely agree that the interface for users should be the priority
> >> > (after the memory format). My own motivation for thinking about this
> is
> >> my
> >> > interest in making easy APIs for creating or accessing these memory
> >> > structures in higher-level dynamically typed languages.
> >> >
> >>
> >> I agree that making Arrow data structures accessible and easy-to-use
> >> in dynamic languages is desirable. I don't think that adding such
> >> overhead to reference C++ algorithms (which should probably be as
> >> close-to-the-metal and performance-oriented at possible, and expect
> >> users to use them responsibly) is a good idea, since usability /
> >> accessibility generally comes at the expense of CPU cycles and cache
> >> efficiency.
> >>
> >> For example, it would be straightforward to build "airbags", so to
> >> speak, in a Python wrapper C extension, since performance is less
> >> critical in user APIs oriented at accessibility and safety /
> >> error-handling. In dynamic languages this is especially important as
> >> bugs in e.g. Python user code should never result in segfaults/core
> >> dumps.
> >>
> >>
> > I completely agree with this; didn't mean to suggest there should be
> > runtime overhead in the C++ functions that are used in algorithms.  (The
> > exceptions in the functions were meant never to be thrown, but I see the
> > case for leaving them out entirely.)
> >
> > My main concern was avoiding such unnecessary overhead (such as null
> checks
> > in a type that shouldn't be nullable). I think there's no downside to
> > having *shortcuts *at *compile time*—and, almost as significantly,
> simpler
> > and safer code—by encoding some information about nullability in the C++
> > type information. But I appreciate that optimized code would be able to
> > avoid that.
>
> If you have both Nullable<T> and T, since the code for Nullable<T>
> (with null count == 0) and T will be identical algorithmically, you
> have strictly more code to write and maintain (3 code paths instead of
> 2) for every algorithm where skipping null checks improves
> performance.
>
> I would also be very worried about code like this appearing:
>
> Array* CallFOnFoos(Array* arr) {
>   if (arr->nullable()) {
> return F(static_cast<NullableFooArray*>(arr));
>   } else {
> return F(static_cast<FooArray*>(arr));
>   }
> }
>
> My view is that nullability is a metadata concern and mainly serves to
> add complexity to algorithms / analytics (because of the computational
> equivalence with nullable data without any actual nulls) and C++
> development overhead (a more complex type hierarchy to reason about).
>
> (In a FP language like Haskell, such type systems do afford more
> compile-time safety which make them worthwhile, but this isn't as true
> in C++)
>
> - Wes
>
> >
> >
> >
> >> cheers,
> >> Wes
> >>
> >> > On Wed, Feb 24, 2016 at 6:36 PM, Wes McKinney <w...@cloudera.com>
> wrote:
> >> >
> >> >> hi Dan,
> >> >>
> >> >> Thanks for these thoughts! There's a few separate problems
> >> >>
> >> >> - Physical memory representation
> >> >> - Metadata
> >> >> - Implementation container type layering / user API
> >> >>
> >> >> We aren't expecting any system that uses Arrow to necessarily use one
> >> >> of the reference Arrow implementations, but I expect an ecosystem of
> >> >> tools will develop around the reference implementations (e.g. Parquet
> >> >> or Avro adapters).
> >> >>
> >> >> In the thread I started last week, I proposed striking this
> >> >> distinction and instead relegating REQUIRED (non-nullable) and
> >> >> OPTIONAL (nullable) to the domain of metadata. In my opinion, having
> >> >> nullability in the container types at all adds much code complexity,
> >> >> and the two cases of
> >> >>
> >> >> - null count == 0
> >> >> - non-nullable
> >> >>
> >> >> Must be treated as distinct, but in general algorithmically
> equivalent
> >> >> (e.g. the bitmap need not be allocated, and would be ignored even if
> >> >> so in most analytics code requiring knowledge of null-ness).
> >> >>
> >> >> In trying to come up with functional type systems for representing
> the
> >> >> semantics of the data, I find it is most useful to start from the
> >> >> desired user API (which sufficiently captures the nuances of the
> >> >> physical memory layout) and work backwards to an implementation.
> >> >>
> >> >> I would be interested to see some rough C++ / pseudocode to help
> >> >> illustrate how this would play out in downstream user code using
> >> >> libarrow and its headers, and internal algorithms in that operate on
> >> >> arrow Array instances. I've seen many functional programmers work on
> >> >> these data structure + type system problems and end up chasing their
> >> >> tails for a very long time. On the C++ side, that code you see in
> >> >> apache/arrow is a limited prototype that I haven't used to build any
> >> >> applications yet, so it'd be great to iterate on some working
> >> >> prototypes of other layerings. For example, we could work on:
> >> >>
> >> >>
> >> >>
> >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/types/string-test.cc#L199
> >> >>
> >> >> As pseudocode, suppose we had a function:
> >> >>
> >> >> double Sum(DoubleArray* arr) {
> >> >>   double total = 0;
> >> >>   double* values = arr->values();
> >> >>   if (arr->null_count() == 0) {
> >> >>     for (int i = 0; i < arr->length(); ++i) {
> >> >>       total += values[i];
> >> >>     }
> >> >>   } else {
> >> >>     for (int i = 0; i < arr->length(); ++i) {
> >> >>       if (arr->NotNull(i)) {
> >> >>         total += values[i];
> >> >>       }
> >> >>     }
> >> >>   }
> >> >>   return total;
> >> >> }
> >> >>
> >> >> If you had instead NullableDoubleArray and DoubleArray, you might
> want
> >> >> to generate different functions. This would likely be true for any
> >> >> algorithm where nulls are semantically meaningful.
> >> >>
> >> >> For me at least, the most important aspect of Arrow is the memory
> >> >> layout. The purpose of the implementations is to provide a reference
> >> >> user API to make it as convenient and fast as possible to create
> >> >> Arrow-layout data, manipulate data in-memory, and send/receive data
> >> >> via IPC/shared-memory.
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Wed, Feb 24, 2016 at 2:18 PM, Daniel Robinson
> >> >> <danrobinson...@gmail.com> wrote:
> >> >> > (The below mostly uses terminology and examples from the draft spec
> >> >> > <https://github.com/apache/arrow/blob/master/format/Layout.md> and
> >> C++
> >> >> > implementation <https://github.com/apache/arrow/tree/master/cpp>).
> >> >> >
> >> >> > Under the current spec, if I understand it correctly, there are two
> >> >> > versions of every type: a nullable version and a non-nullable
> version.
> >> >> > Whether a type T is nullable is specified by an attribute on that
> >> type,
> >> >> so
> >> >> > declarations look something like T [nullable]:
> >> >> >
> >> >> > Int32 [non-nullable]
> >> >> >
> >> >> > Int32 [nullable]
> >> >> >
> >> >> >
> >> >> > In contrast, a variable-length version of type T is declared using
> a
> >> >> nested
> >> >> > or parametric type, List<T>.
> >> >> >
> >> >> >
> >> >> > List [non-nullable] <Int32 [non-nullable]>
> >> >> > List [nullable] <Int32 [non-nullable]>
> >> >> > List [non-nullable] <Int32 [nullable]>
> >> >> > List [nullable] <Int32 [nullable] >
> >> >> >
> >> >> > Consistent with this, might it make sense to use a nested type,
> >> >> > *Nullable<T>*, to declare a nullable version of that type? A slot
> in
> >> an
> >> >> > array of type Nullable<T> would be able to contain either a null
> >> value or
> >> >> > any value of type T. Using this system, the above examples would
> look
> >> >> like:
> >> >> >
> >> >> > Int32
> >> >> > Nullable<Int32>
> >> >> > List<Int32>
> >> >> >
> >> >> > Nullable<List<Int32>>
> >> >> > List<Nullable<Int32>>
> >> >> > Nullable<List<Nullable<Int32>>>>
> >> >> >
> >> >> >
> >> >> > This could also be paralleled by a tweak to implementations (which
> >> >> wouldn't
> >> >> > change how the data is stored in physical memory). Right now, in
> the
> >> C++
> >> >> > implementation, any array might be nullable, so the base Array
> class
> >> >> > contains the logic necessary to deal with nullable arrays,
> including
> >> some
> >> >> > potentially unnecessary branching.
> >> >> >
> >> >> > Instead, much like how an array of type List<T> is instantiated as
> a
> >> >> > ListArray
> >> >> > <
> >> >>
> >>
> https://github.com/apache/arrow/blob/23c4b08d154f8079806a1f0258d7e4af17bdf5fd/cpp/src/arrow/types/list.h
> >> >> >
> >> >> >  that contains its own int32 array of offsets and a reference to a
> >> child
> >> >> > "values" array of type T
> >> >> > <https://github.com/apache/arrow/blob/master/format/Layout.md>,
> every
> >> >> array
> >> >> > of type Nullable<T> could be instantiated as a NullableArray
> >> containing
> >> >> its
> >> >> > own bitmask and a reference to a child array of type T. (It could
> also
> >> >> keep
> >> >> > track of the null_count, if that idea is implemented.)
> >> >> >
> >> >> > The child array would not have to know that it is wrapped by a
> >> >> > NullableArray, and we could abstract almost all of the logic for
> >> dealing
> >> >> > with null values out of the Array and ArrayBuilder classes.
> >> (ArrayBuilder
> >> >> > would need a method for incrementing an array's length without
> adding
> >> a
> >> >> > particular value, but it already needs that to build a child array
> in
> >> a
> >> >> Sparse
> >> >> > Union
> >> >> > <
> >> >>
> >>
> https://github.com/apache/arrow/blob/master/format/diagrams/layout-sparse-union.png
> >> >> >
> >> >> > ).
> >> >> >
> >> >> > I see a few potential additional advantages to this approach:
> >> >> >
> >> >> >    - It would effectively halve the number of distinct types.
> >> >> >    - As mentioned above, it would be consistent with how the Arrow
> >> spec
> >> >> >    treats List types and other nested types. Treating "nullability"
> >> as a
> >> >> >    special attribute on each type seems more consistent with how
> >> >> something
> >> >> >    like Dremel treats both "optional" and "repeated" fields.
> >> >> >    - It would also be consistent with how "option types
> >> >> >    <https://en.wikipedia.org/wiki/Option_type>" are conventionally
> >> >> declared
> >> >> >    in other languages. For example, C# has Nullable<T>
> >> >> >    <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Julia
> >> has
> >> >> >    Nullable{T}
> >> >> >    <
> >> >>
> >>
> http://docs.julialang.org/en/release-0.4/manual/types/#nullable-types-representing-missing-values
> >> >> >,
> >> >> > Java
> >> >> >    has Optional<T>
> >> >> >    <
> https://docs.oracle.com/javase/8/docs/api/java/util/Optional.html
> >> >,
> >> >> > and Haskell
> >> >> >    has Maybe a
> >> >> >    <
> >> >>
> https://hackage.haskell.org/package/base-4.8.2.0/docs/Data-Maybe.html>.
> >> >> >    - The implementation would be similar to a special case of
> >> >> SparseUnion,
> >> >> >    which makes sense, since type Nullable<T> is equivalent to T ∪
> >> >> NullType,
> >> >> >    where NullType is a type that can only hold the null value.
> >> >> >    - If the spec ever adds a Masking<T> type that allows a
> selection
> >> of a
> >> >> >    vector using a bitmask, I imagine it, too, would be implemented
> >> >> similarly.
> >> >> >    - Parsing a type declaration as a tree would be simpler.
> >> >> >    - Declarations of non-nullable types would be shorter, and
> >> >> non-nullable
> >> >> >    types would be the default. I think this makes sense, since it
> >> means
> >> >> types
> >> >> >    such as Int32 correspond exactly to their C-type, and people
> would
> >> be
> >> >> less
> >> >> >    likely to be surprised by unexpected null values at runtime.
> >> >> Additionally,
> >> >> >    given the overhead (in speed, memory, complexity, and safety) of
> >> >> nullable
> >> >> >    types, I think programmers should have to explicitly declare
> them.
> >> >> >
> >> >> > I'd appreciate any thoughts, especially from those with more
> >> involvement
> >> >> in
> >> >> > the project or experience with related projects.
> >> >> >
> >> >> > One more thought (which could apply even if nullability is kept as
> a
> >> >> > special attribute). To make the string representation of nullable
> >> types
> >> >> > more readable, it might be worth considering a convention from C#
> >> >> > <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Swift
> >> >> > <
> >> http://lithium3141.com/blog/2014/06/19/learning-swift-optional-types/>,
> >> >> > Dremel
> >> >> > <
> >> >>
> >>
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf
> >> >> >,
> >> >> > and Flow <http://flowtype.org/docs/nullable-types.html>: using
> either
> >> >> T? or
> >> >> > ?T to mean Nullable<T>.
> >> >> >
> >> >> > Int32
> >> >> > ?Int32
> >> >> > List<Int32>
> >> >> > ?List<Int32>
> >> >> > List<?Int32>
> >> >> > ?List<?Int32>
> >> >> >
> >> >> > Int32
> >> >> > Int32?
> >> >> > List<Int32>
> >> >> > List?<Int32>
> >> >> > List<Int32?>
> >> >> > List?<Int32?>
> >> >>
> >> >> My inclination is still to remove nullable from the C++ array
> >> >> container types, but to keep track of the required (non-nullable) /
> >> >> optional (nullable) bit in the type metadata so that it can be kept
> >> >> track of in IPC / schema exchange with other systems as necessary.
> >> >>
> >>
>

Reply via email to