Moving this thread over from the discussion about adding null count to the physical format.
I never said that what you're describing is an invalid approach, only that it will yield more
complexity for both library developers and users without any clear performance or net
productivity benefit. This is the kind of C++ codebase I would personally choose not to be
involved with. At this point it's fairly hypothetical; perhaps we can revisit in a few months
after Arrow gets used for some real-world applications.

It's probably a philosophical divide, but I like C++ as a tool (compared with plain old C) for
several reasons:

- High-performance C tends to encourage much more macro use (manual code generation,
basically).
- As a code generation tool, templates are saner and give better compiler errors than C
macros.
- Object-oriented programming in C requires a lot of boilerplate. In an example C codebase
(https://github.com/torch/TH) that uses an opinionated flavor of OOP, you end up with a
half-reimplementation of C++ classes!
- Memory management using RAII and smart pointers makes me personally a lot more productive,
with fewer mistakes.

That said, the Google C++ style guide advises against complicated template metaprogramming:
https://google.github.io/styleguide/cppguide.html#Template_metaprogramming

"The techniques used in template metaprogramming are often obscure to anyone but language
experts. Code that uses templates in complicated ways is often unreadable, and is hard to
debug or maintain. Template metaprogramming often leads to extremely poor compile-time error
messages: even if an interface is simple, the complicated implementation details become
visible when the user does something wrong."

The great part of Arrow is that the memory layout specification is what really matters, so
there is nothing stopping anyone from creating alternate implementations that suit their
needs. And if you need to use functions from different implementations in one application,
you can, because the memory is binary-interoperable.

My intent for the C++ codebase is to make it the fastest reference code available for these
data structures while keeping it readable and accessible for a wide variety of programmers to
contribute to, so adding template metaprogramming constructs (as opposed to using templates
primarily for code generation) might drive away certain kinds of contributors. I would like
many of the algorithms to not end up too dissimilar from the ones you would write in C.

- Wes

On Fri, Mar 4, 2016 at 6:50 AM, Daniel Robinson <danrobinson...@gmail.com> wrote:
> Wes,
>
> Thanks for soliciting so much input on these questions, and for sharing
> the new prototypes.
>
> In response to point 2 and your e-mail from last week, I created some
> prototypes to illustrate what I think could be useful about having a
> Nullable<T> template in the C++ implementation.
>
> As far as code complexity goes, I think having a Nullable<T> type might
> simplify the user interface (including the definitions of algorithms) by
> enabling more generic programming, at the cost of some template-wrestling
> on the developer side. You mentioned Take
> (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.take.html).
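>
> To make that concrete before getting to Take: the scheme I have in mind
> looks roughly like the sketch below. (This is hand-simplified, and all of
> the names are invented for illustration; the actual prototype code is in
> the repo linked below.) A bare T is statically known to be null-free,
> while Nullable<T> carries validity information, so algorithms can be
> overloaded on nullability and the null branch compiles away entirely in
> the null-free case:
>
>     #include <cstdint>
>     #include <vector>
>
>     // Tag marking an element type that may contain nulls; a bare T
>     // means "statically known null-free". (Invented names, not the
>     // prototype's actual API.)
>     template <typename T>
>     struct Nullable {};
>
>     // Primary template: a null-free typed array.
>     template <typename T>
>     struct TypedArray {
>       std::vector<T> values;
>       T Value(int64_t i) const { return values[i]; }
>       int64_t length() const { return static_cast<int64_t>(values.size()); }
>     };
>
>     // Partial specialization: an array that also carries validity info.
>     template <typename T>
>     struct TypedArray<Nullable<T>> {
>       std::vector<T> values;
>       std::vector<bool> valid;
>       T Value(int64_t i) const { return values[i]; }
>       bool IsNull(int64_t i) const { return !valid[i]; }
>       int64_t length() const { return static_cast<int64_t>(values.size()); }
>     };
>
>     // Overloading on the tag picks the right loop at compile time; the
>     // null check simply does not exist in the null-free instantiation.
>     template <typename T>
>     T Sum(const TypedArray<T>& arr) {
>       T total = T();
>       for (int64_t i = 0; i < arr.length(); ++i) total += arr.Value(i);
>       return total;
>     }
>
>     template <typename T>
>     T Sum(const TypedArray<Nullable<T>>& arr) {
>       T total = T();
>       for (int64_t i = 0; i < arr.length(); ++i)
>         if (!arr.IsNull(i)) total += arr.Value(i);
>       return total;
>     }
>
> The prototype applies the same idea to Take and map().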
> As a somewhat silly illustration, nullable typing could allow Take to be
> implemented in one line
> (https://github.com/danrobinson/arrow-demo/blob/master/take.h#L9):
>
>     return map<GetOperation>(index_array, array);
>
> Behind the scenes, map() does a runtime check that short-circuits the
> null-handling logic
> (https://github.com/danrobinson/arrow-demo/blob/master/map.h#L55-L81;
> this code seems to be irreducibly ugly, but at least it's somewhat
> generic). It then runs the array through an algorithm written in
> continuation-passing style
> (https://github.com/danrobinson/arrow-demo/blob/master/map.h#L20-L22),
> which in turn constructs an operation pipeline where each operation can
> either call "step" (to yield a value) or "skip" (to yield a null value)
> on the next operation. Thanks to the nullable type, there are two
> versions of the get operation: one that checks for nulls, and one that
> knows it doesn't have to
> (https://github.com/danrobinson/arrow-demo/blob/master/operations.h#L8-L55).
>
> I'm not actually trying to push any of these half-baked functional
> paradigms, let alone the hacky template metaprogramming tricks used to
> implement some of them in the prototype. The point I'm trying to
> illustrate is that these kinds of abstractions would be more difficult
> to implement without Nullable<T> typing, because without types, you
> can't efficiently pass the information about whether an array has nulls
> or not from function to function (and ultimately to the function that
> processes each row). (Perhaps I'm missing something!)
>
> Here's Take implemented as a single monolithic function that isn't aware
> of nullability:
> https://github.com/danrobinson/arrow-demo/blob/master/take.h. In my
> tests this is about 5-10% faster than the map() version, and I expect it
> would maintain an advantage if both were better optimized. Maybe a
> 45-line function like this is worth it for the core functions, but it
> might be useful to expose higher-order functions like map() to C++
> developers.
>
> As for performance, code generation, and static polymorphism: is the
> issue roughly that we need compiled instantiations of every function
> that might be called, with every possible type, because at compile time
> we don't know the structure of the data or what functions people may
> want to call from (say) interpreted languages? I hadn't appreciated
> that, and it does seem like a risk of using templates, but I think it
> actually increases the upside of factoring out logic into abstractions
> like map().
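>
> To spell out that last concern with a toy example (invented names again,
> not code from either repo): if the element type is only known at run
> time, because the call comes from, say, Python, then every templated
> kernel needs a pre-compiled instantiation per element type, plus a
> runtime switch to select one, and each new kernel multiplies that switch
> by the number of supported types:
>
>     #include <cstdint>
>
>     // Runtime type tag, as a binding for an interpreted language would
>     // see it. (Invented for illustration.)
>     enum class TypeId { INT32, INT64, FLOAT64 };
>
>     // A templated kernel: one compiled copy per element type.
>     template <typename T>
>     void TakeKernel(const T* values, const int32_t* indices, int64_t n,
>                     T* out) {
>       for (int64_t i = 0; i < n; ++i) out[i] = values[indices[i]];
>     }
>
>     // The dispatch a binding layer would need. Every new kernel
>     // repeats a switch like this over all supported types.
>     void Take(TypeId id, const void* values, const int32_t* indices,
>               int64_t n, void* out) {
>       switch (id) {
>         case TypeId::INT32:
>           TakeKernel(static_cast<const int32_t*>(values), indices, n,
>                      static_cast<int32_t*>(out));
>           break;
>         case TypeId::INT64:
>           TakeKernel(static_cast<const int64_t*>(values), indices, n,
>                      static_cast<int64_t*>(out));
>           break;
>         case TypeId::FLOAT64:
>           TakeKernel(static_cast<const double*>(values), indices, n,
>                      static_cast<double*>(out));
>           break;
>       }
>     }
>
> If the per-type logic is factored into small operations composed by
> map(), at least that combinatorial switch only has to be written once.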