The point I was trying to argue is that the choice of which value from a type's set of values to use as a sentinel has consequences that are worth thinking about. One consequence is that a (hypothetical) isspace function and an isnull function (called null in KDB+) would give indistinguishable results. When using KDB+ I found it disturbing that "remove all null values from a column" is synonymous with "remove all strings whose value is a single space". To actually remove spaces you'd have to convert your string column to integers and compare against the value 32 (assuming you're in ASCII or a subset of it; I'm not sure whether other encodings represent spaces differently). Even then, in KDB+, you'd also be removing nulls. This suggests to me that someone in KDB+ assumed at some point that spaces were rare enough that giving them a completely different meaning was "okay". I can tell you from working at a large advertising company that it is critical to have a reliable way to distinguish null from not null, and a space had better be not null. It's possible that my knowledge of KDB+ is out of date, in which case I'd be happy to learn that this behavior no longer exists.
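To make the distinction concrete, here is a minimal sketch in plain Python (toy lists standing in for a string column; this is not KDB+ or Arrow code, and the names are made up) showing why a space sentinel conflates the two operations while a separate validity mask keeps them apart:

SENTINEL = " "  # the value standing in for null, as in the KDB+ example

# Sentinel representation: a genuine space and a "null" share one value.
sentinel_col = ["a", " ", SENTINEL, "b"]    # index 1 is a real space, index 2 is "null"
no_nulls = [v for v in sentinel_col if v != SENTINEL]
# -> ['a', 'b']  -- the real space is removed along with the null

# Mask representation: values and validity are stored separately.
values   = ["a", " ", "", "b"]
validity = [True, True, False, True]        # False marks the null slot
no_nulls_masked = [v for v, ok in zip(values, validity) if ok]
# -> ['a', ' ', 'b']  -- the real space survives

Under the sentinel layout, "drop nulls" and "drop single-space strings" are literally the same comparison, which is exactly the behavior I found disturbing.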
On Fri, Nov 9, 2018 at 4:51 PM Matt Dowle <mattjdo...@gmail.com> wrote: > > There is one database that I'm aware of that uses sentinels _and_ > supports complex types with missing values: Kx's KDB+. > I read this and was pleased that KDB is being used as a reference. It is a > seriously good database: the gold-standard in many people's eyes. > > > This has led to some seriously strange choices like the ASCII space > character being used as the sentinel value for strings. > But then I saw this. Surely if sentinels are good enough for KDB then isn't > that a sign that sentinels are not as bad as this group fears? > > What about grouping and joining columns that contain NA? Here's an > example from R data.table : > > > DT = data.table(x=c(1,3,3,NA,1,NA), v=1:6) > > DT > x v > <num> <int> > 1: 1 1 > 2: 3 2 > 3: 3 3 > 4: NA 4 > 5: 1 5 > 6: NA 6 > > DT[,sum(v),keyby=x] > x V1 > <num> <int> > 1: NA 10 > 2: 1 6 > 3: 3 5 > > The NAs are grouped as a distinct value and are not excluded for > statistical robustness reasons. This is very easy to achieve efficiently > internally; in fact there is no special code to deal with the NA values > because they are just another distinct value (the sentinel). In Arrow if a > bitmap is present, there would be more code needed to deal with the NAs > (either way: including the NA group or excluding the NA group), if I > understand correctly. > > On Thu, Nov 8, 2018 at 3:18 PM Phillip Cloud <cpcl...@gmail.com> wrote: > > > There is one database that I'm aware of that uses sentinels _and_ > supports > > complex types with missing values: Kx's KDB+. This has led to some > > seriously strange choices like the ASCII space character being used as > the > > sentinel value for strings. See > > https://code.kx.com/wiki/Reference/Datatypes for > > more details. > > > > On Thu, Nov 8, 2018 at 4:39 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > > > hey Matt, > > > > > > Thanks for giving your perspective on the mailing list. > > > > > > My objective in writing about this recently > > > (http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/, though I > > > need to update since the sentinel case can be done more efficiently > > > than what's there now) was to help dispel the notion that using a > > > separate value (bit or byte) to encode nullness is a performance > > > compromise to comply with the requirements of database systems. I too > > > prefer real world benchmarks to microbenchmarks, and probably null > > > checking is not going to be the main driver of aggregate system > > > performance. I had heard many people over the years object to bitmaps > > > on performance grounds but without analysis to back it up. > > > > > > Some context for other readers on the mailing list: A language like R > > > is not a database and has fewer built-in scalar types: int32, double, > > > string (interned), and boolean. Out of these, int32 and double can use > > > one bit pattern for NA (null) and not lose too much. A database system > > > generally can't make that kind of compromise, and most popular > > > databases can distinguish INT32_MIN (or any other value used as a > > > sentinel) and null. 
If you loaded data from an Avro or Parquet file > > > that contained one of those values, you'd have to decide what to do > > > with the data (though I understand there are integer64 add-on packages > > > for R now) > > > > > > Now back to Arrow -- we have 3 main kinds of data types: > > > > > > * Fixed size primitive > > > * Variable size primitive (binary, utf8) > > > * Nested (list, struct, union) > > > > > > Out of these, "fixed size primitive" is the only one that can > > > generally support O(1) in-place mutation / updates, though all of them > > > could support an O(1) "make null" operation (by zeroing a bit). In > > > general, when faced with designs we have preferred choices benefiting > > > use cases where datasets are treated as immutable or copy-on-write. > > > > > > If an application _does_ need to do mutation on primitive arrays, then > > > you could choose to always allocate the validity bitmap so that it can > > > be mutated without requiring allocations to happen arbitrarily in your > > > processing workflow. But, if you have data without nulls, it is a nice > > > feature to be able to ignore the bitmap or not allocate one at all. If > > > you constructed an array from data that you know to be non-nullable, > > > some implementations might wish to avoid the waste of creating a > > > bitmap with all 1's. > > > > > > For example, if we create an arrow::Array from a normal NumPy array of > > > integers (which cannot have nulls), we have > > > > > > In [6]: import pyarrow as pa > > > In [7]: import numpy as np > > > In [8]: arr = pa.array(np.array([1, 2, 3, 4])) > > > > > > In [9]: arr.buffers() > > > Out[9]: [None, <pyarrow.lib.Buffer at 0x7f34ecd3eea0>] > > > > > > In [10]: arr.null_count > > > Out[10]: 0 > > > > > > Normally, the first buffer would be the validity bitmap memory, but > > > here it was not allocated because there are no nulls. > > > > > > Creating an open standard data representation is a difficult thing; > > > one cannot be "all things to all people" but the intent is to be a > > > suitable lingua franca for language agnostic data interchange and as a > > > runtime representation for analytical query engines (where most > > > operators are "pure"). If the Arrow community's goal were to create a > > > "mutable column store" then some things might be designed differently > > > (perhaps more like internals of https://kudu.apache.org/). It is > > > helpful to have an understanding of what compromises have been made > > > and how costly they are in real world applications. > > > > > > best > > > Wes > > > On Mon, Nov 5, 2018 at 8:27 PM Jacques Nadeau <jacq...@apache.org> > > wrote: > > > > > > > > On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle <mattjdo...@gmail.com> > > wrote: > > > > > > > > > 1. I see. Good idea. Can we assume bitmap is always present in > Arrow > > > then? > > > > > I thought I'd seen Wes argue that if there were no NAs, the bitmap > > > doesn't > > > > > need to be allocated. Indeed I wasn't worried about the extra > > storage, > > > > > although for 10,000 columns I wonder about the number of vectors. > > > > > > > > > > > > > I think different implementations handle this differently at the > > moment. > > > In > > > > the Java code, we allocate the validity buffer at initial allocation > > > > always. We're also looking to enhance the allocation strategy so the > > > fixed > > > > part of values is always allocated with validity (single allocation) > > to > > > > avoid any extra object housekeeping. > > > > > > > > > > > > > 2.
It's only subjective until the code complexity is measured, then > > > it's > > > > > not subjective. I suppose after 20 years of using sentinels, I'm > used > > > to it > > > > > and trust it. I'll keep an open mind on this. > > > > > > > > > Yup, fair enough. > > > > > > > > > > > > > 3. Since I criticized the scale of Wes' benchmark, I felt I should > > > show how > > > > > I do benchmarks myself to show where I'm coming from. Yes > none-null, > > > > > some-null and all-null paths offer savings. But that's the same > under > > > both > > > > > sentinel and bitmap approaches. Under both approaches, you just > need > > to > > > > > know which case you're in. That involves storing the number of NAs > in > > > the > > > > > header/summary which can be done under both approaches. > > > > > > > > > > > > > The item we appreciate is that you can do a single comparison every > 64 > > > > values to determine which of the three cases you are in (make this a > > > local > > > > decision). This means you don't have to do housekeeping ahead of > time. > > It > > > > also means that the window of choice is narrow, minimizing the > penalty > > in > > > > situations where you have rare invalid values (or rare valid values). > > > > > >
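To illustrate Jacques's last point above about deciding the case once per 64 values: here is a rough sketch in plain Python (a hypothetical bitmap packed into 64-bit words; not the actual Java or C++ code, and the names are mine) of how a kernel can pick the no-null / all-null / mixed path per block with a single comparison:

ALL_VALID = (1 << 64) - 1   # all 64 bits set -> no nulls in this block

def sum_with_validity(values, bitmap_words):
    # Sum `values`, honoring a validity bitmap stored as 64-bit words.
    total = 0
    for w, word in enumerate(bitmap_words):
        block = values[w * 64 : w * 64 + 64]
        if word == ALL_VALID:   # no-null block: tight loop, no per-value checks
            total += sum(block)
        elif word == 0:         # all-null block: skip it entirely
            continue
        else:                   # mixed block: test bits individually
            total += sum(v for i, v in enumerate(block) if word & (1 << i))
    return total

values = list(range(128))
words = [ALL_VALID, ALL_VALID & ~(1 << 3)]   # second block has one null at offset 3
print(sum_with_validity(values, words))      # 8128 - 67 = 8061

The same three-way split is possible with a sentinel, but as discussed above it relies on housekeeping such as an NA count stored in a header/summary, whereas with the bitmap the decision is made locally per 64-value word.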