Hi Ben and Ben,

I think it would be good to have a convention of filling null slots in arrays with a known value by default. I think it might be a mistake to use zero as that value, because it can lead to code relying on this behavior. Secure by default is a good approach to take.
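A minimal sketch of what such a convention could look like, in plain Python rather than any Arrow API. The names `FILL` and `fill_null_slots` and the value 0x5A are hypothetical; the thread does not settle on a fill value. The only thing taken from the Arrow format is that validity bitmaps use least-significant-bit numbering.

```python
from array import array

# Hypothetical non-zero fill value; chosen only for illustration.
FILL = 0x5A


def is_valid(validity: bytes, i: int) -> bool:
    # Arrow validity bitmaps are LSB-ordered: slot i is bit (i % 8) of byte (i // 8).
    return ((validity[i // 8] >> (i % 8)) & 1) == 1


def fill_null_slots(values, validity: bytes, fill: int = FILL):
    """Overwrite, in place, every slot whose validity bit is 0 with `fill`."""
    for i in range(len(values)):
        if not is_valid(validity, i):
            values[i] = fill
    return values


# int8 values buffer; slots 0 and 2 are null and hold leftover data.
values = array("b", [17, 2, -99, 4])
validity = bytes([0b00001010])  # bits 1 and 3 set: slots 1 and 3 are valid
fill_null_slots(values, validity)
```

After the call, the valid slots are untouched and the two null slots hold the fill value, so nothing that was previously in that memory is visible to a consumer that ignores the bitmap.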
For kernels in particular, it might be nice to make this configurable. I can imagine that once kernels are lazily chained together, it might be worth postponing the filling until the final kernel processes the data, because intermediate kernels might end up populating or discarding the nulls.

> Even simple operations such as
> u32 + u32 where the user knows that the numbers are small can cause
> problems because a null slot may contain an (uninitialized) number close to
> u32::MAX and the SIMD addition may overflow on the null slot

Does uint32 overflow on SIMD cause issues (I would think this would have to be handled uniformly or could be skipped when you know the values are small)? Or is this simply a performance consideration?

-Micah

On Sat, Feb 20, 2021 at 1:21 PM Antoine Pitrou <anto...@python.org> wrote:

>
> I am definitely in the camp that we should not leak past data through
> uninitialized Arrow memory (for example by transmitting such buffers
> using Arrow IPC).
>
> Regards
>
> Antoine.
>
>
> On 20/02/2021 at 21:17, Benjamin Kietzman wrote:
> > Original discussion at
> > https://github.com/apache/arrow/pull/9471#issuecomment-779944257 (PR for
> > https://issues.apache.org/jira/browse/ARROW-11595 )
> >
> > Although the format does not specify what is contained in array slots
> > masked by null bits (for example the first byte in the data buffer of an
> > int8 array whose first slot is null), there are other considerations which
> > might motivate establishing conventions for some arrays created by the C++
> > implementation:
> > - No spurious complaints from valgrind when running otherwise safe
> > element-wise compute kernels on values under null bits. In the case of
> > ARROW-11595, the values buffer of the result of casting from Type::NA to
> > Type::INT8 is left uninitialized but masked by an entirely-null validity
> > bitmap. When such an array is passed to a comparison kernel, a branch on
> > the uninitialized values triggered valgrind even though the results of that
> > branch were also masked by an empty validity bitmap.
> > - If the underlying values were allocated but not initialized they may leak
> > private information such as private keys, passwords, or tokens which were
> > placed in that memory then freed by an application without overwrite
> > - Improved compression of data buffers (for example in writing to the IPC
> > format), since a run of nulls would correspond to consistent, repeated
> > values in all buffers
> > - Deterministic output from operations which are unable to honor null
> > bitmaps, such as computing the checksum of an IPC file
> >
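The last two points in the quoted list (compression and deterministic checksums) can be illustrated with a small sketch. It uses only the Python standard library and raw byte buffers, not the Arrow IPC writer; `canonicalize` and `FILL` are hypothetical names, and seeded random bytes stand in for whatever happened to be in uninitialized memory.

```python
import hashlib
import random
import zlib

# Hypothetical fill byte; the thread does not settle on a value.
FILL = 0x5A


def canonicalize(values: bytes, validity: bytes, fill: int = FILL) -> bytes:
    """Return a copy of an int8 values buffer with every null slot set to `fill`."""
    out = bytearray(values)
    for i in range(len(out)):
        # LSB-ordered validity bitmap: slot i is bit (i % 8) of byte (i // 8).
        valid = (validity[i // 8] >> (i % 8)) & 1
        if not valid:
            out[i] = fill
    return bytes(out)


n = 4096
validity = bytes(n // 8)  # every bit is 0: an entirely-null array

# Two logically identical all-null arrays whose values buffers hold different garbage.
rng = random.Random(0)
a = bytes(rng.getrandbits(8) for _ in range(n))
b = bytes(rng.getrandbits(8) for _ in range(n))

# A checksum that cannot honor the bitmap sees two different buffers...
raw_equal = hashlib.sha256(a).digest() == hashlib.sha256(b).digest()

# ...but agrees once the null slots hold a known value.
canon_a = canonicalize(a, validity)
canon_b = canonicalize(b, validity)
canon_equal = hashlib.sha256(canon_a).digest() == hashlib.sha256(canon_b).digest()

# A run of identical bytes also compresses far better than leftover data.
raw_size = len(zlib.compress(a))
canon_size = len(zlib.compress(canon_a))
```

Here `raw_equal` is false while `canon_equal` is true, and `canon_size` is much smaller than `raw_size`. The same reasoning would apply to any fixed fill value, zero or not.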