> > If the underlying values were allocated but not initialized they may leak
> > private information such as private keys, passwords, or tokens which were
> > placed in that memory then freed by an application without overwrite
I would not be concerned with the security implications of reading those values: the owner of a memory region is responsible for erasing its contents before deallocating it, if they wish to. IMO there is no expectation that a program must treat every region returned by malloc as containing sensitive information.

> Does uint32 overflow on SIMD cause issues (I would think this would have to
> be handled uniformly or could be skipped when you know the values are
> small)? Or is this simply a performance consideration?

I would say it is both an API and a non-determinism concern. Let's say the user has an operation X that may overflow (e.g. a + b + 1), and they would like to apply it to the buffer and leverage LLVM's auto-vectorization. In Rust, that is essentially `let iter = lhs_rhs_zip.map(|(x, y)| x + y + 1); Buffer::from_trusted_len_iter(iter);`.

1. API: if we do not initialize values, the operation must be `lhs_rhs_zip.map(|(x, y)| x.saturating_add(y).saturating_add(1))` instead, since any null slot may hold a value that overflows. IMO this forces the user to change the semantic intent on valid slots in order to cope with uninitialized values on null slots.

2. Non-determinism (1): when the user uses `a + b + 1` and null slots are uninitialized, the operation may or may not overflow on null slots, and thus the program becomes non-deterministic. This can lead to crashes that are difficult to reproduce, as a given execution may or may not crash.

3. Non-determinism (2): many types have "trap values" <https://en.cppreference.com/w/cpp/types/numeric_limits/traps>, i.e. representations that crash the program if arithmetic is performed on them.

I think we are missing the other side, though: the benefits. Has anyone measured the performance impact of initializing vs. not initializing as a function of, e.g., the null density of the array?
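As a rough illustration of points 1 and 2 above, here is a minimal sketch in plain Rust. It does not use the Arrow crate: the `validity` slice stands in for a validity bitmap, the helper names are made up for this example, and the "garbage" value under the null slot is simulated with a hard-coded number, since reading truly uninitialized memory is undefined behaviour in Rust.

```rust
/// Element-wise `a + b + 1` that reports overflow per slot instead of
/// panicking or wrapping. `None` means the sum did not fit in a u32.
fn add_plus_one_checked(lhs: &[u32], rhs: &[u32]) -> Vec<Option<u32>> {
    lhs.iter()
        .zip(rhs.iter())
        .map(|(x, y)| x.checked_add(*y).and_then(|s| s.checked_add(1)))
        .collect()
}

/// Counts overflowing slots, split into (on valid slots, on null slots).
fn overflow_counts(sums: &[Option<u32>], validity: &[bool]) -> (usize, usize) {
    let mut on_valid = 0;
    let mut on_null = 0;
    for (sum, is_valid) in sums.iter().zip(validity.iter()) {
        if sum.is_none() {
            if *is_valid { on_valid += 1 } else { on_null += 1 }
        }
    }
    (on_valid, on_null)
}

fn main() {
    // Slot 1 is null; this slice is a stand-in for an Arrow validity bitmap.
    let validity = [true, false, true];
    let rhs = [10u32, 20, 30];

    // Simulated leftover value under the null slot (assumption for the demo).
    let lhs_garbage = [1u32, u32::MAX - 5, 3];
    // The same logical array with the null slot filled with a known value.
    let lhs_filled = [1u32, 0, 3];

    let garbage = overflow_counts(&add_plus_one_checked(&lhs_garbage, &rhs), &validity);
    let filled = overflow_counts(&add_plus_one_checked(&lhs_filled, &rhs), &validity);

    println!("(valid, null) overflows with garbage under null: {:?}", garbage);
    println!("(valid, null) overflows with filled null slot:   {:?}", filled);
}
```

All valid slots hold small numbers in both inputs; the only difference is what sits under the null slot, and that alone decides whether an unchecked `x + y + 1` would overflow.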
Best,
Jorge

On Sat, Feb 20, 2021 at 10:37 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Ben and Ben,
> I think it would be good to have a convention for by default filling null
> slots in arrays with a known value. I think it might be a mistake to use
> zero as the value because it can lead to reliance on this behavior. Secure
> by default is a good approach to take.
>
> For kernels in particular, it might be nice to make this configurable. I
> can imagine when we get to the point of kernels lazily chained together, it
> might be worth postponing the filling until the final kernel processes the
> data, because intermediate kernels might end up populating or discarding
> the nulls.
>
> > Even simple operations such as
> > u32 + u32 where the user knows that the numbers are small can cause
> > problems because a null slot may contain an (uninitialized) number close to
> > u32::MAX and the SIMD addition may overflow on the null slot
>
> Does uint32 overflow on SIMD cause issues (I would think this would have to
> be handled uniformly or could be skipped when you know the values are
> small)? Or is this simply a performance consideration?
>
> -Micah
>
> On Sat, Feb 20, 2021 at 1:21 PM Antoine Pitrou <anto...@python.org> wrote:
>
> > I am definitely in the camp that we should not leak past data through
> > uninitialized Arrow memory (for example by transmitting such buffers
> > using Arrow IPC).
> >
> > Regards
> >
> > Antoine.
> >
> > On 20/02/2021 at 21:17, Benjamin Kietzman wrote:
> > > Original discussion at
> > > https://github.com/apache/arrow/pull/9471#issuecomment-779944257 (PR for
> > > https://issues.apache.org/jira/browse/ARROW-11595 )
> > >
> > > Although the format does not specify what is contained in array slots
> > > masked by null bits (for example the first byte in the data buffer of an
> > > int8 array whose first slot is null), there are other considerations which
> > > might motivate establishing conventions for some arrays created by the C++
> > > implementation:
> > > - No spurious complaints from valgrind when running otherwise safe
> > >   element-wise compute kernels on values under null bits. In the case of
> > >   ARROW-11595, the values buffer of the result of casting from Type::NA to
> > >   Type::INT8 is left uninitialized but masked by an entirely-null validity
> > >   bitmap. When such an array is passed to a comparison kernel, a branch on
> > >   the uninitialized values triggered valgrind even though the results of
> > >   that branch were also masked by an empty validity bitmap.
> > > - If the underlying values were allocated but not initialized they may leak
> > >   private information such as private keys, passwords, or tokens which were
> > >   placed in that memory then freed by an application without overwrite
> > > - Improved compression of data buffers (for example in writing to the IPC
> > >   format), since a run of nulls would correspond to consistent, repeated
> > >   values in all buffers
> > > - Deterministic output from operations which are unable to honor null
> > >   bitmaps, such as computing the checksum of an IPC file