>
> If the underlying values were allocated but not initialized they may leak
> private information such as private keys, passwords, or tokens which were
> placed in that memory then freed by an application without overwrite
>

I would not be concerned with the security implications of reading those
values: the owner of a memory region is responsible for erasing its
contents before deallocating it, if they so wish. IMO there is no
expectation that a program must consider all regions returned by malloc as
containing sensitive information.


> Does uint32 overflow on SIMD cause issues (I would think this would have to
> be handled uniformly or could be skipped when you know the values are
> small)?  Or is this simply a performance consideration?
>

I would say it is both an API and a non-determinism concern.

Let's say the user has an operation X that may overflow (e.g. a + b + 1),
and they would like to use it on the buffer and leverage LLVM's
auto-vectorization. In Rust, that is essentially `let iter =
lhs_rhs_zip.map(|(x, y)| x + y + 1); Buffer::from_trusted_len_iter(iter);`.

1. API: If we do not initialize values, the operation must be
`lhs_rhs_zip.map(|(x, y)| x.saturating_add(y).saturating_add(1))` instead, as
any null slot can overflow. IMO this forces users to change their semantic
intent on valid slots to address uninitialized values on null slots.

2. non-determinism (1): when the user uses `a + b + 1` and null slots are
uninitialized, the operation may or may not overflow on null slots, and
thus the program becomes non-deterministic. This can lead to crashes that
are difficult to reproduce, as a given execution may or may not crash.

3. non-determinism (2): many types have "trap values"
<https://en.cppreference.com/w/cpp/types/numeric_limits/traps>,
representations that crash the program if arithmetic is performed on
them.
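
To make points 1 and 2 concrete, here is a minimal sketch in plain Rust
(no Arrow types). The function names are hypothetical, and since reading
truly uninitialized memory is undefined behavior in Rust, a value close
to `u32::MAX` stands in for whatever garbage a null slot might contain.

```rust
/// Indices of the slots where `a + b + 1` overflows a `u32`.
/// Hypothetical helper, for illustration only.
fn overflowing_slots(lhs: &[u32], rhs: &[u32]) -> Vec<usize> {
    lhs.iter()
        .zip(rhs.iter())
        .enumerate()
        .filter(|(_, (&x, &y))| {
            // `checked_add` returns `None` on overflow instead of panicking.
            x.checked_add(y).and_then(|s| s.checked_add(1)).is_none()
        })
        .map(|(i, _)| i)
        .collect()
}

/// The saturating variant from point 1: it never panics, but it also
/// changes what the operation means on valid slots.
fn add_one_saturating(lhs: &[u32], rhs: &[u32]) -> Vec<u32> {
    lhs.iter()
        .zip(rhs.iter())
        .map(|(&x, &y)| x.saturating_add(y).saturating_add(1))
        .collect()
}
```

With slot 1 null and zero-filled, `overflowing_slots(&[1, 0, 3], &[10, 0,
30])` is empty. If slot 1 instead happens to hold `u32::MAX - 1`, the
same call on `&[1, u32::MAX - 1, 3]` and `&[10, 7, 30]` reports slot 1,
even though every valid slot holds a small number.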

I think we are missing the other side, though: the benefits. Has anyone
measured the performance impact of initializing vs not initializing as a
function of e.g. the null density of the array?

Best,
Jorge



On Sat, Feb 20, 2021 at 10:37 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Ben and Ben,
> I think it would be good to have a convention for by default filling null
> slots in arrays with known value.  I think it might be a mistake to use
> zero as the value because it can lead to reliance on this behavior.  Secure
> by default is a good approach to take.
>
> For kernels in particular, it might be nice to make this configurable.  I
> can imagine when we get to the point of kernels lazily chained together, it
> might be worth postponing the filling until the final kernel processes the
> data, because intermediate kernels might end up populating or discarding
> the nulls.
>
> Even simple operations such as
> > u32 + u32 where the user knows that the numbers are small can cause
> > problems because a null slot may contain an (uninitialized) number close to
> > u32::MAX and the SIMD addition may overflow on the null slot
>
> Does uint32 overflow on SIMD cause issues (I would think this would have to
> be handled uniformly or could be skipped when you know the values are
> small)?  Or is this simply a performance consideration?
>
> -Micah
>
>
>
>
>
>
> On Sat, Feb 20, 2021 at 1:21 PM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > I am definitely in the camp that we should not leak past data through
> > uninitialized Arrow memory (for example by transmitting such buffers
> > using Arrow IPC).
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 20/02/2021 à 21:17, Benjamin Kietzman a écrit :
> > > Original discussion at
> > > https://github.com/apache/arrow/pull/9471#issuecomment-779944257 (PR
> for
> > > https://issues.apache.org/jira/browse/ARROW-11595 )
> > >
> > > Although the format does not specify what is contained in array slots
> > > masked by null bits (for example the first byte in the data buffer of
> an
> > > int8 array whose first slot is null), there are other considerations
> > which
> > > might motivate establishing conventions for some arrays created by the
> > C++
> > > implementation:
> > > - No spurious complaints from valgrind when running otherwise safe
> > > element-wise compute kernels on values under null bits. In the case of
> > > ARROW-11595, the values buffer of the result of casting from Type::NA
> to
> > > Type::INT8 is left uninitialized but masked by an entirely-null
> validity
> > > bitmap. When such an array is passed to a comparison kernel, a branch
> on
> > > the uninitialized values triggered valgrind even though the results of
> > that
> > > branch were also masked by an empty validity bitmap.
> > > - If the underlying values were allocated but not initialized they may
> > leak
> > > private information such as private keys, passwords, or tokens which
> were
> > > placed in that memory then freed by an application without overwrite
> > > - Improved compression of data buffers (for example in writing to the
> IPC
> > > format), since a run of nulls would correspond to consistent, repeated
> > > values in all buffers
> > > - Deterministic output from operations which are unable to honor null
> > > bitmaps, such as computing the checksum of an IPC file
> > >
> >
>
