Re: [DISCUSS] Conventions for values masked by null bits

2021-03-08 Thread Jorge Cardoso Leitão
> > If the underlying values were allocated but not initialized they may leak > private information such as private keys, passwords, or tokens which were > placed in that memory then freed by an application without overwrite > I would not be concerned with the security implications of reading thos

Re: [DISCUSS] Conventions for values masked by null bits

2021-03-08 Thread Neal Richardson
Yeah I agree that the general sentinel support has lots of challenges, but in the more narrow case of "read this parquet or CSV file and return an R data.frame", the lifetime of the arrays in question is contained. On Mon, Mar 8, 2021 at 12:42 PM Wes McKinney wrote: > It's a bit outside the scop

Re: [DISCUSS] Conventions for values masked by null bits

2021-03-08 Thread Wes McKinney
It's a bit outside the scope of this discussion, but I've looked at those R Jira issues before, and I think the challenge is how the code will "know" what fill values are being used. If you start putting field-level metadata in a schema object, you're playing a dangerous game if that schema gets at

Re: [DISCUSS] Conventions for values masked by null bits

2021-03-08 Thread Neal Richardson
What was the resolution of this discussion? Was a JIRA made? It occurred to me recently that, if we decided that values masked by null bits need to be filled with a known value, this could open up optimizations in some use cases. For example, when reading a file into R, if we could specify what to

Re: [DISCUSS] Conventions for values masked by null bits

2021-02-20 Thread Antoine Pitrou
On 21/02/2021 at 01:05, Wes McKinney wrote: > I agree that we should avoid leaking uninitialized memory in places > where we have control over it. I could imagine a third party project > having UBSAN warnings and then tracing the origin of them to something > in Arrow that they then have to wor

Re: [DISCUSS] Conventions for values masked by null bits

2021-02-20 Thread Wes McKinney
I agree that we should avoid leaking uninitialized memory in places where we have control over it. I could imagine a third party project having UBSAN warnings and then tracing the origin of them to something in Arrow that they then have to work around. As for the potential performance implications,

Re: [DISCUSS] Conventions for values masked by null bits

2021-02-20 Thread Micah Kornfield
Hi Ben and Ben, I think it would be good to have a convention of filling null slots in arrays with a known value by default. I think it might be a mistake to use zero as the value because it can lead to reliance on this behavior. Secure by default is a good approach to take. For kernels in partic

Re: [DISCUSS] Conventions for values masked by null bits

2021-02-20 Thread Antoine Pitrou
I am definitely in the camp that we should not leak past data through uninitialized Arrow memory (for example by transmitting such buffers using Arrow IPC). Regards Antoine. On 20/02/2021 at 21:17, Benjamin Kietzman wrote: > Original discussion at > https://github.com/apache/arrow/pull/9471

Re: [DISCUSS] Conventions for values masked by null bits

2021-02-20 Thread Jorge Cardoso Leitão
I agree. Below are two notes from a similar discussion on the Rust implementation: 1. In SIMD, for performance reasons, operations are performed over the whole buffer irrespective of the bitmap mask, and the bitmap mask is dealt with separately. If a slot contains an arbitrary value, the operation
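
The first note above (operating over the whole values buffer and handling the validity bitmap separately) can be sketched roughly as follows. This is an illustration only, not code from the Rust or C++ implementations: plain Python lists stand in for Arrow buffers, and every function name is hypothetical.

```python
# Illustrative sketch only: plain Python lists stand in for Arrow buffers,
# and all function names here are hypothetical.

def add_whole_buffer(a_values, a_valid, b_values, b_valid):
    """Add two nullable integer arrays without branching on validity."""
    # Every slot is added, including slots masked by a null bit.
    out_values = [x + y for x, y in zip(a_values, b_values)]
    # Validity is handled separately: a result slot is valid only if
    # both input slots are valid.
    out_valid = [va and vb for va, vb in zip(a_valid, b_valid)]
    return out_values, out_valid


def fill_masked(values, valid, fill=0):
    """Overwrite masked slots with a known fill value."""
    return [v if ok else fill for v, ok in zip(values, valid)]


def logical(values, valid):
    """View the array as a consumer would, with None for null slots."""
    return [v if ok else None for v, ok in zip(values, valid)]


a_values, a_valid = [1, 99, 3], [True, False, True]  # 99 sits under a null bit
b_values, b_valid = [10, 20, 30], [True, True, True]

out_values, out_valid = add_whole_buffer(a_values, a_valid, b_values, b_valid)
print(logical(out_values, out_valid))      # [11, None, 33]
print(fill_masked(out_values, out_valid))  # [11, 0, 33]
```

In this sketch the value under the null bit (99) flows into the output buffer as 119 even though no consumer should read it, which is why the contents of masked slots matter once a buffer leaves the process.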

[DISCUSS] Conventions for values masked by null bits

2021-02-20 Thread Benjamin Kietzman
Original discussion at https://github.com/apache/arrow/pull/9471#issuecomment-779944257 (PR for https://issues.apache.org/jira/browse/ARROW-11595 ) Although the format does not specify what is contained in array slots masked by null bits (for example the first byte in the data buffer of an int8 ar