It's a bit outside the scope of this discussion, but I've looked at those R Jira issues before, and I think the challenge is how the code will "know" what fill values are being used. If you start putting field-level metadata in a schema object, you're playing a dangerous game if that schema gets attached to a record batch / array where the same fill value is not being used. The only "safe" way, I think, would be to have metadata at the ArrayData level, but I'm not sure that's a good idea.
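
To make the concern concrete, here is a rough sketch in plain Python (no Arrow APIs involved). The metadata key `null_fill_value` and the helper names are made up for illustration; the only real facts assumed are that R's NA_integer_ is INT_MIN and that Arrow validity bitmaps are least-significant-bit first. It shows one schema-level claim being correct for one batch and wrong for another:

```python
# R's NA_integer_ sentinel is INT_MIN for 32-bit integers.
R_NA_INTEGER = -2**31


def is_valid(bitmap: bytes, i: int) -> bool:
    """Arrow-style validity bitmap: least-significant bit first in each byte."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1


def fill_matches(values, bitmap: bytes, claimed_fill: int) -> bool:
    """Return True if every null slot really holds the claimed fill value."""
    return all(
        values[i] == claimed_fill
        for i in range(len(values))
        if not is_valid(bitmap, i)
    )


# Hypothetical field-level metadata living on a shared schema object.
field_metadata = {"null_fill_value": R_NA_INTEGER}

# Slots 0 and 2 are valid, slot 1 is null (bits 0b101).
bitmap = bytes([0b00000101])

# Batch A: the producer wrote the sentinel into the null slot.
values_a = [1, R_NA_INTEGER, 3]
# Batch B: same schema attached, but the null slot holds arbitrary data.
values_b = [1, 0, 3]

claimed = field_metadata["null_fill_value"]
ok_a = fill_matches(values_a, bitmap, claimed)  # claim holds for batch A
ok_b = fill_matches(values_b, bitmap, claimed)  # claim is wrong for batch B
```

A consumer that trusted the schema and skipped the check would hand batch B to R with a 0 where an NA should be, which is why the claim would have to travel with the data itself to be trustworthy.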

On Mon, Mar 8, 2021 at 1:07 PM Neal Richardson <neal.p.richard...@gmail.com> wrote:
>
> What was the resolution of this discussion? Was a JIRA made?
>
> It occurred to me recently that, if we decided that values masked by null
> bits need to be filled with a known value, this could open up optimizations
> in some use cases. For example, when reading a file into R, if we could
> specify what to use for the known null values, we could use R's missing
> value sentinels and then get pure zero-copy access. Some related JIRAs:
>
> https://issues.apache.org/jira/browse/ARROW-8348
> https://issues.apache.org/jira/browse/ARROW-7767
> https://issues.apache.org/jira/browse/ARROW-3263
>
> Neal
>
> On Sat, Feb 20, 2021 at 4:30 PM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > On 21/02/2021 at 01:05, Wes McKinney wrote:
> > > I agree that we should avoid leaking uninitialized memory in places
> > > where we have control over it. I could imagine a third party project
> > > having UBSAN warnings and then tracing the origin of them to something
> > > in Arrow that they then have to work around. As for the potential
> > > performance implications, we'll have to be vigilant with
> > > microbenchmarks.
> >
> > We're generally already doing this when we're careful, so we're already
> > paying the price (which I would estimate intuitively quite small).
> > Unfortunately, there doesn't seem to be an obvious way to check it
> > systematically on CI, but Valgrind can occasionally uncover it.
> >
> > Regards
> >
> > Antoine.
> >