Re: [DISCUSS] Conventions for values masked by null bits

Neal Richardson Mon, 08 Mar 2021 13:16:58 -0800

Yeah, I agree that general sentinel support has lots of challenges, but
in the narrower case of "read this parquet or CSV file and return an R
data.frame", the lifetime of the arrays in question is contained.
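
To make the narrow case concrete, here is a minimal sketch of filling masked slots with R's integer missing-value sentinel. It is an illustration only, not Arrow's or the R package's actual code: the helper names `is_valid` and `fill_masked_slots` and the sample column are hypothetical. It relies on two facts: Arrow validity bitmaps use least-significant-bit ordering with 1 meaning valid, and R's NA_integer_ is the minimum 32-bit signed integer.

```python
from array import array

# R represents a missing integer (NA_integer_) as the minimum 32-bit signed int.
R_NA_INTEGER = -2**31


def is_valid(validity: bytes, i: int) -> bool:
    """Return True if slot i is valid (Arrow bitmaps are LSB-ordered, 1 = valid)."""
    return (validity[i // 8] >> (i % 8)) & 1 == 1


def fill_masked_slots(values: array, validity: bytes, sentinel: int) -> None:
    """Overwrite, in place, every slot whose validity bit is 0 (hypothetical helper)."""
    for i in range(len(values)):
        if not is_valid(validity, i):
            values[i] = sentinel


# Hypothetical int32 column [10, null, 30, null]; masked slots hold arbitrary data.
values = array("i", [10, 999, 30, -7])
validity = bytes([0b00000101])  # bits 0 and 2 set: slots 0 and 2 are valid

fill_masked_slots(values, validity, R_NA_INTEGER)
print(values.tolist())
```

After the fill, the values buffer alone carries the missingness for an R consumer, which is what would allow it to be handed over without a copy. The sketch does nothing about the concern raised below, namely how downstream code would know which fill value a given array uses.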


On Mon, Mar 8, 2021 at 12:42 PM Wes McKinney <[email protected]> wrote:

> It's a bit outside the scope of this discussion, but I've looked at
> those R Jira issues before, and I think the challenge is how the code
> will "know" what fill values are being used. If you start putting
> field-level metadata in a schema object, you're playing a dangerous
> game if that schema gets attached to a record batch / array where the
> same fill value is not being used. The only "safe" way, I think, would
> be to have metadata at the ArrayData level, but I'm not sure that's a
> good idea.
>
> On Mon, Mar 8, 2021 at 1:07 PM Neal Richardson
> <[email protected]> wrote:
> >
> > What was the resolution of this discussion? Was a JIRA made?
> >
> > It occurred to me recently that, if we decided that values masked by null
> > bits need to be filled with a known value, this could open up optimizations
> > in some use cases. For example, when reading a file into R, if we could
> > specify what to use for the known null values, we could use R's missing
> > value sentinels and then get pure zero-copy access. Some related JIRAs:
> >
> > https://issues.apache.org/jira/browse/ARROW-8348
> > https://issues.apache.org/jira/browse/ARROW-7767
> > https://issues.apache.org/jira/browse/ARROW-3263
> >
> > Neal
> >
> > On Sat, Feb 20, 2021 at 4:30 PM Antoine Pitrou <[email protected]> wrote:
> >
> > >
> > > On 21/02/2021 at 01:05, Wes McKinney wrote:
> > > > I agree that we should avoid leaking uninitialized memory in places
> > > > where we have control over it. I could imagine a third party project
> > > > having UBSAN warnings and then tracing the origin of them to something
> > > > in Arrow that they then have to work around. As for the potential
> > > > performance implications, we'll have to be vigilant with
> > > > microbenchmarks.
> > >
> > > We're generally already doing this when we're careful, so we're already
> > > paying the price (which I would intuitively estimate to be quite small).
> > > Unfortunately, there doesn't seem to be an obvious way to check it
> > > systematically on CI, but Valgrind can occasionally uncover it.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
>
