It does make sense. I would go a little further and make this field/property a single value of the same type as the array. This would allow using any arbitrary sentinel value for unknown values (0 in your suggested case). The end result is zero-copy for R bindings (if the stars are aligned). I created ARROW-8348 [1] for this.

François

[1] https://jira.apache.org/jira/browse/ARROW-8348
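To make the idea concrete, here is a rough sketch of what consuming such a hint could look like with pyarrow. Nothing below is part of the Arrow format or API: the `na_are_zero` and `na_sentinel` metadata keys are hypothetical, and the example simply assumes the producer really did write zeros under the null slots, so a mean needs no per-element null check.

    import numpy as np
    import pyarrow as pa

    # Hypothetical convention: field metadata announcing that the values
    # buffer holds the sentinel (here 0) in every null slot.
    field = pa.field("x", pa.float64(),
                     metadata={b"na_are_zero": b"true", b"na_sentinel": b"0"})

    # Build an array whose values buffer really does have zeros under the
    # nulls (elements 1 and 3 are null).
    values = np.array([1.0, 0.0, 3.0, 0.0])
    validity = np.packbits([1, 0, 1, 0], bitorder="little")
    arr = pa.Array.from_buffers(pa.float64(), 4,
                                [pa.py_buffer(validity.tobytes()),
                                 pa.py_buffer(values.tobytes())],
                                null_count=2)

    def mean_ignoring_nulls(a: pa.Array) -> float:
        # Zeroed null slots contribute nothing to the sum; the null count
        # stored in the array metadata corrects the denominator.
        vals = np.frombuffer(a.buffers()[1], dtype=np.float64, count=len(a))
        return vals.sum() / (len(a) - a.null_count)

    print(mean_ignoring_nulls(arr))  # 2.0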
On Mon, Apr 6, 2020 at 11:02 AM Felix Benning <felix.benn...@gmail.com> wrote:
>
> Would it make sense to have an `na_are_zero` flag? Since null checking is not without cost, it might be helpful to some algorithms if the content "underneath" the nulls is zero. For example, in means or scalar products, and thus matrix multiplication, knowing that the array has zeros where the NAs are would allow these algorithms to pretend that there are no NAs. Since setting all nulls to zero in a matrix of n columns and n rows costs O(n^2), it would make sense to do so before matrix multiplication (O(n^3)) and similarly expensive algorithms. If there were an `na_are_zero` flag, other algorithms could later reuse this work. Algorithms which change the data and violate this contract would only need to reset the flag. And in some use cases it might be possible to use idle time of the computer to "clean up" the NAs, preparing for the next query.
>
> Felix
>
> ---------- Forwarded message ---------
> From: Wes McKinney <wesmck...@gmail.com>
> Date: Sun, 5 Apr 2020 at 22:31
> Subject: Re: Attn: Wes, Re: Masked Arrays
> To: <u...@arrow.apache.org>
>
> As I recall, the contents "underneath" have been discussed before and the consensus was that the contents are not specified. If you'd like to make a proposal to change something, I would suggest raising it on dev@arrow.apache.org
>
> On Sun, Apr 5, 2020 at 1:56 PM Felix Benning <felix.benn...@gmail.com> wrote:
> >
> > Follow-up: do you think it would make sense to have an `na_are_zero` flag? It appears that the baseline (naively assuming there are no null values) is still a bit faster than equally optimized null-handling algorithms, so you might want to assume that all null values are set to zero in the array (instead of being undefined). This would allow very fast means, scalar products, and thus matrix multiplications that ignore NAs. In the case of matrix multiplication, you might prefer sacrificing an O(n^2) effort to set all null entries to zero before multiplying. And assuming you do not overwrite this data, such a flag would let you reuse that assumption in later computations.
> >
> > In some use cases you might even be able to use spare computing resources for this task, i.e. clean up the nulls while the computer is idle, preparing for the next query.
> >
> > On Sun, 5 Apr 2020 at 18:34, Felix Benning <felix.benn...@gmail.com> wrote:
> >>
> >> Awesome, that was exactly what I was looking for, thank you!
> >>
> >> On Sun, 5 Apr 2020 at 00:40, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>
> >>> I wrote a blog post a couple of years ago about this:
> >>>
> >>> https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> >>>
> >>> Pasha Stetsenko did a follow-up analysis that showed that my "sentinel" code could be significantly improved, see:
> >>>
> >>> https://github.com/st-pasha/microbench-nas/blob/master/README.md
> >>>
> >>> Generally speaking, in Apache Arrow we've been happy to have a uniform representation of nullness across all types, both primitive (booleans, numbers, or strings) and nested (lists, structs, unions, etc.). Many computational operations (like elementwise functions) need not concern themselves with the nulls at all, for example, since the bitmap from the input array can be passed along (with zero copy even) to the output array.
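A minimal sketch of that bitmap pass-through, using pyarrow's buffer-level API (the `negate` function here is illustrative, not an Arrow kernel): an elementwise operation recomputes only the values buffer and attaches the input's validity buffer to the output unchanged.

    import numpy as np
    import pyarrow as pa

    def negate(a: pa.Array) -> pa.Array:
        # Elementwise negation for float64 arrays; buffers() returns
        # [validity bitmap (may be None when there are no nulls), values].
        validity, data = a.buffers()
        values = np.frombuffer(data, dtype=np.float64, count=len(a))
        out_values = pa.py_buffer((-values).tobytes())
        # The validity buffer object is shared with the input: zero copy.
        return pa.Array.from_buffers(a.type, len(a), [validity, out_values],
                                     null_count=a.null_count)

    a = pa.array([1.0, None, 3.0], type=pa.float64())
    print(negate(a))  # [-1, null, -3]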
> >>> On Sat, Apr 4, 2020 at 4:39 PM Felix Benning <felix.benn...@gmail.com> wrote:
> >>> >
> >>> > Does anyone have an opinion (or links) about bitpatterns vs. masked arrays for NA implementations? There seems to have been a discussion about that in the numpy community in 2012, https://numpy.org/neps/nep-0026-missing-data-summary.html, without an apparent result.
> >>> >
> >>> > Summary of the summary:
> >>> > - The bitpattern approach reserves one bit pattern of each type as NA; the only type without spare bit patterns is integers, which means their range decreases by one. This approach is taken by R and was regarded as more performant in 2012.
> >>> > - The mask approach was deemed more flexible, since it would allow "degrees of missingness", and also a cleaner/easier implementation.
> >>> >
> >>> > Since bitpattern checks would probably disrupt SIMD, I feel like some calculations (e.g. mean) would actually benefit more from setting NA values to zero, proceeding as if they were not there, and using the number of NAs in the metadata to adjust the result. This of course does not work if two columns are used (e.g. a scalar product), which is probably more important.
> >>> >
> >>> > Was using bitmasks in Arrow a conscious performance decision? Or was the decision only based on the fact that R and bitpattern implementations in general are a niche, which means that bitmasks are more compatible with other languages?
> >>> >
> >>> > I am curious about this topic, since the "lack of proper NA support" was cited as the reason why Python would never replace R in statistics.
> >>> >
> >>> > Thanks,
> >>> >
> >>> > Felix
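The two representations being compared can be sketched in a few lines of numpy. The INT_MIN sentinel below matches R's NA_integer_; everything else (data, helper names) is purely illustrative, not how R or Arrow are implemented.

    import numpy as np

    # Two ways to represent the same nullable int32 column [7, NA, 5, NA].

    # Bitpattern / sentinel (R-style): one reserved value of the type means NA.
    INT32_NA = np.int32(np.iinfo(np.int32).min)       # R's NA_integer_ is INT_MIN
    sentinel_col = np.array([7, INT32_NA, 5, INT32_NA], dtype=np.int32)
    valid_from_sentinel = sentinel_col != INT32_NA    # null check reads the values

    # Separate validity bitmap (Arrow-style): values keep their full range,
    # and one bit per slot says whether it is valid.
    values = np.array([7, 0, 5, 0], dtype=np.int32)
    validity_bitmap = np.packbits([1, 0, 1, 0], bitorder="little")
    valid_from_bitmap = np.unpackbits(validity_bitmap, count=4,
                                      bitorder="little").astype(bool)

    assert (valid_from_sentinel == valid_from_bitmap).all()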
> >>> > On 31.03.20 14:52, Joris Van den Bossche wrote:
> >>> >
> >>> > Note that pandas is starting to use a notion of "masked arrays" as well, for example for its nullable integer data type, but it is also not using the np.ma masked array; it uses a custom implementation (for technical reasons this was easier in pandas).
> >>> >
> >>> > Also, there was quite some discussion last year in numpy about a possible re-implementation of MaskedArray, but using numpy's protocols (`__array_ufunc__`, `__array_function__`, etc.) instead of being a subclass like np.ma is now. See e.g. https://mail.python.org/pipermail/numpy-discussion/2019-June/079681.html.
> >>> >
> >>> > Joris
> >>> >
> >>> > On Mon, 30 Mar 2020 at 18:57, Daniel Nugent <nug...@gmail.com> wrote:
> >>> >>
> >>> >> Ok. That actually aligns closely with what I'm familiar with. Good to know.
> >>> >>
> >>> >> Thanks again for taking the time to respond,
> >>> >>
> >>> >> -Dan Nugent
> >>> >>
> >>> >> On Mon, Mar 30, 2020 at 12:38 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>> >>>
> >>> >>> Social and technical reasons, I guess. Empirically it's just not used much.
> >>> >>>
> >>> >>> You can see my comments about numpy.ma in my 2010 paper about pandas:
> >>> >>>
> >>> >>> https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf
> >>> >>>
> >>> >>> At least in 2010, there were notable performance problems when using MaskedArray for computations:
> >>> >>>
> >>> >>> "We chose to use NaN as opposed to using NumPy MaskedArrays for performance reasons (which are beyond the scope of this paper), as NaN propagates in floating-point operations in a natural way and can be easily detected in algorithms."
> >>> >>>
> >>> >>> On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent <nug...@gmail.com> wrote:
> >>> >>> >
> >>> >>> > Thanks! Since I'm just using it to jump to Arrow, I think I'll stick with it.
> >>> >>> >
> >>> >>> > Do you have any feelings about why numpy's masked arrays didn't gain favor, when many data representation formats explicitly support nullity (including Arrow)? Is it just that not carrying nulls forward in computations is preferable (that is, early filtering/value filling was easier)?
> >>> >>> >
> >>> >>> > -Dan Nugent
> >>> >>> >
> >>> >>> > On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >>> >>> >>
> >>> >>> >> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <nug...@gmail.com> wrote:
> >>> >>> >> >
> >>> >>> >> > Didn't want to follow up on this on the Jira issue earlier since it's sort of tangential to that bug and more of a usage question. You said:
> >>> >>> >> >
> >>> >>> >> > > I wouldn't recommend building applications based on them nowadays since the level of support / compatibility in other projects is low.
> >>> >>> >> >
> >>> >>> >> > In my case, I am using them since they seemed like a straightforward representation of my data that has nulls, the format I'm converting from has zero-cost numpy representations, and converting from an internal format into Arrow in-memory structures appears zero cost (or close to it) as well. I guess I can just provide the mask as an explicit argument, but my original desire to use it came from being able to exploit numpy.ma.concatenate in a way that saved some complexity in the implementation.
> >>> >>> >> >
> >>> >>> >> > Since Arrow itself supports masking values with a bitfield, is there something intrinsic to the notion of array masks that is not well supported? Or do you just mean the specific numpy MaskedArray class?
> >>> >>> >>
> >>> >>> >> I mean just the numpy.ma module. Not many Python computing projects nowadays treat MaskedArray objects as first-class citizens. Depending on what you need, it may or may not be a problem. pyarrow supports ingesting from MaskedArray as a convenience, but in my experience it would not be common for a library's APIs to return MaskedArrays.
> >>> >>> >>
> >>> >>> >> > If this is too much of a numpy question rather than an Arrow question, could you point me to where I can read up on masked array support, or maybe the right place to ask the numpy community whether what I'm doing is appropriate or not?
> >>> >>> >> >
> >>> >>> >> > Thanks,
> >>> >>> >> >
> >>> >>> >> > -Dan Nugent
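For reference, the two ingestion paths discussed above (the MaskedArray convenience Wes mentions and the explicit mask argument Dan considers) might look as follows with a recent pyarrow; the data is made up for illustration.

    import numpy as np
    import numpy.ma as ma
    import pyarrow as pa

    data = np.array([1.5, 2.5, 3.5])
    mask = np.array([False, True, False])   # True marks a missing value

    # Convenience path: pyarrow ingests numpy MaskedArrays directly.
    from_masked = pa.array(ma.masked_array(data, mask=mask))

    # Explicit path: pass the mask as an argument instead of using numpy.ma.
    from_explicit = pa.array(data, mask=mask)

    assert from_masked.equals(from_explicit)   # both give [1.5, null, 3.5]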