It does make sense. I would go a little further and make this field/property a single value of the same type as the array. This would allow using any arbitrary sentinel value for unknown values (0 in your suggested case). The end result is zero-copy for R bindings (if the stars are aligned). I created ARROW-8348 [1] for this.

François

[1] https://jira.apache.org/jira/browse/ARROW-8348
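To make the idea concrete, here is a rough sketch of what consuming such a hint could look like with pyarrow. Nothing below is part of the Arrow format or API: the `na_are_zero` and `na_sentinel` metadata keys are hypothetical, and the example simply assumes the producer really did write zeros under the null slots, so a mean needs no per-element null check.

    import numpy as np
    import pyarrow as pa

    # Hypothetical convention: field metadata announcing that the values
    # buffer holds the sentinel (here 0) in every null slot.
    field = pa.field("x", pa.float64(),
                     metadata={b"na_are_zero": b"true", b"na_sentinel": b"0"})

    # Build an array whose values buffer really does have zeros under the
    # nulls (elements 1 and 3 are null).
    values = np.array([1.0, 0.0, 3.0, 0.0])
    validity = np.packbits([1, 0, 1, 0], bitorder="little")
    arr = pa.Array.from_buffers(pa.float64(), 4,
                                [pa.py_buffer(validity.tobytes()),
                                 pa.py_buffer(values.tobytes())],
                                null_count=2)

    def mean_ignoring_nulls(a: pa.Array) -> float:
        # Zeroed null slots contribute nothing to the sum; the null count
        # stored in the array metadata corrects the denominator.
        vals = np.frombuffer(a.buffers()[1], dtype=np.float64, count=len(a))
        return vals.sum() / (len(a) - a.null_count)

    print(mean_ignoring_nulls(arr))  # 2.0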
On Mon, Apr 6, 2020 at 11:02 AM Felix Benning <felix.benn...@gmail.com> wrote:
>
> Would it make sense to have an `na_are_zero` flag? Since null checking is not without cost, it might be helpful to some algorithms if the content "underneath" the nulls is zero. For example, in means or scalar products, and thus matrix multiplication, knowing that the array has zeros where the NAs are would allow these algorithms to pretend that there are no NAs. Since setting all nulls to zero in a matrix of n columns and n rows costs O(n^2), it would make sense to do so before matrix multiplication (O(n^3)) and similarly expensive algorithms. If there were an `na_are_zero` flag, other algorithms could later reuse this work. Algorithms which change the data and violate this contract would only need to reset the flag. And in some use cases it might be possible to use idle time of the computer to "clean up" the NAs, preparing for the next query.
>
> Felix
>
> ---------- Forwarded message ---------
> From: Wes McKinney <wesmck...@gmail.com>
> Date: Sun, 5 Apr 2020 at 22:31
> Subject: Re: Attn: Wes, Re: Masked Arrays
> To: <u...@arrow.apache.org>
>
> As I recall, the contents "underneath" have been discussed before and the consensus was that the contents are not specified. If you'd like to make a proposal to change something, I would suggest raising it on dev@arrow.apache.org
>
> On Sun, Apr 5, 2020 at 1:56 PM Felix Benning <felix.benn...@gmail.com> wrote:
> >
> > Follow-up: do you think it would make sense to have an `na_are_zero` flag? It appears that the baseline (naively assuming there are no null values) is still a bit faster than equally optimized null-handling algorithms, so you might want to assume that all null values are set to zero in the array (instead of being undefined). This would allow very fast means, scalar products, and thus matrix multiplications that ignore NAs. In the case of matrix multiplication, you might prefer sacrificing an O(n^2) effort to set all null entries to zero before multiplying. And assuming you do not overwrite this data, such a flag would let you reuse that assumption in later computations.
> >
> > In some use cases you might even be able to use spare computing resources for this task, i.e. clean up the nulls while the computer is idle, preparing for the next query.
> >
> > On Sun, 5 Apr 2020 at 18:34, Felix Benning <felix.benn...@gmail.com> wrote:
> >>
> >> Awesome, that was exactly what I was looking for, thank you!
> >>
> >> On Sun, 5 Apr 2020 at 00:40, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>
> >>> I wrote a blog post a couple of years ago about this:
> >>>
> >>> https://wesmckinney.com/blog/bitmaps-vs-sentinel-values/
> >>>
> >>> Pasha Stetsenko did a follow-up analysis that showed that my "sentinel" code could be significantly improved, see:
> >>>
> >>> https://github.com/st-pasha/microbench-nas/blob/master/README.md
> >>>
> >>> Generally speaking, in Apache Arrow we've been happy to have a uniform representation of nullness across all types, both primitive (booleans, numbers, or strings) and nested (lists, structs, unions, etc.). Many computational operations (like elementwise functions) need not concern themselves with the nulls at all, for example, since the bitmap from the input array can be passed along (with zero copy even) to the output array.
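A minimal sketch of that bitmap pass-through, using pyarrow's buffer-level API (the `negate` function here is illustrative, not an Arrow kernel): an elementwise operation recomputes only the values buffer and attaches the input's validity buffer to the output unchanged.

    import numpy as np
    import pyarrow as pa

    def negate(a: pa.Array) -> pa.Array:
        # Elementwise negation for float64 arrays; buffers() returns
        # [validity bitmap (may be None when there are no nulls), values].
        validity, data = a.buffers()
        values = np.frombuffer(data, dtype=np.float64, count=len(a))
        out_values = pa.py_buffer((-values).tobytes())
        # The validity buffer object is shared with the input: zero copy.
        return pa.Array.from_buffers(a.type, len(a), [validity, out_values],
                                     null_count=a.null_count)

    a = pa.array([1.0, None, 3.0], type=pa.float64())
    print(negate(a))  # [-1, null, -3]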
> >>> On Sat, Apr 4, 2020 at 4:39 PM Felix Benning <felix.benn...@gmail.com> wrote:
> >>> >
> >>> > Does anyone have an opinion (or links) about bitpatterns vs. masked arrays for NA implementations? There seems to have been a discussion about that in the numpy community in 2012, https://numpy.org/neps/nep-0026-missing-data-summary.html, without an apparent result.
> >>> >
> >>> > Summary of the summary:
> >>> > - The bitpattern approach reserves one bit pattern of each type as NA; the only type without spare bit patterns is integers, which means their range decreases by one. This approach is taken by R and was regarded as more performant in 2012.
> >>> > - The mask approach was deemed more flexible, since it would allow "degrees of missingness", and also a cleaner/easier implementation.
> >>> >
> >>> > Since bitpattern checks would probably disrupt SIMD, I feel like some calculations (e.g. mean) would actually benefit more from setting NA values to zero, proceeding as if they were not there, and using the number of NAs in the metadata to adjust the result. This of course does not work if two columns are used (e.g. a scalar product), which is probably more important.
> >>> >
> >>> > Was using bitmasks in Arrow a conscious performance decision? Or was the decision only based on the fact that R and bitpattern implementations in general are a niche, which means that bitmasks are more compatible with other languages?
> >>> >
> >>> > I am curious about this topic, since the "lack of proper NA support" was cited as the reason why Python would never replace R in statistics.
> >>> >
> >>> > Thanks,
> >>> >
> >>> > Felix
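The two representations being compared can be sketched in a few lines of numpy. The INT_MIN sentinel below matches R's NA_integer_; everything else (data, helper names) is purely illustrative, not how R or Arrow are implemented.

    import numpy as np

    # Two ways to represent the same nullable int32 column [7, NA, 5, NA].

    # Bitpattern / sentinel (R-style): one reserved value of the type means NA.
    INT32_NA = np.int32(np.iinfo(np.int32).min)       # R's NA_integer_ is INT_MIN
    sentinel_col = np.array([7, INT32_NA, 5, INT32_NA], dtype=np.int32)
    valid_from_sentinel = sentinel_col != INT32_NA    # null check reads the values

    # Separate validity bitmap (Arrow-style): values keep their full range,
    # and one bit per slot says whether it is valid.
    values = np.array([7, 0, 5, 0], dtype=np.int32)
    validity_bitmap = np.packbits([1, 0, 1, 0], bitorder="little")
    valid_from_bitmap = np.unpackbits(validity_bitmap, count=4,
                                      bitorder="little").astype(bool)

    assert (valid_from_sentinel == valid_from_bitmap).all()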
> >>> > On 31.03.20 14:52, Joris Van den Bossche wrote:
> >>> >
> >>> > Note that pandas is starting to use a notion of "masked arrays" as well, for example for its nullable integer data type, but it is also not using the np.ma masked array; it uses a custom implementation (for technical reasons this was easier in pandas).
> >>> >
> >>> > Also, there was quite some discussion last year in numpy about a possible re-implementation of MaskedArray, but using numpy's protocols (`__array_ufunc__`, `__array_function__`, etc.) instead of being a subclass like np.ma is now. See e.g. https://mail.python.org/pipermail/numpy-discussion/2019-June/079681.html.
> >>> >
> >>> > Joris
> >>> >
> >>> > On Mon, 30 Mar 2020 at 18:57, Daniel Nugent <nug...@gmail.com> wrote:
> >>> >>
> >>> >> Ok. That actually aligns closely with what I'm familiar with. Good to know.
> >>> >>
> >>> >> Thanks again for taking the time to respond,
> >>> >>
> >>> >> -Dan Nugent
> >>> >>
> >>> >> On Mon, Mar 30, 2020 at 12:38 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>> >>>
> >>> >>> Social and technical reasons, I guess. Empirically it's just not used much.
> >>> >>>
> >>> >>> You can see my comments about numpy.ma in my 2010 paper about pandas:
> >>> >>>
> >>> >>> https://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf
> >>> >>>
> >>> >>> At least in 2010, there were notable performance problems when using MaskedArray for computations:
> >>> >>>
> >>> >>> "We chose to use NaN as opposed to using NumPy MaskedArrays for performance reasons (which are beyond the scope of this paper), as NaN propagates in floating-point operations in a natural way and can be easily detected in algorithms."
> >>> >>>
> >>> >>> On Mon, Mar 30, 2020 at 11:20 AM Daniel Nugent <nug...@gmail.com> wrote:
> >>> >>> >
> >>> >>> > Thanks! Since I'm just using it to jump to Arrow, I think I'll stick with it.
> >>> >>> >
> >>> >>> > Do you have any feelings about why numpy's masked arrays didn't gain favor, when many data representation formats explicitly support nullity (including Arrow)? Is it just that not carrying nulls forward in computations is preferable (that is, early filtering/value filling was easier)?
> >>> >>> >
> >>> >>> > -Dan Nugent
> >>> >>> >
> >>> >>> > On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >>> >>> >>
> >>> >>> >> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent <nug...@gmail.com> wrote:
> >>> >>> >> >
> >>> >>> >> > Didn't want to follow up on this on the Jira issue earlier since it's sort of tangential to that bug and more of a usage question. You said:
> >>> >>> >> >
> >>> >>> >> > > I wouldn't recommend building applications based on them nowadays since the level of support / compatibility in other projects is low.
> >>> >>> >> >
> >>> >>> >> > In my case, I am using them since they seemed like a straightforward representation of my data that has nulls, the format I'm converting from has zero-cost numpy representations, and converting from an internal format into Arrow in-memory structures appears zero cost (or close to it) as well. I guess I can just provide the mask as an explicit argument, but my original desire to use it came from being able to exploit numpy.ma.concatenate in a way that saved some complexity in the implementation.
> >>> >>> >> >
> >>> >>> >> > Since Arrow itself supports masking values with a bitfield, is there something intrinsic to the notion of array masks that is not well supported? Or do you just mean the specific numpy MaskedArray class?
> >>> >>> >>
> >>> >>> >> I mean just the numpy.ma module. Not many Python computing projects nowadays treat MaskedArray objects as first-class citizens. Depending on what you need, it may or may not be a problem. pyarrow supports ingesting from MaskedArray as a convenience, but in my experience it would not be common for a library's APIs to return MaskedArrays.
> >>> >>> >>
> >>> >>> >> > If this is too much of a numpy question rather than an Arrow question, could you point me to where I can read up on masked array support, or maybe the right place to ask the numpy community whether what I'm doing is appropriate or not?
> >>> >>> >> >
> >>> >>> >> > Thanks,
> >>> >>> >> >
> >>> >>> >> > -Dan Nugent
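For reference, the two ingestion paths discussed above (the MaskedArray convenience Wes mentions and the explicit mask argument Dan considers) might look as follows with a recent pyarrow; the data is made up for illustration.

    import numpy as np
    import numpy.ma as ma
    import pyarrow as pa

    data = np.array([1.5, 2.5, 3.5])
    mask = np.array([False, True, False])   # True marks a missing value

    # Convenience path: pyarrow ingests numpy MaskedArrays directly.
    from_masked = pa.array(ma.masked_array(data, mask=mask))

    # Explicit path: pass the mask as an argument instead of using numpy.ma.
    from_explicit = pa.array(data, mask=mask)

    assert from_masked.equals(from_explicit)   # both give [1.5, null, 3.5]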