On Wed, 1 Jul 2020 at 09:46, Antoine Pitrou <anto...@python.org> wrote:
> > Hello, > > Recent changes to PyArrow seem to have taken the stance that comparing > null values should return null. Small note that it is not a *very* recent change ( https://github.com/apache/arrow/pull/5330, ARROW-6488 <https://issues.apache.org/jira/browse/ARROW-6488>), at least for scalars (I am supposing you are talking about scalars in specific? Or also about array equality?). > The problem is that it breaks the > expectation that comparisons should return booleans, and perculates into > crazy behaviour in other places. It's certainly true that not returning a boolean from __eq__ gives a whole set of surprising/strange behaviour, but doing so would also introduce an inconsistency with array equality (where nulls propagate in comparisons). I think this might boil down to different "expectations" about what __eq__ should do or is used for: 1) Be the scalar equivalent of element-wise array equality (Array.__eq__). Since right now nulls propagate in comparisons, this would lead to Scalar.__eq__ also to return null (and not True/False) if one of the operands is null. The null propagation of array comparisons can also be discussed of course. 2) Be a general equality check of the two objects (matching the ".equals(..)" method). For example, Array.equals considers nulls at the same location in the array as equal. (however, for this second case it's actually not fully clear if nulls evaluate to True or False ..) Some other inline comments about the the examples: Here is an example of such > misbehaviour in the scalar refactor PR: > > >>> import pyarrow as pa > >>> na = pa.scalar(None) > >>> na == na > <pyarrow.NullScalar: None> > > >>> na == 5 > <pyarrow.NullScalar: None> > > >>> bool(na == 5) > True > This could also be changed to raise an error instead. And that way, the next example would also raise (requiring the user to be specific on how to handle a scalar null: is it truthy or falsey?) > >>> if na == 5: print("yo!") > yo! > > >>> na in [5] > True > > But you can see it also with arrays containing null values: > > >>> pa.array([1, None]) in [pa.scalar(42)] > True > > Note that this one doesn't follow from the scalar behaviour (the same can be seen on master), but from the Array equality returning a boolean array instead of a single boolean value (I changed this recently in https://github.com/apache/arrow/pull/7273, also up for discussion of course). This is because we let any array evaluate to True: >>> bool(pa.array([False])) True which we might want to change to raising an error similarly as numpy does if we keep the element-wise __eq__. I think that Python equality operators should behave in a > Python-sensible way (return True or False). Have people call another > method if they like the fancy (or noxious, depending on the POV) > semantics of returning null when comparing null with anything. > It would certainly be an option to reserve __eq__ to general object equality (always True/False), and have users call an explicit compute function for one of the element-wise comparison operations (eg pyarrow.compute.equals). > > (note that Numpy doesn't have null scalars, so it can be less > conservative in its customization of equality methods) > > Regards > > Antoine. >