On Wed, 1 Jul 2020 at 09:46, Antoine Pitrou <anto...@python.org> wrote:

>
> Hello,
>
> Recent changes to PyArrow seem to have taken the stance that comparing
> null values should return null.


Small note that it is not a *very* recent change (
https://github.com/apache/arrow/pull/5330, ARROW-6488
<https://issues.apache.org/jira/browse/ARROW-6488>), at least for scalars
(I am supposing you are talking about scalars in specific? Or also about
array equality?).


> The problem is that it breaks the
> expectation that comparisons should return booleans, and perculates into
> crazy behaviour in other places.


It's certainly true that not returning a boolean from __eq__ gives a whole
set of surprising/strange behaviour, but doing so would also introduce an
inconsistency with array equality (where nulls propagate in comparisons).
I think this might boil down to different "expectations" about what __eq__
should do or is used for:

1) Be the scalar equivalent of element-wise array equality (Array.__eq__).
Since right now nulls propagate in comparisons, this would lead to
Scalar.__eq__ also to return null (and not True/False) if one of the
operands is null. The null propagation of array comparisons can also be
discussed of course.
2) Be a general equality check of the two objects (matching the
".equals(..)" method). For example, Array.equals considers nulls at the
same location in the array as equal.
(however, for this second case it's actually not fully clear if nulls
evaluate to True or False ..)

 Some other inline comments about the the examples:

Here is an example of such
> misbehaviour in the scalar refactor PR:
>
> >>> import pyarrow as pa
> >>> na = pa.scalar(None)
> >>> na == na
> <pyarrow.NullScalar: None>
>
> >>> na == 5
> <pyarrow.NullScalar: None>
>
> >>> bool(na == 5)
> True
>

This could also be changed to raise an error instead. And that way, the
next example would also raise (requiring the user to be specific on how to
handle a scalar null: is it truthy or falsey?)


> >>> if na == 5: print("yo!")
> yo!
>
> >>> na in [5]
> True
>
> But you can see it also with arrays containing null values:
>
> >>> pa.array([1, None]) in [pa.scalar(42)]
> True
>
> Note that this one doesn't follow from the scalar behaviour (the same can
be seen on master), but from the Array equality returning a boolean array
instead of a single boolean value (I changed this recently in
https://github.com/apache/arrow/pull/7273, also up for discussion of
course).
This is because we let any array evaluate to True:

>>> bool(pa.array([False]))
True

which we might want to change to raising an error similarly as numpy does
if we keep the element-wise __eq__.

I think that Python equality operators should behave in a
> Python-sensible way (return True or False).  Have people call another
> method if they like the fancy (or noxious, depending on the POV)
> semantics of returning null when comparing null with anything.
>

It would certainly be an option to reserve __eq__ to general object
equality (always True/False), and have users call an explicit compute
function for one of the element-wise comparison operations (eg
pyarrow.compute.equals).


>
> (note that Numpy doesn't have null scalars, so it can be less
> conservative in its customization of equality methods)
>
> Regards
>
> Antoine.
>

Reply via email to