I think we need to have a hard separation between "data structure equality"
(do these objects contain equivalent data) and "analytical/semantic
equality". The latter is more the domain of pyarrow.compute and I am not
sure we should be overloading dunder methods with compute functions. I
might recommend actually that we remove the compute functions from
Array.__richcmp__ also

Keep in mind that the pyarrow data structures **are not for end users**.
They are intended for developer use and so analytical conveniences should
not outweigh consistency for use by library developers.

On Wed, Jul 1, 2020, 9:03 AM Maarten Breddels <maartenbredd...@gmail.com>
wrote:

> I think that if __eq__ does not return True/False exclusively, __bool__
> should raise an exception to avoid these unexpected truthies. Python users
> are used to that due to Numpy.
>
>
> Op wo 1 jul. 2020 om 15:40 schreef Joris Van den Bossche <
> jorisvandenboss...@gmail.com>:
>
> > On Wed, 1 Jul 2020 at 09:46, Antoine Pitrou <anto...@python.org> wrote:
> >
> > >
> > > Hello,
> > >
> > > Recent changes to PyArrow seem to have taken the stance that comparing
> > > null values should return null.
> >
> >
> > Small note that it is not a *very* recent change (
> > https://github.com/apache/arrow/pull/5330, ARROW-6488
> > <https://issues.apache.org/jira/browse/ARROW-6488>), at least for
> scalars
> > (I am supposing you are talking about scalars in specific? Or also about
> > array equality?).
> >
> >
> > > The problem is that it breaks the
> > > expectation that comparisons should return booleans, and perculates
> into
> > > crazy behaviour in other places.
> >
> >
> > It's certainly true that not returning a boolean from __eq__ gives a
> whole
> > set of surprising/strange behaviour, but doing so would also introduce an
> > inconsistency with array equality (where nulls propagate in comparisons).
> > I think this might boil down to different "expectations" about what
> __eq__
> > should do or is used for:
> >
> > 1) Be the scalar equivalent of element-wise array equality
> (Array.__eq__).
> > Since right now nulls propagate in comparisons, this would lead to
> > Scalar.__eq__ also to return null (and not True/False) if one of the
> > operands is null. The null propagation of array comparisons can also be
> > discussed of course.
> > 2) Be a general equality check of the two objects (matching the
> > ".equals(..)" method). For example, Array.equals considers nulls at the
> > same location in the array as equal.
> > (however, for this second case it's actually not fully clear if nulls
> > evaluate to True or False ..)
> >
> >  Some other inline comments about the the examples:
> >
> > Here is an example of such
> > > misbehaviour in the scalar refactor PR:
> > >
> > > >>> import pyarrow as pa
> > > >>> na = pa.scalar(None)
> > > >>> na == na
> > > <pyarrow.NullScalar: None>
> > >
> > > >>> na == 5
> > > <pyarrow.NullScalar: None>
> > >
> > > >>> bool(na == 5)
> > > True
> > >
> >
> > This could also be changed to raise an error instead. And that way, the
> > next example would also raise (requiring the user to be specific on how
> to
> > handle a scalar null: is it truthy or falsey?)
> >
> >
> > > >>> if na == 5: print("yo!")
> > > yo!
> > >
> > > >>> na in [5]
> > > True
> > >
> > > But you can see it also with arrays containing null values:
> > >
> > > >>> pa.array([1, None]) in [pa.scalar(42)]
> > > True
> > >
> > > Note that this one doesn't follow from the scalar behaviour (the same
> can
> > be seen on master), but from the Array equality returning a boolean array
> > instead of a single boolean value (I changed this recently in
> > https://github.com/apache/arrow/pull/7273, also up for discussion of
> > course).
> > This is because we let any array evaluate to True:
> >
> > >>> bool(pa.array([False]))
> > True
> >
> > which we might want to change to raising an error similarly as numpy does
> > if we keep the element-wise __eq__.
> >
> > I think that Python equality operators should behave in a
> > > Python-sensible way (return True or False).  Have people call another
> > > method if they like the fancy (or noxious, depending on the POV)
> > > semantics of returning null when comparing null with anything.
> > >
> >
> > It would certainly be an option to reserve __eq__ to general object
> > equality (always True/False), and have users call an explicit compute
> > function for one of the element-wise comparison operations (eg
> > pyarrow.compute.equals).
> >
> >
> > >
> > > (note that Numpy doesn't have null scalars, so it can be less
> > > conservative in its customization of equality methods)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> >
>

Reply via email to