I am personally fine with removing the compute dunder methods again (i.e.
Array.__richcmp__), if that resolves the ambiguity. Although they *are*
convenient IMO, even for developers (question might also come up if we want
to add __add__, __sub__ etc, though). So it could also be an option to say
that for "data structure equality", one should use the ".equals(..)"
method, and not rely on __eq__.

If we use __eq__ for data structure equality, there is still the question
of what "null == null" should return: True or False? Although somewhat
counter-intuitive, it should probably then return True? (given that
"pa.array([null]) == pa.array([null])" would give True, and I suppose the
C++ Equals method will also return True).

On Wed, 1 Jul 2020 at 16:30, Wes McKinney <wesmck...@gmail.com> wrote:

> I think we need to have a hard separation between "data structure equality"
> (do these objects contain equivalent data) and "analytical/semantic
> equality". The latter is more the domain of pyarrow.compute and I am not
> sure we should be overloading dunder methods with compute functions. I
> might recommend actually that we remove the compute functions from
> Array.__richcmp__ also
>
> Keep in mind that the pyarrow data structures **are not for end users**.
> They are intended for developer use and so analytical conveniences should
> not outweigh consistency for use by library developers.
>
> On Wed, Jul 1, 2020, 9:03 AM Maarten Breddels <maartenbredd...@gmail.com>
> wrote:
>
> > I think that if __eq__ does not return True/False exclusively, __bool__
> > should raise an exception to avoid these unexpected truthies. Python
> users
> > are used to that due to Numpy.
> >
> >
> > Op wo 1 jul. 2020 om 15:40 schreef Joris Van den Bossche <
> > jorisvandenboss...@gmail.com>:
> >
> > > On Wed, 1 Jul 2020 at 09:46, Antoine Pitrou <anto...@python.org>
> wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > Recent changes to PyArrow seem to have taken the stance that
> comparing
> > > > null values should return null.
> > >
> > >
> > > Small note that it is not a *very* recent change (
> > > https://github.com/apache/arrow/pull/5330, ARROW-6488
> > > <https://issues.apache.org/jira/browse/ARROW-6488>), at least for
> > scalars
> > > (I am supposing you are talking about scalars in specific? Or also
> about
> > > array equality?).
> > >
> > >
> > > > The problem is that it breaks the
> > > > expectation that comparisons should return booleans, and perculates
> > into
> > > > crazy behaviour in other places.
> > >
> > >
> > > It's certainly true that not returning a boolean from __eq__ gives a
> > whole
> > > set of surprising/strange behaviour, but doing so would also introduce
> an
> > > inconsistency with array equality (where nulls propagate in
> comparisons).
> > > I think this might boil down to different "expectations" about what
> > __eq__
> > > should do or is used for:
> > >
> > > 1) Be the scalar equivalent of element-wise array equality
> > (Array.__eq__).
> > > Since right now nulls propagate in comparisons, this would lead to
> > > Scalar.__eq__ also to return null (and not True/False) if one of the
> > > operands is null. The null propagation of array comparisons can also be
> > > discussed of course.
> > > 2) Be a general equality check of the two objects (matching the
> > > ".equals(..)" method). For example, Array.equals considers nulls at the
> > > same location in the array as equal.
> > > (however, for this second case it's actually not fully clear if nulls
> > > evaluate to True or False ..)
> > >
> > >  Some other inline comments about the the examples:
> > >
> > > Here is an example of such
> > > > misbehaviour in the scalar refactor PR:
> > > >
> > > > >>> import pyarrow as pa
> > > > >>> na = pa.scalar(None)
> > > > >>> na == na
> > > > <pyarrow.NullScalar: None>
> > > >
> > > > >>> na == 5
> > > > <pyarrow.NullScalar: None>
> > > >
> > > > >>> bool(na == 5)
> > > > True
> > > >
> > >
> > > This could also be changed to raise an error instead. And that way, the
> > > next example would also raise (requiring the user to be specific on how
> > to
> > > handle a scalar null: is it truthy or falsey?)
> > >
> > >
> > > > >>> if na == 5: print("yo!")
> > > > yo!
> > > >
> > > > >>> na in [5]
> > > > True
> > > >
> > > > But you can see it also with arrays containing null values:
> > > >
> > > > >>> pa.array([1, None]) in [pa.scalar(42)]
> > > > True
> > > >
> > > > Note that this one doesn't follow from the scalar behaviour (the same
> > can
> > > be seen on master), but from the Array equality returning a boolean
> array
> > > instead of a single boolean value (I changed this recently in
> > > https://github.com/apache/arrow/pull/7273, also up for discussion of
> > > course).
> > > This is because we let any array evaluate to True:
> > >
> > > >>> bool(pa.array([False]))
> > > True
> > >
> > > which we might want to change to raising an error similarly as numpy
> does
> > > if we keep the element-wise __eq__.
> > >
> > > I think that Python equality operators should behave in a
> > > > Python-sensible way (return True or False).  Have people call another
> > > > method if they like the fancy (or noxious, depending on the POV)
> > > > semantics of returning null when comparing null with anything.
> > > >
> > >
> > > It would certainly be an option to reserve __eq__ to general object
> > > equality (always True/False), and have users call an explicit compute
> > > function for one of the element-wise comparison operations (eg
> > > pyarrow.compute.equals).
> > >
> > >
> > > >
> > > > (note that Numpy doesn't have null scalars, so it can be less
> > > > conservative in its customization of equality methods)
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > >
> >
>

Reply via email to