I think we need to have a hard separation between "data structure equality" (do these objects contain equivalent data) and "analytical/semantic equality". The latter is more the domain of pyarrow.compute and I am not sure we should be overloading dunder methods with compute functions. I might recommend actually that we remove the compute functions from Array.__richcmp__ also
Keep in mind that the pyarrow data structures **are not for end users**. They are intended for developer use and so analytical conveniences should not outweigh consistency for use by library developers. On Wed, Jul 1, 2020, 9:03 AM Maarten Breddels <maartenbredd...@gmail.com> wrote: > I think that if __eq__ does not return True/False exclusively, __bool__ > should raise an exception to avoid these unexpected truthies. Python users > are used to that due to Numpy. > > > Op wo 1 jul. 2020 om 15:40 schreef Joris Van den Bossche < > jorisvandenboss...@gmail.com>: > > > On Wed, 1 Jul 2020 at 09:46, Antoine Pitrou <anto...@python.org> wrote: > > > > > > > > Hello, > > > > > > Recent changes to PyArrow seem to have taken the stance that comparing > > > null values should return null. > > > > > > Small note that it is not a *very* recent change ( > > https://github.com/apache/arrow/pull/5330, ARROW-6488 > > <https://issues.apache.org/jira/browse/ARROW-6488>), at least for > scalars > > (I am supposing you are talking about scalars in specific? Or also about > > array equality?). > > > > > > > The problem is that it breaks the > > > expectation that comparisons should return booleans, and perculates > into > > > crazy behaviour in other places. > > > > > > It's certainly true that not returning a boolean from __eq__ gives a > whole > > set of surprising/strange behaviour, but doing so would also introduce an > > inconsistency with array equality (where nulls propagate in comparisons). > > I think this might boil down to different "expectations" about what > __eq__ > > should do or is used for: > > > > 1) Be the scalar equivalent of element-wise array equality > (Array.__eq__). > > Since right now nulls propagate in comparisons, this would lead to > > Scalar.__eq__ also to return null (and not True/False) if one of the > > operands is null. The null propagation of array comparisons can also be > > discussed of course. > > 2) Be a general equality check of the two objects (matching the > > ".equals(..)" method). For example, Array.equals considers nulls at the > > same location in the array as equal. > > (however, for this second case it's actually not fully clear if nulls > > evaluate to True or False ..) > > > > Some other inline comments about the the examples: > > > > Here is an example of such > > > misbehaviour in the scalar refactor PR: > > > > > > >>> import pyarrow as pa > > > >>> na = pa.scalar(None) > > > >>> na == na > > > <pyarrow.NullScalar: None> > > > > > > >>> na == 5 > > > <pyarrow.NullScalar: None> > > > > > > >>> bool(na == 5) > > > True > > > > > > > This could also be changed to raise an error instead. And that way, the > > next example would also raise (requiring the user to be specific on how > to > > handle a scalar null: is it truthy or falsey?) > > > > > > > >>> if na == 5: print("yo!") > > > yo! > > > > > > >>> na in [5] > > > True > > > > > > But you can see it also with arrays containing null values: > > > > > > >>> pa.array([1, None]) in [pa.scalar(42)] > > > True > > > > > > Note that this one doesn't follow from the scalar behaviour (the same > can > > be seen on master), but from the Array equality returning a boolean array > > instead of a single boolean value (I changed this recently in > > https://github.com/apache/arrow/pull/7273, also up for discussion of > > course). > > This is because we let any array evaluate to True: > > > > >>> bool(pa.array([False])) > > True > > > > which we might want to change to raising an error similarly as numpy does > > if we keep the element-wise __eq__. > > > > I think that Python equality operators should behave in a > > > Python-sensible way (return True or False). Have people call another > > > method if they like the fancy (or noxious, depending on the POV) > > > semantics of returning null when comparing null with anything. > > > > > > > It would certainly be an option to reserve __eq__ to general object > > equality (always True/False), and have users call an explicit compute > > function for one of the element-wise comparison operations (eg > > pyarrow.compute.equals). > > > > > > > > > > (note that Numpy doesn't have null scalars, so it can be less > > > conservative in its customization of equality methods) > > > > > > Regards > > > > > > Antoine. > > > > > >