Would it perhaps make sense to define the total order for non-numbers (NaN, Inf, -Inf) globally (i.e., in the spec or in Arrow itself) so that the behavior is the same across all languages?
On Fri, Aug 28, 2020 at 7:42 PM Andy Grove <andygrov...@gmail.com> wrote: > Hi Jörn, > > I agree with your concerns about NaN. There was a discussion about this in > https://github.com/apache/arrow/pull/7193 > > I will try and make time this weekend to look at the current implementation > and your suggestions around DictionaryArray. > > Hopefully, other contributors that are more familiar with that code can > respond as well. > > Thanks, > > Andy. > > > > On Fri, Aug 28, 2020 at 8:34 AM Jörn Horstmann < > joern.horstm...@signavio.com> > wrote: > > > I ran into a few issues with the rust sort kernels and would like to > > discuss possible solutions. > > > > 1. When sorting by multiple columns (lexsort_to_indices) the Float32 > > and Float64 data types are not supported because the implementation > > relies on the OrdArray trait. This trait is not implemented because > > f64/f32 only implements PartialOrd. The sort function for a single > > column (sort_to_indices) has some special logic which looks like it > > wants to treats NaN the same as null, but I'm also not convinced this > > is the correct way. For example postgres does the following > > (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT > ) > > > > "In order to allow floating-point values to be sorted and used in > > tree-based indexes, PostgreSQL treats NaN values as equal, and greater > > than all non-NaN values." > > > > I propose to do the same in an OrdArray impl for > > Float64Array/Float32Array and then simplifying the sort_to_indices > > function accordingly. > > > > 2. Sorting for dictionary encoded strings. The problem here is that > > DictionaryArray does not have a generic parameter for the value type > > so it is not currently possible to only implement OrdArray for string > > dictionaries. Again for the single column case, the value data type > > could be checked and a sort could be implemented by looking up each > > key in the dictionary. An optimization could be to check the is_sorted > > flag of DictionaryArray (which does not seem to be used really) and > > then directly sort by the keys. For the general case I see roughly to > > options > > > > - Somehow implement an OrdArray view of the dictionary array. This > > could be easier if OrdArray did not extend Array but was a completely > > separate trait. > > - Change the lexicographic sort impl to not use dynamic calls but > > instead sort multiple times. So for a query `ORDER BY a, b`, first > > sort by b and afterwards sort again by a. With a stable sort > > implementation this should result in the same ordering. I'm curious > > about the performance, it could avoid dynamic method calls for each > > comparison, but it would process the indices vector multiple times. > > > > Looking forward to any other suggestions or feedback. > > > > -- > > Jörn Horstmann | Senior Backend Engineer > > > > www.signavio.com > > Kurfürstenstraße 111, 10787 Berlin, Germany > > >