Re: Rust/Datafusion sort kernel issues

Matthias Vallentin Wed, 02 Sep 2020 02:41:57 -0700

Would it perhaps make sense to define the total order for non-numbers (NaN,
Inf, -Inf) globally (i.e., in the spec or in Arrow itself) so that the
behavior is the same across all languages?


On Fri, Aug 28, 2020 at 7:42 PM Andy Grove <andygrov...@gmail.com> wrote:

> Hi Jörn,
>
> I agree with your concerns about NaN. There was a discussion about this in
> https://github.com/apache/arrow/pull/7193
>
> I will try and make time this weekend to look at the current implementation
> and your suggestions around DictionaryArray.
>
> Hopefully, other contributors that are more familiar with that code can
> respond as well.
>
> Thanks,
>
> Andy.
>
>
>
> On Fri, Aug 28, 2020 at 8:34 AM Jörn Horstmann <
> joern.horstm...@signavio.com>
> wrote:
>
> > I ran into a few issues with the rust sort kernels and would like to
> > discuss possible solutions.
> >
> > 1. When sorting by multiple columns (lexsort_to_indices) the Float32
> > and Float64 data types are not supported because the implementation
> > relies on the OrdArray trait. This trait is not implemented because
> > f64/f32 only implements PartialOrd. The sort function for a single
> > column (sort_to_indices) has some special logic which looks like it
> > wants to treats NaN the same as null, but I'm also not convinced this
> > is the correct way. For example postgres does the following
> > (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT
> )
> >
> > "In order to allow floating-point values to be sorted and used in
> > tree-based indexes, PostgreSQL treats NaN values as equal, and greater
> > than all non-NaN values."
> >
> > I propose to do the same in an OrdArray impl for
> > Float64Array/Float32Array and then simplifying the sort_to_indices
> > function accordingly.
> >
> > 2. Sorting for dictionary encoded strings. The problem here is that
> > DictionaryArray does not have a generic parameter for the value type
> > so it is not currently possible to only implement OrdArray for string
> > dictionaries. Again for the single column case, the value data type
> > could be checked and a sort could be implemented by looking up each
> > key in the dictionary. An optimization could be to check the is_sorted
> > flag of DictionaryArray (which does not seem to be used really) and
> > then directly sort by the keys. For the general case I see roughly to
> > options
> >
> > - Somehow implement an OrdArray view of the dictionary array. This
> > could be easier if OrdArray did not extend Array but was a completely
> > separate trait.
> > - Change the lexicographic sort impl to not use dynamic calls but
> > instead sort multiple times. So for a query `ORDER BY a, b`, first
> > sort by b and afterwards sort again by a. With a stable sort
> > implementation this should result in the same ordering. I'm curious
> > about the performance, it could avoid dynamic method calls for each
> > comparison, but it would process the indices vector multiple times.
> >
> > Looking forward to any other suggestions or feedback.
> >
> > --
> > Jörn Horstmann | Senior Backend Engineer
> >
> > www.signavio.com
> > Kurfürstenstraße 111, 10787 Berlin, Germany
> >
>

Re: Rust/Datafusion sort kernel issues

Reply via email to