[ https://issues.apache.org/jira/browse/ARROW-10899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282886#comment-17282886 ]

Michal Nowakiewicz edited comment on ARROW-10899 at 2/11/21, 7:10 AM:
----------------------------------------------------------------------

I don't have any good advice on open source radix sort, but let me share some 
general thoughts about radix sort implementation:
 # Regarding SIMD. It seems to me that an unstable radix sort should be relatively 
simple to implement using the AVX-512 instruction set (AVX2 is a bit more 
challenging, since it is missing, for instance, a scatter instruction for memory 
stores). A stable radix sort is a bit harder with SIMD but still doable. Is it 
worth it? I believe that the majority of the cost in radix sort is in data 
movement (memory stores to non-consecutive addresses), especially if the data 
does not fit into the CPU cache. SIMD will not help with that, but it will cut the 
cost of everything else. I would expect a visible improvement but nothing close 
to a 2x speedup. It's probably something better left for later.
 # Regarding papers. [http://www.cs.columbia.edu/~orestis/sigmod14I.pdf] - This 
paper may be a bit too much - it focuses on SIMD, parallel processing, and 
optimizing for multiple NUMA nodes, and it goes into a lot of depth and touches 
on some other topics as well - but at the time it was published it looked to me 
like a pretty good implementation. There may be something interesting in their 
results and conclusions, even though they tested their solution on much bigger 
data sets.
 # Regarding MSB (most significant bits first) vs LSB (least significant bits 
first) radix sort. MSB should be faster on larger data (relative to the size of 
the CPU cache), is more cache-friendly, and can be implemented as a stable sort.
 # Regarding hybrid sorting. I imagine that a good solution would be to run MSB 
radix sort first, producing a set of data partitions based on the highest 
bits of the sort key. Once the partitions get small enough (e.g. they fit in L1 
cache), I would guess that radix sort is no longer the best way to sort the data 
(its advantage over comparison-based sorting should grow with the size of the 
input set). At that point a different sorting algorithm could be better suited, 
so any sort_of_choice can be called on the data within each radix partition. If 
radix sort is faster on small arrays (thousands of elements) - great - but if 
it isn't, then using another sort method for small partitions seems like a good 
idea.
 # Regarding data distribution. I think that data distribution is pretty 
important in designing and testing radix sort. For instance, if we sort 1 
million uniformly distributed 64-bit integers, even though the space of 
potential values is huge, there are at most 1 million unique values. If I had 
to sort 1M 64-bit integers I would consider doing it using range 
partitioning: a) use a small sample of the data (e.g. every Nth value for a 
double-digit N) to compute a histogram (sort the sample, using any kind of 
sort, and divide it into K equal-size buckets that define the boundaries for 
range partitioning into K ranges); b) use range partitioning based on the 
histogram to split the data into multiple ranges; c) continue the two previous 
steps recursively until the partition size is small, and then call a 
sort_of_choice function on that partition. Histogram-based range partitioning 
should take care of data skew and create approximately equal-sized partitions. 
That means that for 1M values, in order to get down to partitions of 1K 
elements, about 1K partitions need to be generated. That can be done, for 
instance, as a single pass of range partitioning with 1024 partitions, or two 
passes with 32 partitions each. Even though a single round of range 
partitioning (bucket sort) is obviously more expensive than a pure radix sort 
pass (it needs the extra histogram, plus a search for the right range for each 
value given the range boundaries), it seems to me like a good candidate, since 
it lowers the total number of partitioning steps that are needed (compared to a 
pure radix sort of 64-bit integers, which would use something like 8 passes of 
8-bit radix partitioning). And SIMD can definitely help performance-wise in 
finding the right bucket for each value using the range boundaries.
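
The MSB-first hybrid from points 3 and 4 could be sketched roughly as below. This is a minimal, single-threaded illustration assuming unsigned 64-bit keys; the 8-bit digit width, the 1024-element cutoff, and std::sort as the sort_of_choice fallback are all illustrative assumptions, not a proposed Arrow implementation:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hybrid MSD (most-significant-digit-first) radix sort sketch:
// partition on 8 bits at a time starting from the top of the key,
// and fall back to a comparison sort once a partition is small.
void msd_radix_sort(uint64_t* data, size_t n, int shift = 56) {
  constexpr size_t kSmall = 1024;  // illustrative cutoff for sort_of_choice
  if (n <= kSmall || shift < 0) {
    std::sort(data, data + n);
    return;
  }
  // Histogram of the current 8-bit digit.
  std::array<size_t, 256> count{};
  for (size_t i = 0; i < n; ++i) {
    ++count[(data[i] >> shift) & 0xFF];
  }
  // Prefix sums give the start offset of each bucket.
  std::array<size_t, 256> bucket_start{}, next{};
  size_t sum = 0;
  for (int b = 0; b < 256; ++b) {
    bucket_start[b] = next[b] = sum;
    sum += count[b];
  }
  // In-place scatter: follow permutation cycles so every element lands
  // in its bucket ("American flag" style, so this pass is not stable).
  for (int b = 0; b < 256; ++b) {
    size_t end = bucket_start[b] + count[b];
    while (next[b] < end) {
      uint64_t v = data[next[b]];
      int dst;
      while ((dst = static_cast<int>((v >> shift) & 0xFF)) != b) {
        std::swap(v, data[next[dst]++]);
      }
      data[next[b]++] = v;
    }
  }
  // Recurse into each bucket on the next 8 bits.
  for (int b = 0; b < 256; ++b) {
    if (count[b] > 1) {
      msd_radix_sort(data + bucket_start[b], count[b], shift - 8);
    }
  }
}
```

Because recursion only ever touches one bucket at a time, partitions shrink toward cache-sized pieces before the comparison sort takes over, which is exactly the hybrid behavior described above.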
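
One pass of the sampled range partitioning in point 5 (steps a and b, with step c reduced to a final per-partition sort rather than recursion) might look something like this sketch. The out-of-place per-range vectors, the fixed sampling stride, and the scalar binary search over splitters are simplifying assumptions; the binary search is the part where SIMD could help:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One pass of histogram-based range partitioning:
// sample every Nth value, sort the sample, derive K-1 splitters at
// equal quantiles, then scatter every value into one of K ranges.
std::vector<std::vector<uint64_t>> range_partition(
    const std::vector<uint64_t>& data, size_t num_parts,
    size_t sample_stride) {
  // a) Sort a small sample and take K-1 equally spaced splitter values;
  //    equal quantiles of the sample absorb data skew.
  std::vector<uint64_t> sample;
  for (size_t i = 0; i < data.size(); i += sample_stride) {
    sample.push_back(data[i]);
  }
  std::sort(sample.begin(), sample.end());
  std::vector<uint64_t> splitters;
  for (size_t k = 1; k < num_parts; ++k) {
    splitters.push_back(sample[k * sample.size() / num_parts]);
  }
  // b) Scatter each value into the range found by binary search over
  //    the splitter boundaries.
  std::vector<std::vector<uint64_t>> parts(num_parts);
  for (uint64_t v : data) {
    size_t p = std::upper_bound(splitters.begin(), splitters.end(), v) -
               splitters.begin();
    parts[p].push_back(v);
  }
  // c) The full scheme would recurse on large partitions; here each
  //    partition is simply finished with a sort of choice.
  for (auto& part : parts) {
    std::sort(part.begin(), part.end());
  }
  return parts;
}
```

Since every value in partition p is strictly below every splitter that bounds partition p+1, concatenating the sorted partitions in order yields the fully sorted output.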



> [C++] Investigate radix sort for integer arrays
> -----------------------------------------------
>
>                 Key: ARROW-10899
>                 URL: https://issues.apache.org/jira/browse/ARROW-10899
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: C++
>            Reporter: Antoine Pitrou
>            Assignee: Kirill Lykov
>            Priority: Major
>         Attachments: Screen Shot 2021-02-09 at 17.48.13.png, Screen Shot 
> 2021-02-10 at 10.58.23.png
>
>
> For integer arrays with a non-tiny range of values, we currently use a stable 
> sort. It may be faster to use a radix sort instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
