[ https://issues.apache.org/jira/browse/ARROW-10899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282886#comment-17282886 ]
Michal Nowakiewicz edited comment on ARROW-10899 at 2/11/21, 7:10 AM: ---------------------------------------------------------------------- I don't have any good advice on open source radix sort, but let me share some general thoughts about radix sort implementation: # Regarding SIMD. It seems to me that unstable radix sort should be relatively simple to implement using AVX-512 instruction set (AVX2 is a bit more challenging, since it is missing for instance scatter instruction for memory stores). Stable radix sort is a bit harder with SIMD but still doable. Is it worth it? I believe that the majority of the cost in radix sort is in the data movement (memory stores to non-consecutive addresses), especially if the data does not fit into CPU cache. SIMD will not help with that. But it will cut the cost of everything else. I would expect visible improvement but nothing close to 2x speedup. It's probably something better left for later. # Regarding papers. [http://www.cs.columbia.edu/~orestis/sigmod14I.pdf] - This paper may be a bit too much - it focuses on SIMD, parallel processing, optimizing for multiple NUMA nodes and it goes into a lot of depth and touches some other topics as well - but at the time it was published it looked to me like a pretty good implementation. There may be something interesting in their results and conclusions, even though they tested their solution on much bigger data sets. # Regarding MSB (most significant bits first) vs LSB (least significant bits first) radix sort. MSB should be faster on larger data (compared to the size of CPU cache), more cache-friendly, and can be implemented as a stable sort. # Regarding hybrid sorting. I imagine that a good solution would be to run MSB radix sort at first, producing a set of data partitions based on the highest bits of the sort key. Once the partitions get small enough (e.g. fit L1 cache) I would guess that radix sort is no longer the best solution for sorting data (it should get more advantage over comparison based sorting when the input set is large). At that point a different sorting algorithm could be better suited, so any sort_of_choice can be called on the data within radix partition. If radix sort is faster on small arrays (thousands of elements) - great - but if it isn't, then using another sort method for small partitions seems like a good idea. # Regarding data distribution. I think that data distribution is pretty important in designing and testing radix sort. For instance, if we sort 1 million 64-bit uniformly distributed integers, even though the space of potential values is huge, there are at most 1 million unique values. If I had to sort 1M of 64-bit integers I would consider doing it using range partitioning: a) use a small sample of data (e.g. every Nth value for a double digit N) to compute a histogram (sort the sample, using any kind of sort, divide into K equal size buckets after sorting for range partitioning later into K ranges); b) use range partitioning based on the histogram to split data into multiple ranges; c) continue the two previous steps recursively until the partition size is small and then call a sort_of_choice function on that partition. Histogram based range partitioning should take care of data skew and create approximately equal sized partitions. That means that for 1M values, in order to get down to 1K elements partitions, about 1K partitions need to be generated. That can be done for instance as a single pass of range partitioning with 1024 partitions or two passes with 32 partitions each. Even though a single round of range partitioning (bucket sort) is obviously more expensive than pure radix sort pass (extra histogram and finding a range for each value given range boundaries) it seems to me like a good candidate, since it lowers the total number of partitioning steps that are needed (compared to pure radix sort of 64-bit integers that would use something like 8 passes of 8-bit radix partitioning phases). And SIMD can definitely help performance-wise in finding the right bucket for each value using range boundaries. was (Author: michalno): I don't have any good advice on open source radix sort, but let me share some general thoughts about radix sort implementation: # Regarding SIMD. It seems to me that unstable radix sort should be relatively simple to implement using AVX-512 instruction set (AVX2 is a bit more challenging, since it is missing for instance scatter instruction for memory stores). Stable radix sort is a bit harder with SIMD but still doable. Is it worth it? I believe that the majority of the cost in radix sort is in the data movement (memory stores to non-consecutive addresses), especially if the data does not fit into CPU cache. SIMD will not help with that. But it will cut the cost of everything else. I would expect visible improvement but nothing close to 2x speedup. It's probably something better left for later. # Regarding papers. [http://www.cs.columbia.edu/~orestis/sigmod14I.pdf] - This paper may be a bit too much - it focuses on SIMD, parallel processing, optimizing for multiple NUMA nodes and it goes into a lot of depth and touches some other topics as well - but at the time it was published it looked to me like a pretty good implementation. There may be something interesting in their results and conclusions, even though they tested their solution on much bigger data sets. # Regarding MSB (most significant bits first) vs LSB (least significant bits first) radix sort. MSB should be faster on larger data (compared to the size of CPU cache), more cache-friendly, and can be implemented as a stable sort. # Regarding hybrid sorting. I imagine that a good solution would be to run MSB radix sort at first, producing a set of data partitions based on the highest bits of the sort key. Once the partitions get small enough (e.g. fit L1 cache) I would guess that radix sort is no longer the best solution for sorting data (it should get more advantage over comparison based sorting when the input set is large). At that point a different sorting algorithm could be better suited, so any sort_of_choice can be called on the data within radix partition. If radix sort is faster on small arrays (thousands of elements) - great - but if it isn't, then using another sort method for small partitions seems like a good idea. # Regarding data distribution. I think that data distribution is pretty important in designing and testing radix sort. For instance, if we sort 1 million 64-bit uniformly distributed integers, even though the space of potential values is huge, there are at most 1 million unique values. If I had to sort 1M of 64-bit integers I would consider doing it using range partitioning: a) use a small sample of data (e.g. every Nth value for a double digit N) to compute a histogram (sort the sample, using any kind of sort, divide into K equal size buckets after sorting for range partitioning later into K ranges); b) use range partitioning based on the histogram to split data into multiple ranges; c) continue the two previous steps recursively until the partition size is small and then call a sort_of_choice function on that partition. Histogram based range partitioning should take care of data skew and create approximately equal sized partitions. That means that for 1M values, in order to get down to 1K elements partitions, about 1K partitions need to be generated. That can be done for instance as a single pass of range partitioning with 1024 partitions or two passes with 32 partitions each. Even though a single round of range partitioning (bucket sort) is obviously more expensive than pure radix sort pass (extra histogram and finding a range for each value given range boundaries) it seems to me like a good candidate, since it lowers the total number of partitioning steps that are needed (compared to pure radix sort of 64-bit integers that would use something like 8 passes of 8-bit radix partitioning phases). And SIMD can definitely help performance-wise in finding the right bucket for each value using range boundaries. > [C++] Investigate radix sort for integer arrays > ----------------------------------------------- > > Key: ARROW-10899 > URL: https://issues.apache.org/jira/browse/ARROW-10899 > Project: Apache Arrow > Issue Type: Wish > Components: C++ > Reporter: Antoine Pitrou > Assignee: Kirill Lykov > Priority: Major > Attachments: Screen Shot 2021-02-09 at 17.48.13.png, Screen Shot > 2021-02-10 at 10.58.23.png > > > For integer arrays with a non-tiny range of values, we currently use a stable > sort. It may be faster to use a radix sort instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)