Hi All,
We would like to share a proposal for optimizations with simd enabled sort.

Sorting is a common task in Postgres, for example, for sorting tuples (e.g., 
for queries requiring ordered, grouped, or unique rows) or arrays of elements 
(e.g., list_sort). We propose integrating SIMD-based sorting to improve 
performance of these use cases on x86 platforms. With SIMD sort, we observe 
performance gains in different scenarios, such as ORDER BY queries and pgvector 
workloads (see results below).

SIMD-enabled sort functions are provided by the x86-simd-sort library [1]. The 
library includes AVX-512 and AVX2 accelerated sort functions, and it is already 
enabled in several ecosystems, such as numpy (10-17x speed-up [2]), OpenJDK 
[3], libgrape (graph processing library), and PyTorch [4].

This approach introduces x86-simd-sort as a dependency in Postgres. However, 
the dependency is optional, controlled by a configuration option (such as 
--with-simdsort). If building without the dependency, the SIMD sort functions 
will not be included and the existing functions will be used.


Quicksort

Postgres provides a quicksort implementation (qsort) with variants supporting 
different data types generated by a template (sort_template.h). Tuple sort and 
list sort rely on these qsort functions. Therefore, introducing SIMD 
implementations of qsort will benefit them.

The existing qsort uses a comparator function. On the other hand, x86-simd-sort 
requires an array of numeric values to sort (integer or floating point). To 
generate that array, we need a function to "extract" a numeric value from each 
of the items being sorted (SortTuple, ListCell...). The array will be in memory 
and requires memory allocation. A comparator can still be passed to be used as 
tie-breaker (for additional sorting criteria).

New qsort variants will take an extract function as an additional parameter. 
For example, a new variant of pg_qsort will be:

void pg_qsort_simd_float(void *base, size_t nel, size_t elsize, int (*cmp) 
(const void *, const void *), float(*extract) (const void *));

Additional variants will cover different return types for the extract function 
(double and 16/32/64-bit signed/unsigned integers). These functions will exist 
alongside the existing pg_qsort (which is unmodified).

We plan to generate SIMD-enabled qsort functions by extending the existing sort 
template with additional macros (to define the extract function to use). The 
template will then generate both a quicksort implementation (unmodified) and a 
SIMD-enabled one.


Tuple Sort

- SIMD sort is supported for integer and floating point columns.
- Extract functions are provided to extract values from each SortTuple (similar 
to how the type-specific comparator functions are provided). The relevant 
extract function is passed to the sort template to generate the corresponding 
SIMD sort function.
- Logic in tuplesort_sort_memtuples controls whether to call the regular qsort 
or the SIMD version. In general, we observe that the benefit of SIMD sorting 
decreases with higher tuple count and lower cardinality (see data below). Tuple 
count is easily available to this logic, but estimating cardinality requires 
more investigation (for example, using statistics at planning time or query 
execution time).
- For sorting criteria involving multiple columns, we currently use SIMD 
sorting for the first column, and we rely on the existing quicksort for other 
columns.


List Sort

list_sort sorts an array of ListCells using qsort with a comparator. Following 
a similar approach, we define additional variants that take an extract 
function. For example

typedef float (*list_sort_extractor_float)(const ListCell *a);
void list_sort_simd_float(List *list, list_sort_extractor_float extract, 
list_sort_comparator cmp);

Additional variants will cover different return types for the extract function 
(double and 16/32/64-bit signed/unsigned integers). These functions will exist 
alongside the existing list_sort (unmodified), and they will call SIMD versions 
of qsort. Existing list_sort use cases can be converted to list_sort_simd 
functions where beneficial in terms of performance.


Results

By enabling x86-simd-sort, we observe the following in our preliminary 
measurements. All data below uses x86-simd-sort with AVX-512. We will share 
AVX2 data as well.

Tuple sort
ORDER BY queries with randomly-generated data, with varying row count and 
cardinality (controlled by number of distinct values), single-column filter.
Collected on AWS m7i.metal-24xl, Ubuntu 24.04, gcc13
100k rows, 100k distinct values: +13.4%
100k rows,  10k distinct values:  +9.8%
100k rows, 100  distinct values:  +3.0%
100k rows,  10  distinct values:  -1.9%
  1M rows,   1M distinct values:  +6.5%
  1M rows,  10k distinct values:  +1.6%
  1M rows, 100  distinct values:  -2.9%
  1M rows,  10  distinct values:  -5.1%
As mentioned earlier, we observe the benefit of SIMD sorting to decrease with 
higher row count and lower cardinality. We are exploring additional 
optimizations to further improve performance (for example, sorting/merging 
smaller runs is showing promising results).

List sort (pgvector)
pgvector uses list_sort to sort candidate vectors by distance.
Using list_sort_simd_float, we observed 7% reduction in index build time with 
the gist-960 dataset and 10% with the dbpedia-openai-1000k dataset 
(ANN-Benchmarks, HNSW index with halfvec, m=80).

list sort (microbenchmark)
We developed a microbenchmark to stress list_sort. We observe gains of 
1.75x-2.63x with varying cardinality and clients (refer to [5]). Gains are 
lower but still present at even lower cardinality. For example, we observe 20% 
gain with 100k items and 10 distinct values).


In summary, we observe performance gains with x86-simd-sort in several 
scenarios. There are still open questions to "detect" low-cardinality data and 
route to the best-performing implementation.
Please let us know your thoughts. If this is a viable approach for integrating 
SIMD-based sorting, we will complete and submit a first rev of the patch.


[1] https://github.com/intel/x86-simd-sort
[2] https://github.com/numpy/numpy/pull/22315/
[3] https://github.com/openjdk/jdk/pull/14227
[4] https://github.com/pytorch/pytorch/pull/149362
[5] 
https://www.postgresql.org/message-id/CO1PR11MB506056215E799493EC5F42E8EDFE2%40CO1PR11MB5060.namprd11.prod.outlook.com

Thanks
Rakshit Rajeev

Reply via email to