from:"R, Rakshit"

RE: SIMD optimization for list_sort

2025-02-17 Thread R, Rakshit

Hi All,

Thank you very much for your feedback! We investigated the recommendations as 
we developed the current patch. Please refer to the comments below.

> I'd suggest targeting pg_qsort() directly, instead of list_sort(). 
> list_sort() is not used in very performance critical parts.
Using x86-simd-sort would require a new pg_qsort variant taking an extractor 
function. The new API would still have to bubble up to functions like 
list_sort, which internally call pg_qsort. We targeted list_sort first, as we 
were aware of at least one use case that benefits from SIMD sorting (pgvector). 
As we target more use cases (such as tuple sort), we may need to move the 
SIMD-enabled implementation to a common location, and a new variant of pg_qsort 
may be a good candidate.

> I'd suggest proposing this to the pgvector project directly, since pgvector 
> would immediately benefit.
Yes, we’ll share the performance gains on HNSW index build time with the 
pgvector community. However, other users of list_sort (e.g., other extensions) 
may benefit from the simd-sort version of list_sort as well. As it is not a 
pgvector-specific optimization, we propose it for PostgreSQL.

> If you could use this to speed up tuple sorting, that would be much more 
> interesting for PostgreSQL itself.
Enabling x86-simd-sort for tuple sorting is in development. We observed 
promising results, but there is still work to do (e.g., to ensure there are no 
regressions in different conditions). We are planning to share another patch 
for tuple sort building on this one.

> I continually see performance reports when sorting and group order impact 
> performance. Because of that, efforts have been made to optimise sorts with 
> recombination of clause order. It would be nice to review your Sort operator 
> optimisation. I hope, it might improve performance of the cases provided in 
> these threads.
The submitted patch gives more details on the optimization.

Based on the previously shared proposal, we’re providing a patch for enabling 
SIMD-based sort for list_sort using the x86-simd-sort library 
(https://github.com/intel/x86-simd-sort). 
As described earlier, we add a new function that takes an extractor function.

void list_sort_simd_float(List *list, list_sort_extractor_float extract, 
list_sort_comparator cmp); 

The extract function retrieves the float values based on which the ListCells 
are sorted by x86-simd-sort. The comparator function is used to break ties. To 
break the ties, we call the existing list_sort function, passing the 
comparator. 
The dependency on x86-simd-sort is optional through a --with-simdsort flag. If 
Postgres is not built with x86-simd-sort, the above function falls back to the 
existing list_sort.
The current patch provides a SIMD implementation for sorting float values. 
Support for additional data types (such as int32) will be added in the future.
Note that there are a few unexpected changes in the generated configure file. 
We will investigate.

An extension is also provided to stress test list_sort (under 
/src/test/modules/test_listsort). The extension defines a “test_listsort” 
function that takes 4 parameters. The first is the size of the list to be 
generated and sorted. The second and the third parameters (“random_number” and 
“tie_break_limit”) are used to introduce repeats (ties) to better simulate 
real-world data. The fourth parameter is a boolean flag to select the 
SIMD-enabled version or the existing list_sort.

To measure performance, we use pgbench to run the function repeatedly. In the 
first test below, a list of size 100K is generated. To populate each cell, a 
random float f is generated. If (f % 500) <= 100 (1/5 of the cells), the cell 
is assigned a number between 0 and 100 (which will cause repeats). The rest of 
the cells are assigned the original random float (for which repeats are 
unlikely). In the second example, tie_break_limit is increased to generate more 
repeats.
The data was collected on an AWS m7i.metal-24xl instance (Ubuntu 22.04, gcc12.3)

tps gains with SIMD-enabled list_sort over existing list_sort:
size=100k, max_random=500, tie_break_limit=100: 2.63x (1 client), 2.15x (48 
clients)
size=100k, max_random=500, tie_break_limit=250: 2.09x (1 client), 1.75x (48 
clients)

The gains are very promising, and we will characterize more conditions. Please 
let us know your thoughts.

Thanks!

Best regards,
Rakshit Rajeev

Best regards,
-Original Message-
From: Andrei Lepikhov  
Sent: Saturday, November 23, 2024 8:50 AM
To: Giacchino, Luca ; 
pgsql-hackers@lists.postgresql.org
Cc: R, Rakshit ; Shankaran, Akash 
; Devulapalli, Raghuveer 

Subject: Re: SIMD optimization for list_sort

On 22/11/2024 06:27, Giacchino, Luca wrote:
> We’d appreciate feedback on this approach. In the meantime, we will 
> complete the patch to share. We also plan to extend SIMD-based sort to 
> tuple sort in the future.
Nice! I con

RE: SIMD optimization for list_sort

2025-02-27 Thread R, Rakshit

Hi All,

Thank you so much for the feedback. 

> I don't think "another extension might use it someday" makes a very strong 
> case, 
> particularly for something that requires a new dependency.

The x86-simdsort library is an optional dependency in Postgres. Also the new 
list sort implementation which uses the x86-simdsort library does not impact 
any of the existing workflows in Postgres. We have at least one use case where 
Pgvector gets a gain of 7 - 10 % in HNSW index build time using the SIMD sort 
implementation. Hence we feel that this patch could be up streamed further. 
Open to further discussion on the same.


> Note that our current implemention is highly optimized for low-cardinality 
> inputs.
> This is needed for aggregate queries. I found this write-up of a 
> couple scalar and vectorized sorts, and they show this library doing 
> very poorly on very-low cardinality inputs. I would look into that 
> before trying to get something in shape to share.
> 
> https://github.com/Voultapher/sort-research-
> rs/blob/main/writeup/intel_avx512/text.md

We ran our extension to stress list sort with low cardinality inputs. For eg, 
for an array of size 100k having repeated values in the range 1-10 we still see 
a gain of around 20% in throughput.
We will collect more data for low cardinality inputs and with AVX2 too.

Thanks and regards
Rakshit Rajeev

Proposal for optimizations with simd enabled sort

2025-03-28 Thread R, Rakshit

Hi All,
We would like to share a proposal for optimizations with simd enabled sort.

Sorting is a common task in Postgres, for example, for sorting tuples (e.g., 
for queries requiring ordered, grouped, or unique rows) or arrays of elements 
(e.g., list_sort). We propose integrating SIMD-based sorting to improve 
performance of these use cases on x86 platforms. With SIMD sort, we observe 
performance gains in different scenarios, such as ORDER BY queries and pgvector 
workloads (see results below).

SIMD-enabled sort functions are provided by the x86-simd-sort library [1]. The 
library includes AVX-512 and AVX2 accelerated sort functions, and it is already 
enabled in several ecosystems, such as numpy (10-17x speed-up [2]), OpenJDK 
[3], libgrape (graph processing library), and PyTorch [4].

This approach introduces x86-simd-sort as a dependency in Postgres. However, 
the dependency is optional, controlled by a configuration option (such as 
--with-simdsort). If building without the dependency, the SIMD sort functions 
will not be included and the existing functions will be used.


Quicksort

Postgres provides a quicksort implementation (qsort) with variants supporting 
different data types generated by a template (sort_template.h). Tuple sort and 
list sort rely on these qsort functions. Therefore, introducing SIMD 
implementations of qsort will benefit them.

The existing qsort uses a comparator function. On the other hand, x86-simd-sort 
requires an array of numeric values to sort (integer or floating point). To 
generate that array, we need a function to "extract" a numeric value from each 
of the items being sorted (SortTuple, ListCell...). The array will be in memory 
and requires memory allocation. A comparator can still be passed to be used as 
tie-breaker (for additional sorting criteria).

New qsort variants will take an extract function as an additional parameter. 
For example, a new variant of pg_qsort will be:

void pg_qsort_simd_float(void *base, size_t nel, size_t elsize, int (*cmp) 
(const void *, const void *), float(*extract) (const void *));

Additional variants will cover different return types for the extract function 
(double and 16/32/64-bit signed/unsigned integers). These functions will exist 
alongside the existing pg_qsort (which is unmodified).

We plan to generate SIMD-enabled qsort functions by extending the existing sort 
template with additional macros (to define the extract function to use). The 
template will then generate both a quicksort implementation (unmodified) and a 
SIMD-enabled one.


Tuple Sort

- SIMD sort is supported for integer and floating point columns.
- Extract functions are provided to extract values from each SortTuple (similar 
to how the type-specific comparator functions are provided). The relevant 
extract function is passed to the sort template to generate the corresponding 
SIMD sort function.
- Logic in tuplesort_sort_memtuples controls whether to call the regular qsort 
or the SIMD version. In general, we observe that the benefit of SIMD sorting 
decreases with higher tuple count and lower cardinality (see data below). Tuple 
count is easily available to this logic, but estimating cardinality requires 
more investigation (for example, using statistics at planning time or query 
execution time).
- For sorting criteria involving multiple columns, we currently use SIMD 
sorting for the first column, and we rely on the existing quicksort for other 
columns.


List Sort

list_sort sorts an array of ListCells using qsort with a comparator. Following 
a similar approach, we define additional variants that take an extract 
function. For example

typedef float (*list_sort_extractor_float)(const ListCell *a);
void list_sort_simd_float(List *list, list_sort_extractor_float extract, 
list_sort_comparator cmp);

Additional variants will cover different return types for the extract function 
(double and 16/32/64-bit signed/unsigned integers). These functions will exist 
alongside the existing list_sort (unmodified), and they will call SIMD versions 
of qsort. Existing list_sort use cases can be converted to list_sort_simd 
functions where beneficial in terms of performance.


Results

By enabling x86-simd-sort, we observe the following in our preliminary 
measurements. All data below uses x86-simd-sort with AVX-512. We will share 
AVX2 data as well.

Tuple sort
ORDER BY queries with randomly-generated data, with varying row count and 
cardinality (controlled by number of distinct values), single-column filter.
Collected on AWS m7i.metal-24xl, Ubuntu 24.04, gcc13
100k rows, 100k distinct values: +13.4%
100k rows,  10k distinct values:  +9.8%
100k rows, 100  distinct values:  +3.0%
100k rows,  10  distinct values:  -1.9%
  1M rows,   1M distinct values:  +6.5%
  1M rows,  10k distinct values:  +1.6%
  1M rows, 100  distinct values:  -2.9%
  1M rows,  10  distinct values:  -5.1%
As mentioned earlier, we observe the benefit of SIMD sorting to decrease with 
highe

RE: SIMD optimization for list_sort

RE: SIMD optimization for list_sort

Proposal for optimizations with simd enabled sort

3 matches

Site Navigation

Mail list logo

Footer information