RE: SIMD optimization for list_sort
Hi All, Thank you very much for your feedback! We investigated the recommendations as we developed the current patch. Please refer to the comments below. > I'd suggest targeting pg_qsort() directly, instead of list_sort(). > list_sort() is not used in very performance critical parts. Using x86-simd-sort would require a new pg_qsort variant taking an extractor function. The new API would still have to bubble up to functions like list_sort, which internally call pg_qsort. We targeted list_sort first, as we were aware of at least one use case that benefits from SIMD sorting (pgvector). As we target more use cases (such as tuple sort), we may need to move the SIMD-enabled implementation to a common location, and a new variant of pg_qsort may be a good candidate. > I'd suggest proposing this to the pgvector project directly, since pgvector > would immediately benefit. Yes, we’ll share the performance gains on HNSW index build time with the pgvector community. However, other users of list_sort (e.g., other extensions) may benefit from the simd-sort version of list_sort as well. As it is not a pgvector-specific optimization, we propose it for PostgreSQL. > If you could use this to speed up tuple sorting, that would be much more > interesting for PostgreSQL itself. Enabling x86-simd-sort for tuple sorting is in development. We observed promising results, but there is still work to do (e.g., to ensure there are no regressions in different conditions). We are planning to share another patch for tuple sort building on this one. > I continually see performance reports when sorting and group order impact > performance. Because of that, efforts have been made to optimise sorts with > recombination of clause order. It would be nice to review your Sort operator > optimisation. I hope, it might improve performance of the cases provided in > these threads. The submitted patch gives more details on the optimization. Based on the previously shared proposal, we’re providing a patch for enabling SIMD-based sort for list_sort using the x86-simd-sort library (https://github.com/intel/x86-simd-sort). As described earlier, we add a new function that takes an extractor function. void list_sort_simd_float(List *list, list_sort_extractor_float extract, list_sort_comparator cmp); The extract function retrieves the float values based on which the ListCells are sorted by x86-simd-sort. The comparator function is used to break ties. To break the ties, we call the existing list_sort function, passing the comparator. The dependency on x86-simd-sort is optional through a --with-simdsort flag. If Postgres is not built with x86-simd-sort, the above function falls back to the existing list_sort. The current patch provides a SIMD implementation for sorting float values. Support for additional data types (such as int32) will be added in the future. Note that there are a few unexpected changes in the generated configure file. We will investigate. An extension is also provided to stress test list_sort (under /src/test/modules/test_listsort). The extension defines a “test_listsort” function that takes 4 parameters. The first is the size of the list to be generated and sorted. The second and the third parameters (“random_number” and “tie_break_limit”) are used to introduce repeats (ties) to better simulate real-world data. The fourth parameter is a boolean flag to select the SIMD-enabled version or the existing list_sort. To measure performance, we use pgbench to run the function repeatedly. In the first test below, a list of size 100K is generated. To populate each cell, a random float f is generated. If (f % 500) <= 100 (1/5 of the cells), the cell is assigned a number between 0 and 100 (which will cause repeats). The rest of the cells are assigned the original random float (for which repeats are unlikely). In the second example, tie_break_limit is increased to generate more repeats. The data was collected on an AWS m7i.metal-24xl instance (Ubuntu 22.04, gcc12.3) tps gains with SIMD-enabled list_sort over existing list_sort: size=100k, max_random=500, tie_break_limit=100: 2.63x (1 client), 2.15x (48 clients) size=100k, max_random=500, tie_break_limit=250: 2.09x (1 client), 1.75x (48 clients) The gains are very promising, and we will characterize more conditions. Please let us know your thoughts. Thanks! Best regards, Rakshit Rajeev Best regards, -Original Message- From: Andrei Lepikhov Sent: Saturday, November 23, 2024 8:50 AM To: Giacchino, Luca ; pgsql-hackers@lists.postgresql.org Cc: R, Rakshit ; Shankaran, Akash ; Devulapalli, Raghuveer Subject: Re: SIMD optimization for list_sort On 22/11/2024 06:27, Giacchino, Luca wrote: > We’d appreciate feedback on this approach. In the meantime, we will > complete the patch to share. We also plan to extend SIMD-based sort to > tuple sort in the future. Nice! I con
RE: SIMD optimization for list_sort
Hi All, Thank you so much for the feedback. > I don't think "another extension might use it someday" makes a very strong > case, > particularly for something that requires a new dependency. The x86-simdsort library is an optional dependency in Postgres. Also the new list sort implementation which uses the x86-simdsort library does not impact any of the existing workflows in Postgres. We have at least one use case where Pgvector gets a gain of 7 - 10 % in HNSW index build time using the SIMD sort implementation. Hence we feel that this patch could be up streamed further. Open to further discussion on the same. > Note that our current implemention is highly optimized for low-cardinality > inputs. > This is needed for aggregate queries. I found this write-up of a > couple scalar and vectorized sorts, and they show this library doing > very poorly on very-low cardinality inputs. I would look into that > before trying to get something in shape to share. > > https://github.com/Voultapher/sort-research- > rs/blob/main/writeup/intel_avx512/text.md We ran our extension to stress list sort with low cardinality inputs. For eg, for an array of size 100k having repeated values in the range 1-10 we still see a gain of around 20% in throughput. We will collect more data for low cardinality inputs and with AVX2 too. Thanks and regards Rakshit Rajeev
Proposal for optimizations with simd enabled sort
Hi All, We would like to share a proposal for optimizations with simd enabled sort. Sorting is a common task in Postgres, for example, for sorting tuples (e.g., for queries requiring ordered, grouped, or unique rows) or arrays of elements (e.g., list_sort). We propose integrating SIMD-based sorting to improve performance of these use cases on x86 platforms. With SIMD sort, we observe performance gains in different scenarios, such as ORDER BY queries and pgvector workloads (see results below). SIMD-enabled sort functions are provided by the x86-simd-sort library [1]. The library includes AVX-512 and AVX2 accelerated sort functions, and it is already enabled in several ecosystems, such as numpy (10-17x speed-up [2]), OpenJDK [3], libgrape (graph processing library), and PyTorch [4]. This approach introduces x86-simd-sort as a dependency in Postgres. However, the dependency is optional, controlled by a configuration option (such as --with-simdsort). If building without the dependency, the SIMD sort functions will not be included and the existing functions will be used. Quicksort Postgres provides a quicksort implementation (qsort) with variants supporting different data types generated by a template (sort_template.h). Tuple sort and list sort rely on these qsort functions. Therefore, introducing SIMD implementations of qsort will benefit them. The existing qsort uses a comparator function. On the other hand, x86-simd-sort requires an array of numeric values to sort (integer or floating point). To generate that array, we need a function to "extract" a numeric value from each of the items being sorted (SortTuple, ListCell...). The array will be in memory and requires memory allocation. A comparator can still be passed to be used as tie-breaker (for additional sorting criteria). New qsort variants will take an extract function as an additional parameter. For example, a new variant of pg_qsort will be: void pg_qsort_simd_float(void *base, size_t nel, size_t elsize, int (*cmp) (const void *, const void *), float(*extract) (const void *)); Additional variants will cover different return types for the extract function (double and 16/32/64-bit signed/unsigned integers). These functions will exist alongside the existing pg_qsort (which is unmodified). We plan to generate SIMD-enabled qsort functions by extending the existing sort template with additional macros (to define the extract function to use). The template will then generate both a quicksort implementation (unmodified) and a SIMD-enabled one. Tuple Sort - SIMD sort is supported for integer and floating point columns. - Extract functions are provided to extract values from each SortTuple (similar to how the type-specific comparator functions are provided). The relevant extract function is passed to the sort template to generate the corresponding SIMD sort function. - Logic in tuplesort_sort_memtuples controls whether to call the regular qsort or the SIMD version. In general, we observe that the benefit of SIMD sorting decreases with higher tuple count and lower cardinality (see data below). Tuple count is easily available to this logic, but estimating cardinality requires more investigation (for example, using statistics at planning time or query execution time). - For sorting criteria involving multiple columns, we currently use SIMD sorting for the first column, and we rely on the existing quicksort for other columns. List Sort list_sort sorts an array of ListCells using qsort with a comparator. Following a similar approach, we define additional variants that take an extract function. For example typedef float (*list_sort_extractor_float)(const ListCell *a); void list_sort_simd_float(List *list, list_sort_extractor_float extract, list_sort_comparator cmp); Additional variants will cover different return types for the extract function (double and 16/32/64-bit signed/unsigned integers). These functions will exist alongside the existing list_sort (unmodified), and they will call SIMD versions of qsort. Existing list_sort use cases can be converted to list_sort_simd functions where beneficial in terms of performance. Results By enabling x86-simd-sort, we observe the following in our preliminary measurements. All data below uses x86-simd-sort with AVX-512. We will share AVX2 data as well. Tuple sort ORDER BY queries with randomly-generated data, with varying row count and cardinality (controlled by number of distinct values), single-column filter. Collected on AWS m7i.metal-24xl, Ubuntu 24.04, gcc13 100k rows, 100k distinct values: +13.4% 100k rows, 10k distinct values: +9.8% 100k rows, 100 distinct values: +3.0% 100k rows, 10 distinct values: -1.9% 1M rows, 1M distinct values: +6.5% 1M rows, 10k distinct values: +1.6% 1M rows, 100 distinct values: -2.9% 1M rows, 10 distinct values: -5.1% As mentioned earlier, we observe the benefit of SIMD sorting to decrease with highe