There is this for loop in mul_var() :

	/*
	 * Add the appropriate multiple of var2 into the accumulator.
	 *
	 * As above, digits of var2 can be ignored if they don't contribute,
	 * so we only include digits for which i1+i2+2 <= res_ndigits - 1.
	 */
	for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2;
		 i2 >= 0; i2--)
		dig[i--] += var1digit * var2digits[i2];
With gcc -O3, the above for loop, if simplified, gets auto-vectorized [1], and this results in speedups for multiplication of PostgreSQL numeric types having large precisions. The speedups start becoming noticeable from around 50 precision onwards. With 50 precision the improvement I saw was 5%, with 60 it was 11%, with 120 it was 50%, with 240 it was 2.2x, and so on. On my arm64 machine, a similar benefit starts showing up from 20 precision onwards.

I used this query from regress/sql/numeric_big.sql :
	SELECT t1.val * t2.val FROM num_data t1, num_data t2
If I use the schema created by numeric_big.sql, the speedup was 2.5x to 2.7x across three machines. Also, the regress/sql/numeric_big test itself speeds up by 80%.

For the for loop to be auto-vectorized, I had to simplify it to something like this :

	i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
	digptr = &dig[i1 + 2];
	for (i = 0; i <= i2; i++)
		digptr[i] += var1digit * var2digits[i];

gcc can also vectorize a backward loop such as this :

	for (i = n - 1; i >= 0; i--)
		a += b[i];

gcc -fopt-info-all gives this info :
numeric.c:7217:3: optimized: loop vectorized using 16 byte vectors

But if the assignment is not as simple as above, it does not vectorize the backward traversal :

	i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
	digptr = &dig[i1 + i2 + 2];
	for (; i2 >= 0; i2--)
		digptr[i2] += var1digit * var2digits[i2];

numeric.c:7380:3: missed: couldn't vectorize loop
numeric.c:7381:15: missed: not vectorized: relevant stmt not supported: _39 = *_38;

Even with forward traversal, the loop below can't be vectorized, seemingly because it involves two index variables :

	count = Min(var2ndigits - 1, res_ndigits - i1 - 3) + 1;
	i = i1 + i2 - count + 3;
	for (i2 = 0; i2 < count; i++, i2++)
		dig[i] += var1digit * var2digits[i2];

numeric.c:7394:3: missed: couldn't vectorize loop
numeric.c:7395:11: missed: not vectorized: not suitable for gather load _37 = *_36;

So it's better to keep the loop simple :

	i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
	digptr = &dig[i1 + 2];
	for (i = 0; i <= i2; i++)
		digptr[i] += var1digit * var2digits[i];

numeric.c:7387:3: optimized: loop vectorized using 16 byte vectors

Attached is the patch that uses the above loop. With the patch, in the mul_var() assembly code, I could see the multiply-accumulate instructions that operate on SIMD vectors (these are arm64 instructions) :

	smlal  v1.4s, v2.4h, v3.4h
	smlal2 v0.4s, v2.8h, v3.8h

I extracted the "SELECT t1.val * t2.val FROM num_data t1, num_data t2" query from regress/sql/numeric_big.sql, and ran it on the data that the test creates (it inserts values with precisions ranging from 500 to 700). Attached is create_schema.sql, which creates the regression test schema. With this query, below are the changes in the mul_var() perf figures with and without the patch. (All the below figures are with a -O3 build.)

HEAD :
+   64.06%  postgres  postgres           [.] mul_var
+   13.00%  postgres  postgres           [.] get_str_from_var
+    6.32%  postgres  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
+    1.65%  postgres  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
+    1.10%  postgres  [kernel.kallsyms]  [k] _raw_spin_lock
+    0.96%  postgres  [kernel.kallsyms]  [k] get_page_from_freelist
+    0.73%  postgres  [kernel.kallsyms]  [k] page_counter_try_charge
+    0.64%  postgres  postgres           [.] AllocSetAlloc

Patched :
+   35.91%  postgres  postgres           [.] mul_var
+   20.43%  postgres  postgres           [.] get_str_from_var
+   13.01%  postgres  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
+    2.31%  postgres  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
+    1.48%  postgres  [kernel.kallsyms]  [k] _raw_spin_lock
+    1.15%  postgres  [kernel.kallsyms]  [k] get_page_from_freelist
+    0.99%  postgres  postgres           [.] AllocSetAlloc
+    0.58%  postgres  postgres           [.] base_yyparse

Times in milliseconds for SELECT t1.val * t2.val FROM num_data t1, num_data t2 :

Machine 1 (amd64)
Head    : .668 .723 .658 .660
Patched : .288 .280 .282 .282

Machine 2 (arm64)
Head    : .897 .879 .888 .897
Patched : .329 .324 .321 .320

Average times in milliseconds for the numeric_big regression test :

Machine 1 (amd64)
Head    : 801
Patched : 445

Machine 2 (arm64)
Head    : 1105
Patched : 550

gcc -O3 option :

I understand we have kept the default gcc CFLAGS at -O2 because, I believe, we might trigger some bugs due to some assumptions in the code if we made it -O3. But with this patch, we allow products built with the -O3 flag to get this benefit. The actual gcc option that enables auto-vectorization is -ftree-loop-vectorize, but it is always on at -O3. What we could do in the future is have a separate file containing such optimized code that is proven to work with such optimization flags, and enable the required compiler flags only for such files if the build is done with -O2.

[1] https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using

--
Thanks,
-Amit Khandekar
Huawei Technologies
create_schema.sql
Description: application/sql
From 91aa5ef1034b763e279b4a8970101355e4a79600 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amitdkhan...@gmail.com>
Date: Tue, 9 Jun 2020 16:35:23 +0530
Subject: [PATCH] Auto-vectorize loop to speedup large-precision numeric
 product

A 'for' loop in mul_var() runs backwards by decrementing two variables.
This prevents the gcc compiler from auto-vectorizing the for loop. So
make it a forward loop with a single variable. This gives performance
benefits for products of numeric types with large precision, with
speedups becoming noticeable from values with precisions starting from
20-40. Typical pattern of benefit is : precision 50 : 5%; precision
60 : 11%; 120 : 50%; 240 : 2.2x; and so on. On some CPU architectures,
the speedup starts from 20 precision onwards. With the precisions used
in the numeric_big regression test, the multiplication speeds up by 2.5
to 2.7 times.

Auto-vectorization happens with the -O3 flag or -ftree-loop-vectorize.
So this benefit will be seen when built with gcc -O3.
---
 src/backend/utils/adt/numeric.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index f3a725271e..4243242ad9 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -7226,6 +7226,7 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	int			res_weight;
 	int			maxdigits;
 	int		   *dig;
+	int		   *digptr;
 	int			carry;
 	int			maxdig;
 	int			newdig;
@@ -7362,10 +7363,14 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 		 *
 		 * As above, digits of var2 can be ignored if they don't contribute,
 		 * so we only include digits for which i1+i2+2 <= res_ndigits - 1.
+		 *
+		 * For large precisions, this can become a bottleneck; so keep this
+		 * for loop simple so that it can be auto-vectorized.
 		 */
-		for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2;
-			 i2 >= 0; i2--)
-			dig[i--] += var1digit * var2digits[i2];
+		i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
+		digptr = &dig[i1 + 2];
+		for (i = 0; i <= i2; i++)
+			digptr[i] += var1digit * var2digits[i];
 	}
 
 	/*
-- 
2.17.1