On Mon, 1 Jul 2024 at 20:56, Joel Jacobson <j...@compiler.org> wrote:
>
> Below is a more realistic benchmark
>
> CREATE TABLE bench_mul_var (num1 numeric, num2 numeric);
>
> INSERT INTO bench_mul_var (num1, num2)
> SELECT random(0::numeric,1e8::numeric), random(0::numeric,1e8::numeric) FROM
> generate_series(1,1e8);
>
> SELECT SUM(num1*num2) FROM bench_mul_var;
I had a play with this, and came up with a slightly different way of
doing it that works for var2 of any size, as long as var1 is just 1 or
2 digits.

Repeating your benchmark, where both numbers have up to 2 NBASE-digits,
this new approach was slightly faster:

SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 4762.990 ms (00:04.763)
Time: 4332.166 ms (00:04.332)
Time: 4276.211 ms (00:04.276)
Time: 4247.321 ms (00:04.247)
Time: 4166.738 ms (00:04.167)

SELECT SUM(num1*num2) FROM bench_mul_var; -- v2 patch
Time: 4398.812 ms (00:04.399)
Time: 3672.668 ms (00:03.673)
Time: 3650.227 ms (00:03.650)
Time: 3611.420 ms (00:03.611)
Time: 3534.218 ms (00:03.534)

SELECT SUM(num1*num2) FROM bench_mul_var; -- this patch
Time: 3350.596 ms (00:03.351)
Time: 3336.224 ms (00:03.336)
Time: 3335.599 ms (00:03.336)
Time: 3336.990 ms (00:03.337)
Time: 3351.453 ms (00:03.351)

(This was on an older Intel Core i9-9900K, so I'm not sure why all the
timings are faster. What compiler settings are you using?)

The approach taken in this patch only uses 32-bit integers, so in
theory it could be extended to work for var1ndigits = 3, 4, or even
more, but the code would get increasingly complex, and I ran out of
steam at 2 digits. It might be worth trying though.

Regards,
Dean
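PS. The 32-bit-only claim is easy to justify with a back-of-the-envelope
bound (mine, not anything taken from the patch): with NBASE = 10000, the
carry never exceeds var1ndigits * (NBASE - 1), so each inner sum is at most

    var1ndigits * (NBASE - 1)^2 + var1ndigits * (NBASE - 1)
        = var1ndigits * (NBASE - 1) * NBASE
        < var1ndigits * 10^8

which stays below INT_MAX (2147483647) for var1ndigits up to about 21.
So var1ndigits = 3 or 4 would be comfortably safe from overflow; the
obstacle is only the code complexity, not the arithmetic.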
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index 5510a20..adbfd5c
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8748,6 +8748,74 @@ mul_var(const NumericVar *var1, const Nu
 	}
 
 	/*
+	 * Simplified fast-path computation, if var1 has just one or two digits.
+	 * This is significantly faster, since it avoids allocating a separate
+	 * digit array, making multiple passes over var2, and having separate
+	 * carry-propagation passes.
+	 */
+	if (var1ndigits <= 2)
+	{
+		NumericDigit *res_buf;
+
+		/* Allocate result digit array */
+		res_buf = digitbuf_alloc(res_ndigits);
+		res_buf[0] = 0;				/* spare digit for later rounding */
+		res_digits = res_buf + 1;
+
+		/*
+		 * Compute the result digits directly, in one pass, propagating the
+		 * carry up as we go.
+		 */
+		switch (var1ndigits)
+		{
+			case 1:
+				carry = 0;
+				for (i = res_ndigits - 3; i >= 0; i--)
+				{
+					newdig = (int) var1digits[0] * var2digits[i] + carry;
+					res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+					carry = newdig / NBASE;
+				}
+				res_digits[0] = (NumericDigit) carry;
+				break;
+
+			case 2:
+				newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
+				res_digits[res_ndigits - 2] = (NumericDigit) (newdig % NBASE);
+				carry = newdig / NBASE;
+
+				for (i = res_ndigits - 4; i > 0; i--)
+				{
+					newdig = (int) var1digits[0] * var2digits[i] +
+						(int) var1digits[1] * var2digits[i - 1] + carry;
+					res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+					carry = newdig / NBASE;
+				}
+
+				newdig = (int) var1digits[0] * var2digits[0] + carry;
+				res_digits[1] = (NumericDigit) (newdig % NBASE);
+				res_digits[0] = (NumericDigit) (newdig / NBASE);
+				break;
+		}
+
+		/* Store the product in result (minus extra rounding digit) */
+		digitbuf_free(result->buf);
+		result->ndigits = res_ndigits - 1;
+		result->buf = res_buf;
+		result->digits = res_digits;
+		result->weight = res_weight - 1;
+		result->sign = res_sign;
+
+		/* Round to target rscale (and set result->dscale) */
+		round_var(result, rscale);
+
+		/* Strip leading and trailing zeroes */
+		strip_var(result);
+
+		return;
+	}
+
+	/*
 	 * We do the arithmetic in an array "dig[]" of signed int's.  Since
 	 * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
 	 * to avoid normalizing carries immediately.
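For anyone who wants to play with the digit indexing outside of
numeric.c, here is a rough standalone sketch of the 2-digit single-pass
technique (not part of the patch; NBASE, NumericDigit and the function
name are redefined locally, and the spare rounding digit, round_var()
and strip_var() steps are left out, so the result indexing is shifted
by one relative to the patch):

#include <stdio.h>
#include <stdint.h>

#define NBASE 10000
typedef int16_t NumericDigit;

/*
 * Multiply a 2-digit var1 by an n-digit var2, most-significant-digit
 * first, in one pass with carry propagation; res_digits must have
 * room for n + 2 digits.
 */
static void
mul_var1x2_sketch(const NumericDigit *var1digits,
				  const NumericDigit *var2digits, int var2ndigits,
				  NumericDigit *res_digits)
{
	int			res_ndigits = var2ndigits + 2;
	int			newdig;
	int			carry;
	int			i;

	/* Last result digit: only var1digits[1] contributes */
	newdig = (int) var1digits[1] * var2digits[var2ndigits - 1];
	res_digits[res_ndigits - 1] = (NumericDigit) (newdig % NBASE);
	carry = newdig / NBASE;

	/* Middle digits: two digit products plus the carry still fit in an int */
	for (i = var2ndigits - 1; i > 0; i--)
	{
		newdig = (int) var1digits[0] * var2digits[i] +
			(int) var1digits[1] * var2digits[i - 1] + carry;
		res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
		carry = newdig / NBASE;
	}

	/* First two result digits: only var1digits[0] contributes */
	newdig = (int) var1digits[0] * var2digits[0] + carry;
	res_digits[1] = (NumericDigit) (newdig % NBASE);
	res_digits[0] = (NumericDigit) (newdig / NBASE);
}

int
main(void)
{
	/* 12345678 * 98765432 = 1219326221002896 */
	NumericDigit var1[] = {1234, 5678};
	NumericDigit var2[] = {9876, 5432};
	NumericDigit res[4];

	mul_var1x2_sketch(var1, var2, 2, res);
	printf("%d %04d %04d %04d\n", res[0], res[1], res[2], res[3]);
	/* prints: 1219 3262 2100 2896 */
	return 0;
}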