On Tue, 2 Jul 2024 at 08:50, Joel Jacobson <j...@compiler.org> wrote:
>
> On Tue, Jul 2, 2024, at 00:19, Dean Rasheed wrote:
>
> > Attachments:
> > * optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
>

Shortly after posting that, I realised that there was a small bug. This bit:

            case 2:
                newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];

isn't quite right in the case where rscale is less than the full
result. In that case, the least significant result digit has a
contribution from one more var2 digit, so it needs to be:

                newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
                if (res_ndigits - 3 < var2ndigits)
                    newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];

That doesn't noticeably affect the performance though. Update attached.

> I've benchmarked your patch on my three machines with great results.
> I added a setseed() step, to make the benchmarks reproducible,
> shouldn't matter much since it should statistically average out, but I 
> thought why not.

Nice. The results on the Apple M3 Max look particularly impressive.

I think it'd probably be worth trying to extend this to 3 or maybe 4
var1 digits, since that would cover a lot of "everyday" sized numeric
values that a lot of people might be using. I don't think we should go
beyond that though.

Regards,
Dean
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index 5510a20..f77ad8b
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8748,6 +8748,76 @@ mul_var(const NumericVar *var1, const Nu
 	}
 
 	/*
+	 * Simplified fast-path computation, if var1 has just one or two digits.
+	 * This is significantly faster, since it avoids allocating a separate
+	 * digit array, making multiple passes over var2, and having separate
+	 * carry-propagation passes.
+	 */
+	if (var1ndigits <= 2)
+	{
+		NumericDigit *res_buf;
+
+		/* Allocate result digit array */
+		res_buf = digitbuf_alloc(res_ndigits);
+		res_buf[0] = 0;				/* spare digit for later rounding */
+		res_digits = res_buf + 1;
+
+		/*
+		 * Compute the result digits directly, in one pass, propagating the
+		 * carry up as we go.
+		 */
+		switch (var1ndigits)
+		{
+			case 1:
+				carry = 0;
+				for (i = res_ndigits - 3; i >= 0; i--)
+				{
+					newdig = (int) var1digits[0] * var2digits[i] + carry;
+					res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+					carry = newdig / NBASE;
+				}
+				res_digits[0] = (NumericDigit) carry;
+				break;
+
+			case 2:
+				newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
+				if (res_ndigits - 3 < var2ndigits)
+					newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
+				res_digits[res_ndigits - 2] = (NumericDigit) (newdig % NBASE);
+				carry = newdig / NBASE;
+
+				for (i = res_ndigits - 4; i > 0; i--)
+				{
+					newdig = (int) var1digits[0] * var2digits[i] +
+						(int) var1digits[1] * var2digits[i - 1] + carry;
+					res_digits[i + 1] = (NumericDigit) (newdig % NBASE);
+					carry = newdig / NBASE;
+				}
+
+				newdig = (int) var1digits[0] * var2digits[0] + carry;
+				res_digits[1] = (NumericDigit) (newdig % NBASE);
+				res_digits[0] = (NumericDigit) (newdig / NBASE);
+				break;
+		}
+
+		/* Store the product in result (minus extra rounding digit) */
+		digitbuf_free(result->buf);
+		result->ndigits = res_ndigits - 1;
+		result->buf = res_buf;
+		result->digits = res_digits;
+		result->weight = res_weight - 1;
+		result->sign = res_sign;
+
+		/* Round to target rscale (and set result->dscale) */
+		round_var(result, rscale);
+
+		/* Strip leading and trailing zeroes */
+		strip_var(result);
+
+		return;
+	}
+
+	/*
 	 * We do the arithmetic in an array "dig[]" of signed int's.  Since
 	 * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
 	 * to avoid normalizing carries immediately.

Reply via email to