On Fri, 12 Jul 2024 at 13:34, Dean Rasheed <dean.a.rash...@gmail.com> wrote:
>
> Then I tried compiling with -m32, and unfortunately this made the
> patch slower than HEAD for small inputs:
>
> -- var1ndigits1=5, var2ndigits2=5 [-m32, SIMD disabled]
> call rate=5.052332e+06  -- HEAD
> call rate=3.883459e+06  -- v2 patch
>
> -- var1ndigits1=6, var2ndigits2=6 [-m32, SIMD disabled]
> call rate=4.7221405e+06  -- HEAD
> call rate=3.7965685e+06  -- v2 patch
>
> -- var1ndigits1=7, var2ndigits2=7 [-m32, SIMD disabled]
> call rate=4.4638375e+06   -- HEAD
> call rate=3.39948e+06     -- v2 patch
>
> and it doesn't reach parity until around ndigits=26, which is
> disappointing. It does still get much faster for large inputs, though.
>

I spent some more time hacking on this, trying to improve the
situation for 32-bit builds. One of the biggest factors was the 64-bit
division that is necessary during carry propagation -- every
compiler/platform I looked at on godbolt.org turns it into a call to a
slow library function such as __udivdi3(). However, since we are
dividing by a constant (NBASE^2), it can be done using the same
multiply-and-shift trick that compilers use in 64-bit builds, and that
significantly improves performance.

Unfortunately, in 32-bit builds, using 64-bit integers is still slower
for small inputs (below 20-30 NBASE digits), so in the end I figured
that it's best to retain the old 32-bit base-NBASE multiplication code
for small inputs, and only use 64-bit base-NBASE^2 multiplication when
the inputs are above a certain size threshold. Furthermore, since this
threshold is quite low, it's possible to make an additional
simplification: as long as the threshold is less than or equal to 42,
it can be shown that there is no chance of 32-bit integer overflow,
and so the "maxdig" tracking and renormalisation code is not needed.
Getting rid of that makes the outer multiplication loop very simple,
and makes quite a noticeable difference to the performance for inputs
below the threshold.

In 64-bit builds, doing 64-bit base-NBASE^2 multiplication is faster
for all inputs over 4 or 5 NBASE digits, so the threshold can be much
lower. However, for numbers near that threshold, it's a close thing,
so it makes sense to also extend mul_var_short() to cover 1-6 digit
inputs, since that gives a much larger gain for numbers of that size.
That's quite nice because it equates to inputs with up to 21-24
decimal digits, which I suspect are quite commonly used in practice.

One risk of having different thresholds in 32-bit and 64-bit builds is
that it opens up the possibility that the results from the
reduced-rscale computations used by inexact functions like ln() and
exp() might differ between platforms (actually, this may already be a
possibility, due to div_var_fast()'s use of floating point arithmetic,
but that seems considerably less likely). In practice, though, this
should be extremely unlikely, because any difference would have to
propagate through MUL_GUARD_DIGITS NBASE digits (8 decimal digits),
and it doesn't show up in any of the existing tests. IMO a very small
chance of different results on different platforms is probably
acceptable, but it wouldn't be acceptable to make the threshold a
runtime-configurable parameter that could lead to different results on
the same platform.

Overall, this has turned out to be more code than I would have liked,
but I think it's worth it, because the performance gains are pretty
substantial across the board.

(Here, I'm comparing with REL_17_STABLE, so that the effect of
mul_var_short() for ndigits <= 6 can be seen.)

32-bit build
============

Balanced inputs:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
----------+----------+---------------+---------------+----------
        1 |        1 |  5.973068e+06 |  6.873059e+06 | +15.07%
        2 |        2 |  5.646598e+06 | 6.6456665e+06 | +17.69%
        3 |        3 | 5.8176995e+06 | 7.0903175e+06 | +21.87%
        4 |        4 |  5.545298e+06 | 6.9701605e+06 | +25.69%
        5 |        5 | 5.2109155e+06 | 6.6406185e+06 | +27.44%
        6 |        6 | 4.9276335e+06 |  6.478698e+06 | +31.48%
        7 |        7 | 4.6595095e+06 | 4.8514485e+06 | +4.12%
        8 |        8 |  4.898536e+06 | 5.1599975e+06 | +5.34%
        9 |        9 |  4.537171e+06 |  4.830987e+06 | +6.48%
       10 |       10 |  4.308139e+06 | 4.6029265e+06 | +6.84%
       11 |       11 |  4.092612e+06 |   4.37966e+06 | +7.01%
       12 |       12 | 3.9345035e+06 |  4.213998e+06 | +7.10%
       13 |       13 | 3.7600162e+06 | 4.0115955e+06 | +6.69%
       14 |       14 | 3.5959855e+06 | 3.8216508e+06 | +6.28%
       15 |       15 | 3.3576732e+06 | 3.6070588e+06 | +7.43%
       16 |       16 |  3.657139e+06 | 3.9067975e+06 | +6.83%
       17 |       17 | 3.3601955e+06 | 3.5651722e+06 | +6.10%
       18 |       18 | 3.1844472e+06 | 3.4542238e+06 | +8.47%
       19 |       19 | 3.0286392e+06 | 3.2380662e+06 | +6.91%
       20 |       20 | 2.9012185e+06 | 3.0496352e+06 | +5.12%
       21 |       21 |  2.755444e+06 | 2.9508798e+06 | +7.09%
       22 |       22 | 2.6263908e+06 | 2.8168945e+06 | +7.25%
       23 |       23 | 2.5470438e+06 | 2.7056318e+06 | +6.23%
       24 |       24 |  2.764418e+06 | 2.9597732e+06 | +7.07%
       25 |       25 |  2.509954e+06 | 2.7368215e+06 | +9.04%
       26 |       26 | 2.3699395e+06 |  2.565404e+06 | +8.25%
       27 |       27 |  2.286344e+06 | 2.4400948e+06 | +6.72%
       28 |       28 |  2.199087e+06 |  2.361374e+06 | +7.38%
       29 |       29 | 2.1208148e+06 |   2.26925e+06 | +7.00%
       30 |       30 | 2.0469475e+06 | 2.2127455e+06 | +8.10%
       31 |       31 | 1.9973804e+06 | 2.3771615e+06 | +19.01%
       32 |       32 | 2.1288205e+06 | 2.3166555e+06 | +8.82%
       33 |       33 | 1.9876898e+06 | 2.1759028e+06 | +9.47%
       34 |       34 | 1.8906434e+06 |  2.169534e+06 | +14.75%
       35 |       35 | 1.8069352e+06 |  1.990085e+06 | +10.14%
       36 |       36 | 1.7412926e+06 | 1.9940908e+06 | +14.52%
       37 |       37 | 1.6956435e+06 | 1.8492525e+06 | +9.06%
       38 |       38 | 1.6253338e+06 | 1.8493976e+06 | +13.79%
       39 |       39 | 1.5734566e+06 | 1.9296996e+06 | +22.64%
       40 |       40 | 1.6692021e+06 |  1.902438e+06 | +13.97%
       50 |       50 | 1.1116885e+06 | 1.5319529e+06 | +37.80%
      100 |      100 |     399552.38 |     722142.44 | +80.74%
      250 |      250 |      81092.02 |     195967.67 | +141.66%
      500 |      500 |     21654.633 |     58329.473 | +169.36%
     1000 |     1000 |     5525.9775 |     16420.611 | +197.15%
     2500 |     2500 |     907.98206 |     2825.7324 | +211.21%
     5000 |     5000 |     230.26935 |     731.26105 | +217.57%
    10000 |    10000 |     57.793922 |     179.12334 | +209.93%
    16383 |    16383 |     21.446728 |      64.39028 | +200.23%

Unbalanced inputs:

 ndigits1 | ndigits2 | PG17 rate | patch rate | % change
----------+----------+-----------+------------+----------
        1 |    10000 |  42816.89 |   52843.01 | +23.42%
        2 |    10000 |  41032.25 |  52111.957 | +27.00%
        3 |    10000 | 39550.176 |  52262.477 | +32.14%
        4 |    10000 |  38015.59 |  43962.535 | +15.64%
        5 |    10000 |  36560.22 |  43736.305 | +19.63%
        6 |    10000 |  35305.77 |    38204.2 | +8.21%
        7 |    10000 | 33833.086 |  36533.387 | +7.98%
        8 |    10000 | 32847.996 |  35774.715 | +8.91%
        9 |    10000 | 31345.736 |  33831.926 | +7.93%
       10 |    10000 |   30351.6 |  32715.969 | +7.79%
       11 |    10000 | 29321.592 |  31478.398 | +7.36%
       12 |    10000 | 28616.018 |  30861.885 | +7.85%
       13 |    10000 |  28216.12 |   29510.95 | +4.59%
       14 |    10000 | 27396.408 |   29077.73 | +6.14%
       15 |    10000 |  26519.08 |   28235.08 | +6.47%
       16 |    10000 | 25778.102 |  27538.908 | +6.83%
       17 |    10000 | 26024.918 |  26677.498 | +2.51%
       18 |    10000 | 25316.346 |  26729.395 | +5.58%
       19 |    10000 |  24626.07 |  26076.979 | +5.89%
       20 |    10000 | 23912.383 |  25709.967 | +7.52%
       21 |    10000 | 23238.488 |   24761.57 | +6.55%
       22 |    10000 |  22746.25 |  23925.934 | +5.19%
       23 |    10000 | 22120.777 |   23442.34 | +5.97%
       24 |    10000 | 21645.215 |  22771.193 | +5.20%
       25 |    10000 | 21135.049 |  22185.893 | +4.97%
       26 |    10000 | 20685.025 |   21831.74 | +5.54%
       27 |    10000 | 20039.559 |  20854.793 | +4.07%
       28 |    10000 | 19846.092 |  21072.758 | +6.18%
       29 |    10000 | 19414.059 |  20289.414 | +4.51%
       30 |    10000 | 18968.617 |  19774.797 | +4.25%
       31 |    10000 | 18394.988 |  21307.074 | +15.83%
       32 |    10000 | 18291.504 |  21349.691 | +16.72%
       33 |    10000 | 17899.676 |  20885.299 | +16.68%
       34 |    10000 | 17505.402 |  20620.105 | +17.79%
       35 |    10000 | 17174.918 |  20383.594 | +18.68%
       36 |    10000 | 16609.867 |   20342.18 | +22.47%
       37 |    10000 | 16457.889 |   19953.91 | +21.24%
       38 |    10000 |  15926.13 |  19783.203 | +24.22%
       39 |    10000 | 15441.283 |  19288.654 | +24.92%
       40 |    10000 | 15038.773 |   19415.52 | +29.10%
       50 |    10000 | 11264.285 |  17608.648 | +56.32%
      100 |    10000 | 6253.7637 |  11620.726 | +85.82%
      250 |    10000 |  2696.207 |  5939.3857 | +120.29%
      500 |    10000 | 1338.4141 |  3268.2004 | +144.18%
     1000 |    10000 |  672.6717 |  1691.9614 | +151.53%
     2500 |    10000 |  267.5996 |  708.20386 | +164.65%
     5000 |    10000 | 131.50755 |  353.92822 | +169.13%

numeric_big regression test:

ok 224       - numeric_big                               279 ms  [PG17]
ok 224       - numeric_big                               182 ms  [v3 patch]


64-bit build
============

Balanced inputs:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
----------+----------+---------------+---------------+----------
        1 |        1 |  8.507485e+06 |   9.53221e+06 | +12.04%
        2 |        2 | 8.0975715e+06 |  9.431853e+06 | +16.48%
        3 |        3 |  6.461359e+06 | 7.3669945e+06 | +14.02%
        4 |        4 | 6.1728355e+06 |  7.152418e+06 | +15.87%
        5 |        5 |  6.500831e+06 | 7.6977115e+06 | +18.41%
        6 |        6 | 6.1784075e+06 | 7.3765005e+06 | +19.39%
        7 |        7 |   5.90117e+06 | 6.2799965e+06 | +6.42%
        8 |        8 | 5.9217105e+06 |  6.147141e+06 | +3.81%
        9 |        9 |  5.477262e+06 | 5.9889475e+06 | +9.34%
       10 |       10 | 5.2147235e+06 |  5.858963e+06 | +12.35%
       11 |       11 |  4.882895e+06 | 5.6766675e+06 | +16.26%
       12 |       12 |   4.61105e+06 |  5.559544e+06 | +20.57%
       13 |       13 |  4.382494e+06 | 5.3770165e+06 | +22.69%
       14 |       14 |  4.134509e+06 |  5.256462e+06 | +27.14%
       15 |       15 | 3.7595882e+06 | 5.0751355e+06 | +34.99%
       16 |       16 | 4.3353435e+06 |  4.970363e+06 | +14.65%
       17 |       17 | 3.9258755e+06 |  4.935394e+06 | +25.71%
       18 |       18 | 3.7562495e+06 | 4.8723875e+06 | +29.71%
       19 |       19 | 3.4890312e+06 |  4.723648e+06 | +35.39%
       20 |       20 |  3.289758e+06 | 4.6569305e+06 | +41.56%
       21 |       21 |  3.103908e+06 | 4.4747755e+06 | +44.17%
       22 |       22 | 2.9545148e+06 | 4.4227305e+06 | +49.69%
       23 |       23 | 2.7975982e+06 | 4.5065035e+06 | +61.08%
       24 |       24 | 3.2456168e+06 | 4.4578115e+06 | +37.35%
       25 |       25 | 2.9515055e+06 | 4.0208335e+06 | +36.23%
       26 |       26 | 2.8158568e+06 | 4.0437498e+06 | +43.61%
       27 |       27 | 2.6376518e+06 | 3.8959785e+06 | +47.71%
       28 |       28 | 2.5094318e+06 | 3.8648428e+06 | +54.01%
       29 |       29 | 2.3714905e+06 |   3.67182e+06 | +54.83%
       30 |       30 | 2.2456015e+06 | 3.6337285e+06 | +61.82%
       31 |       31 |  2.169437e+06 | 3.7209152e+06 | +71.52%
       32 |       32 | 2.5022498e+06 | 3.6609378e+06 | +46.31%
       33 |       33 |   2.27133e+06 |  3.435459e+06 | +51.25%
       34 |       34 | 2.1836462e+06 | 3.4042262e+06 | +55.90%
       35 |       35 | 2.0753196e+06 | 3.2125422e+06 | +54.80%
       36 |       36 | 1.9650498e+06 |  3.185525e+06 | +62.11%
       37 |       37 | 1.8668318e+06 | 3.0366508e+06 | +62.66%
       38 |       38 | 1.7678832e+06 |  3.014941e+06 | +70.54%
       39 |       39 | 1.6764314e+06 | 3.1080448e+06 | +85.40%
       40 |       40 | 1.9170026e+06 | 3.0942025e+06 | +61.41%
       50 |       50 | 1.2242934e+06 | 2.3769868e+06 | +94.15%
      100 |      100 |     401733.62 | 1.0854601e+06 | +170.19%
      250 |      250 |      81861.45 |     249837.78 | +205.20%
      500 |      500 |     21613.402 |      71239.04 | +229.61%
     1000 |     1000 |      5551.617 |     18349.414 | +230.52%
     2500 |     2500 |       906.501 |     3107.6462 | +242.82%
     5000 |     5000 |     231.65045 |     794.86444 | +243.13%
    10000 |    10000 |     58.372395 |     188.37387 | +222.71%
    16383 |    16383 |     21.629467 |      73.58552 | +240.21%

Unbalanced inputs:

 ndigits1 | ndigits2 | PG17 rate | patch rate | % change
----------+----------+-----------+------------+----------
        1 |    10000 | 44137.258 |  62292.844 | +41.13%
        2 |    10000 | 42009.895 |  62705.445 | +49.26%
        3 |    10000 | 40569.617 |  58555.727 | +44.33%
        4 |    10000 |  38914.03 |  49439.008 | +27.05%
        5 |    10000 |  37361.39 |  47173.445 | +26.26%
        6 |    10000 |  35807.61 |  42609.203 | +18.99%
        7 |    10000 | 34386.684 |  49325.582 | +43.44%
        8 |    10000 | 33380.312 |   49298.59 | +47.69%
        9 |    10000 |  32228.17 |  46869.844 | +45.43%
       10 |    10000 |  31022.46 |   47015.98 | +51.55%
       11 |    10000 | 29782.623 |      45074 | +51.34%
       12 |    10000 | 29540.896 |   44712.23 | +51.36%
       13 |    10000 | 28521.643 |   42589.98 | +49.33%
       14 |    10000 |  27772.59 |  43325.863 | +56.00%
       15 |    10000 |  26871.25 |      41443 | +54.23%
       16 |    10000 | 26179.322 |  41245.508 | +57.55%
       17 |    10000 | 26367.402 |  39348.543 | +49.23%
       18 |    10000 | 25769.176 |  40105.402 | +55.63%
       19 |    10000 | 24869.504 |   38316.44 | +54.07%
       20 |    10000 | 24395.436 |   37647.33 | +54.32%
       21 |    10000 | 23532.748 |  36552.914 | +55.33%
       22 |    10000 | 23151.842 |   36824.04 | +59.05%
       23 |    10000 |  22661.33 |  35556.918 | +56.91%
       24 |    10000 | 22113.502 |   34923.83 | +57.93%
       25 |    10000 | 21481.773 |  33601.785 | +56.42%
       26 |    10000 | 20943.576 |   34277.58 | +63.67%
       27 |    10000 | 20437.605 |  32957.406 | +61.26%
       28 |    10000 |  20049.12 |   32413.64 | +61.67%
       29 |    10000 | 19674.787 |  31537.846 | +60.30%
       30 |    10000 | 19092.572 |  32252.404 | +68.93%
       31 |    10000 | 18761.932 |  30825.836 | +64.30%
       32 |    10000 | 18480.184 |  30616.389 | +65.67%
       33 |    10000 |  18130.89 |  29493.594 | +62.67%
       34 |    10000 | 17750.996 |   30054.01 | +69.31%
       35 |    10000 |  17406.83 |  29090.297 | +67.12%
       36 |    10000 |  17138.23 |   29117.42 | +69.90%
       37 |    10000 | 16666.799 |   28429.32 | +70.57%
       38 |    10000 | 16144.025 |  29082.398 | +80.14%
       39 |    10000 | 15548.838 |  28195.258 | +81.33%
       40 |    10000 | 15305.571 |  27273.215 | +78.19%
       50 |    10000 | 11099.766 |  25494.129 | +129.68%
      100 |    10000 | 6310.7827 |  14895.447 | +136.03%
      250 |    10000 | 2687.7397 |  7149.1016 | +165.99%
      500 |    10000 | 1354.7455 |  3608.8845 | +166.39%
     1000 |    10000 |  677.3838 |  1852.1256 | +173.42%
     2500 |    10000 | 269.74582 |   748.5785 | +177.51%
     5000 |    10000 |  132.6432 |  377.23288 | +184.40%

numeric_big regression test:

ok 224       - numeric_big                               256 ms  [PG17]
ok 224       - numeric_big                               161 ms  [v3 patch]

Regards,
Dean
From 9d22b244e257d2e4cccc321b7d5ed6d90f5ea3a4 Mon Sep 17 00:00:00 2001
From: Dean Rasheed <dean.a.rash...@gmail.com>
Date: Thu, 18 Jul 2024 18:32:56 +0100
Subject: [PATCH v3 2/2] Optimise numeric multiplication using base-NBASE^2
 arithmetic.

Currently mul_var() uses the schoolbook multiplication algorithm,
which is O(n^2) in the number of NBASE digits. To improve performance
for large inputs, convert the inputs to base NBASE^2 before
multiplying, which effectively halves the number of digits in each
input, theoretically speeding up the computation by a factor of 4. In
practice, the actual speedup for large inputs varies between around 3
and 6 times, depending on the system and compiler used. In turn, this
significantly reduces the runtime of the numeric_big regression test.

For this to work, 64-bit integers are required for the products of
base-NBASE^2 digits, so this works best on 64-bit machines, for which
it is faster whenever the shorter input has more than 4 or 5 NBASE
digits. On 32-bit machines, the additional overheads, especially
during carry propagation and the final conversion back to base-NBASE,
are significantly higher, and it is only faster when the shorter input
has more than around 30 NBASE digits. Therefore, only use this
approach above a platform-dependent threshold.

For inputs below the threshold, the original base-NBASE algorithm is
used, except that it can be simplified because the threshold is low
enough that intermediate carry-propagation passes are not required.
Above the threshold, the available headroom in 64-bit integers is much
larger than for 32-bit integers, so the frequency of carry-propagation
passes is greatly reduced. In addition, unsigned integers are used
throughout, further increasing the headroom.
---
 src/backend/utils/adt/numeric.c | 512 +++++++++++++++++++++++++-------
 1 file changed, 401 insertions(+), 111 deletions(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 9b9b88662a..c463901428 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -101,6 +101,29 @@ typedef signed char NumericDigit;
 typedef int16 NumericDigit;
 #endif
 
+/*
+ * Above a certain size threshold, it is faster to multiply numbers by
+ * converting them to base NBASE^2, and use 64-bit integer arithmetic. This
+ * threshold is determined empirically, and is necessarily higher on 32-bit
+ * machines, which are less efficient at performing 64-bit arithmetic.
+ *
+ * To simplify the computation below this threshold, it is intentionally kept
+ * the point at which intermediate carry-propagation passes may be necessary.
+ * Therefore, as explained in mul_var(), it must be no larger than
+ * (PG_UINT32_MAX - PG_UINT32_MAX / NBASE) / (NBASE - 1)^2, which is 42 when
+ * NBASE is 10000.
+ */
+#define MUL_64BIT_THRESHOLD_MAX \
+	((PG_UINT32_MAX - PG_UINT32_MAX / NBASE) / (NBASE - 1) / (NBASE - 1))
+
+#if SIZEOF_DATUM < 8
+#define MUL_64BIT_THRESHOLD 30
+#else
+#define MUL_64BIT_THRESHOLD 4
+#endif
+
+#define NBASE_SQR	(NBASE * NBASE)
+
 /*
  * The Numeric type as stored on disk.
  *
@@ -8663,6 +8686,85 @@ sub_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result)
 }
 
 
+/*
+ * div_mod_NBASE_SQR() -
+ *
+ *	Divide a 64-bit integer "num" by NBASE_SQR, returning the quotient and
+ *	remainder.  Technically, the remainder could be a 32-bit integer, but the
+ *	caller actually wants a 64-bit integer, so this is more efficient.
+ */
+static inline uint64
+div_mod_NBASE_SQR(uint64 num, uint64 *rem)
+{
+	uint64		quot;
+
+	/* ----------
+	 * On a 32-bit machine, the compiler does 64-bit division using a builtin
+	 * function such as __udivdi3(), which is very slow.  Replace that with a
+	 * multiply-and-shift algorithm, based on the way compilers do it on
+	 * 64-bit machines.  Assuming that DEC_DIGITS is 4, and NBASE_SQR = 10^8,
+	 * the multiply-and-shift formula is
+	 *
+	 *	quot = num / 10^8 = (num * multiplier) >> 90
+	 *
+	 * where multiplier = ceil(2^90 / 10^8) = 12379400392853802749.
+	 *
+	 * The 2^90 scaling factor here guarantees correct results for all inputs.
+	 * See "Division by Invariant Integers using Multiplication", Torbjorn
+	 * Granlund and Peter L. Montgomery, PLDI '94: Proceedings of the ACM
+	 * SIGPLAN 1994 conference on Programming language design and
+	 * implementation (https://dl.acm.org/doi/pdf/10.1145/178243.178249).
+	 *
+	 * Since num and multiplier are 64-bit unsigned integers, their product is
+	 * a 128-bit unsigned integer, but we only require the high part.  This is
+	 * done by decomposing num and multiplier into high and low 32-bit parts,
+	 * and then computing the upper 64 bits of the full product:
+	 *
+	 *	num * multiplier =
+	 *		(num_hi * multiplier_hi) << 64 +
+	 *		(num_hi * multiplier_lo + num_lo * multiplier_hi) << 32 +
+	 *		num_lo * multiplier_lo
+	 *
+	 * We don't bother with this optimization for other NBASE values.
+	 * ----------
+	 */
+#if SIZEOF_DATUM < 8 && DEC_DIGITS == 4
+	const uint64 multiplier = UINT64CONST(12379400392853802749);
+
+	/* high and low 32-bit parts of num and multiplier */
+#define UINT64_HI32(x) ((uint32) ((x) >> 32))
+#define UINT64_LO32(x) ((uint32) (x))
+	const uint32 multiplier_hi = UINT64_HI32(multiplier);
+	const uint32 multiplier_lo = UINT64_LO32(multiplier);
+	uint32		num_hi = UINT64_HI32(num);
+	uint32		num_lo = UINT64_LO32(num);
+	uint64		tmp1,
+				tmp2,
+				prod_hi;
+
+	/* high 64-bit part of 128-bit product */
+	tmp1 = (uint64) num_hi * multiplier_lo;
+	tmp2 = (uint64) num_lo * multiplier_lo;
+	tmp1 += UINT64_HI32(tmp2);
+	prod_hi = (uint64) num_hi * multiplier_hi;
+	prod_hi += UINT64_HI32(tmp1);
+	tmp2 = (uint64) num_lo * multiplier_hi;
+	tmp2 += UINT64_LO32(tmp1);
+	prod_hi += UINT64_HI32(tmp2);
+
+	/* quotient is in the top 38 bits */
+	quot = prod_hi >> 26;
+#else
+	/* just divide normally */
+	quot = num / NBASE_SQR;
+#endif
+	/* remainder */
+	*rem = num - quot * NBASE_SQR;
+
+	return quot;
+}
+
+
 /*
  * mul_var() -
  *
@@ -8677,10 +8779,6 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	int			res_sign;
 	int			res_weight;
 	int			maxdigits;
-	int		   *dig;
-	int			carry;
-	int			maxdig;
-	int			newdig;
 	int			var1ndigits;
 	int			var2ndigits;
 	NumericDigit *var1digits;
@@ -8688,7 +8786,8 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	NumericDigit *res_digits;
 	int			i,
 				i1,
-				i2;
+				i2,
+				i2limit;
 
 	/*
 	 * Arrange for var1 to be the shorter of the two numbers.  This improves
@@ -8729,141 +8828,332 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 		return;
 	}
 
-	/* Determine result sign and (maximum possible) weight */
+	/* Determine result sign */
 	if (var1->sign == var2->sign)
 		res_sign = NUMERIC_POS;
 	else
 		res_sign = NUMERIC_NEG;
-	res_weight = var1->weight + var2->weight + 2;
 
 	/*
-	 * Determine the number of result digits to compute.  If the exact result
-	 * would have more than rscale fractional digits, truncate the computation
-	 * with MUL_GUARD_DIGITS guard digits, i.e., ignore input digits that
-	 * would only contribute to the right of that.  (This will give the exact
-	 * rounded-to-rscale answer unless carries out of the ignored positions
-	 * would have propagated through more than MUL_GUARD_DIGITS digits.)
+	 * We do the arithmetic in an array "dig[]" of unsigned 32-bit or 64-bit
+	 * integers, depending on the size of var1.
 	 *
-	 * Note: an exact computation could not produce more than var1ndigits +
-	 * var2ndigits digits, but we allocate one extra output digit in case
-	 * rscale-driven rounding produces a carry out of the highest exact digit.
-	 */
-	res_ndigits = var1ndigits + var2ndigits + 1;
-	maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
-		MUL_GUARD_DIGITS;
-	res_ndigits = Min(res_ndigits, maxdigits);
+	 * If var1 has more than MUL_64BIT_THRESHOLD digits, we convert the inputs
+	 * to base NBASE^2 and multiply using 64-bit integer arithmetic, which is
+	 * much faster, since schoolbook multiplication is O(N^2) in the number of
+	 * input digits, and working in base NBASE^2 effectively halves "N".
+	 *
+	 * Below this threshold, we work with the original base-NBASE numbers, and
+	 * use 32-bit integer arithmetic.  To simplify the algorithm, we ensure
+	 * that the threshold is low enough so that the number of products being
+	 * added to any element of dig[] is small enough to avoid integer
+	 * overflow.  Furthermore, we need to ensure that overflow doesn't occur
+	 * during the final carry-propagation pass.  The carry values can be as
+	 * large as PG_UINT32_MAX / NBASE, and so the values in dig[] must not
+	 * exceed PG_UINT32_MAX - PG_UINT32_MAX / NBASE.  Since each product of
+	 * digits is at most (NBASE - 1)^2, the number of products must not exceed
+	 * (PG_UINT32_MAX - PG_UINT32_MAX / NBASE) / (NBASE - 1)^2.
+	 */
+	StaticAssertStmt(MUL_64BIT_THRESHOLD <= MUL_64BIT_THRESHOLD_MAX,
+					 "MUL_64BIT_THRESHOLD must not exceed MUL_64BIT_THRESHOLD_MAX");
+
+	if (var1ndigits <= MUL_64BIT_THRESHOLD)
+	{
+		uint32	   *dig,
+				   *dig_i1_2;
+		NumericDigit var1digit;
+		uint32		carry;
+		uint32		newdig;
 
-	if (res_ndigits < 3)
-	{
-		/* All input digits will be ignored; so result is zero */
-		zero_var(result);
-		result->dscale = rscale;
-		return;
-	}
+		/*
+		 * Determine the number of result digits to compute and the (maximum
+		 * possible) result weight.  If the exact result would have more than
+		 * rscale fractional digits, truncate the computation with
+		 * MUL_GUARD_DIGITS guard digits, i.e., ignore input digits that would
+		 * only contribute to the right of that.  (This will give the exact
+		 * rounded-to-rscale answer unless carries out of the ignored
+		 * positions would have propagated through more than MUL_GUARD_DIGITS
+		 * digits.)
+		 *
+		 * Note: an exact computation could not produce more than var1ndigits
+		 * + var2ndigits digits, but we allocate at least one extra output
+		 * digit in case rscale-driven rounding produces a carry out of the
+		 * highest exact digit.
+		 */
+		res_ndigits = var1ndigits + var2ndigits + 1;
+		res_weight = var1->weight + var2->weight + 2;
 
-	/*
-	 * We do the arithmetic in an array "dig[]" of signed int's.  Since
-	 * INT_MAX is noticeably larger than NBASE*NBASE, this gives us headroom
-	 * to avoid normalizing carries immediately.
-	 *
-	 * maxdig tracks the maximum possible value of any dig[] entry; when this
-	 * threatens to exceed INT_MAX, we take the time to propagate carries.
-	 * Furthermore, we need to ensure that overflow doesn't occur during the
-	 * carry propagation passes either.  The carry values could be as much as
-	 * INT_MAX/NBASE, so really we must normalize when digits threaten to
-	 * exceed INT_MAX - INT_MAX/NBASE.
-	 *
-	 * To avoid overflow in maxdig itself, it actually represents the max
-	 * possible value divided by NBASE-1, ie, at the top of the loop it is
-	 * known that no dig[] entry exceeds maxdig * (NBASE-1).
-	 */
-	dig = (int *) palloc0(res_ndigits * sizeof(int));
-	maxdig = 0;
+		maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+			MUL_GUARD_DIGITS;
+		res_ndigits = Min(res_ndigits, maxdigits);
 
-	/*
-	 * The least significant digits of var1 should be ignored if they don't
-	 * contribute directly to the first res_ndigits digits of the result that
-	 * we are computing.
-	 *
-	 * Digit i1 of var1 and digit i2 of var2 are multiplied and added to digit
-	 * i1+i2+2 of the accumulator array, so we need only consider digits of
-	 * var1 for which i1 <= res_ndigits - 3.
-	 */
-	for (i1 = Min(var1ndigits - 1, res_ndigits - 3); i1 >= 0; i1--)
-	{
-		NumericDigit var1digit = var1digits[i1];
+		if (res_ndigits < 3)
+		{
+			/* All input digits will be ignored; so result is zero */
+			zero_var(result);
+			result->dscale = rscale;
+			return;
+		}
 
-		if (var1digit == 0)
-			continue;
+		/* Allocate dig[] to accumulate the digit products */
+		dig = (uint32 *) palloc(res_ndigits * sizeof(uint32));
+
+		/*
+		 * Start by multiplying var2 by the least significant contributing
+		 * digit of var1, storing the results at the end of dig[], and filling
+		 * the leading slots with zeros.
+		 *
+		 * The least significant digits of var1 should be ignored if they
+		 * don't contribute directly to the first res_ndigits digits of the
+		 * result that we are computing.
+		 *
+		 * Digit i1 of var1 and digit i2 of var2 are multiplied and added to
+		 * digit i1+i2+2 of the accumulator array, so we need only consider
+		 * digits of var1 for which i1 <= res_ndigits - 3.
+		 *
+		 * The loop here is the same as the inner loop below, except that we
+		 * set the results in dig[], rather than adding to them.  This is the
+		 * performance bottleneck for multiplication, so we want to keep it
+		 * simple enough so that it can be auto-vectorized.  Accordingly,
+		 * process the digits left-to-right even though schoolbook
+		 * multiplication would suggest right-to-left.  Since we aren't
+		 * propagating carries in this loop, the order does not matter.
+		 */
+		i1 = Min(var1ndigits - 1, res_ndigits - 3);
+		var1digit = var1digits[i1];
+
+		i2limit = Min(var2ndigits, res_ndigits - i1 - 2);
+		dig_i1_2 = &dig[i1 + 2];
+
+		memset(dig, 0, (i1 + 2) * sizeof(uint32));
+		for (i2 = 0; i2 < i2limit; i2++)
+			dig_i1_2[i2] = var1digit * var2digits[i2];
 
-		/* Time to normalize? */
-		maxdig += var1digit;
-		if (maxdig > (INT_MAX - INT_MAX / NBASE) / (NBASE - 1))
+		/*
+		 * Next, multiply var2 by the remaining digits of var1, adding the
+		 * results to dig[] at the appropriate offsets.
+		 */
+		for (i1 = i1 - 1; i1 >= 0; i1--)
 		{
-			/* Yes, do it */
-			carry = 0;
-			for (i = res_ndigits - 1; i >= 0; i--)
+			var1digit = var1digits[i1];
+			if (var1digit != 0)
 			{
-				newdig = dig[i] + carry;
-				if (newdig >= NBASE)
-				{
-					carry = newdig / NBASE;
-					newdig -= carry * NBASE;
-				}
-				else
-					carry = 0;
-				dig[i] = newdig;
+				i2limit = Min(var2ndigits, res_ndigits - i1 - 2);
+				dig_i1_2 = &dig[i1 + 2];
+
+				for (i2 = 0; i2 < i2limit; i2++)
+					dig_i1_2[i2] += var1digit * var2digits[i2];
 			}
-			Assert(carry == 0);
-			/* Reset maxdig to indicate new worst-case */
-			maxdig = 1 + var1digit;
 		}
 
 		/*
-		 * Add the appropriate multiple of var2 into the accumulator.
+		 * Finally, construct the result digits by propagating carries up,
+		 * normalizing back to base-NBASE.  Note that this is still done at
+		 * full precision w/guard digits.
+		 */
+		alloc_var(result, res_ndigits);
+		res_digits = result->digits;
+		carry = 0;
+		for (i = res_ndigits - 1; i >= 0; i--)
+		{
+			newdig = dig[i] + carry;
+			if (newdig >= NBASE)
+			{
+				carry = newdig / NBASE;
+				newdig -= carry * NBASE;
+			}
+			else
+				carry = 0;
+			res_digits[i] = newdig;
+		}
+		Assert(carry == 0);
+
+		pfree(dig);
+	}
+	else
+	{
+		int			var1ndigitpairs;
+		int			var2ndigitpairs;
+		int			res_ndigitpairs;
+		int			pair_offset;
+		int			maxdigitpairs;
+		uint64	   *dig,
+				   *dig_i1_off;
+		uint32	   *var2digitpairs;
+		uint32		var1digitpair;
+		uint64		maxdig;
+		uint64		carry;
+		uint64		newdig;
+
+		/*
+		 * As above, determine the number of result digits to compute and the
+		 * (maximum possible) result weight, except that here we will be
+		 * working in base NBASE^2 and so we process the digits of each input
+		 * in pairs.
+		 */
+		/* digit pairs in each input */
+		var1ndigitpairs = (var1ndigits + 1) / 2;
+		var2ndigitpairs = (var2ndigits + 1) / 2;
+
+		/* digits in exact result */
+		res_ndigits = var1ndigits + var2ndigits;
+
+		/* digit pairs in exact result with at least one extra output digit */
+		res_ndigitpairs = res_ndigits / 2 + 1;
+
+		/* pair offset to align result to end of dig[] */
+		pair_offset = res_ndigitpairs - var1ndigitpairs - var2ndigitpairs + 1;
+
+		/* maximum possible result weight */
+		res_weight = var1->weight + var2->weight + 1 + 2 * res_ndigitpairs -
+			res_ndigits;
+
+		/* truncate computation based on requested rscale */
+		maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+			MUL_GUARD_DIGITS;
+		maxdigitpairs = (maxdigits + 1) / 2;
+
+		res_ndigitpairs = Min(res_ndigitpairs, maxdigitpairs);
+		res_ndigits = 2 * res_ndigitpairs;
+
+		if (res_ndigitpairs <= pair_offset)
+		{
+			/* All input digits will be ignored; so result is zero */
+			zero_var(result);
+			result->dscale = rscale;
+			return;
+		}
+		var1ndigitpairs = Min(var1ndigitpairs, res_ndigitpairs - pair_offset);
+		var2ndigitpairs = Min(var2ndigitpairs, res_ndigitpairs - pair_offset);
+
+		/*
+		 * Since we will be working in base NBASE^2, we make dig[] an array of
+		 * unsigned 64-bit integers, and since PG_UINT64_MAX is much larger
+		 * than NBASE^4, this gives us a lot of headroom to avoid normalizing
+		 * carries immediately.
+		 *
+		 * maxdig tracks the maximum possible value of any dig[] entry; when
+		 * this threatens to exceed PG_UINT64_MAX, we take the time to
+		 * propagate carries.  Furthermore, we need to ensure that overflow
+		 * doesn't occur during the carry propagation passes either.  The
+		 * carry values could be as much as PG_UINT64_MAX / NBASE^2, so really
+		 * we must normalize when digits threaten to exceed PG_UINT64_MAX -
+		 * PG_UINT64_MAX / NBASE^2.
 		 *
-		 * As above, digits of var2 can be ignored if they don't contribute,
-		 * so we only include digits for which i1+i2+2 < res_ndigits.
+		 * To avoid overflow in maxdig itself, it actually represents the
+		 * maximum possible value divided by NBASE^2-1, i.e., at the top of
+		 * the loop it is known that no dig[] entry exceeds maxdig *
+		 * (NBASE^2-1).
 		 *
-		 * This inner loop is the performance bottleneck for multiplication,
-		 * so we want to keep it simple enough so that it can be
-		 * auto-vectorized.  Accordingly, process the digits left-to-right
-		 * even though schoolbook multiplication would suggest right-to-left.
-		 * Since we aren't propagating carries in this loop, the order does
-		 * not matter.
+		 * The conversion of var1 to base NBASE^2 is done on the fly, as each
+		 * new digit is required.  The digits of var2 are converted upfront,
+		 * and stored at the end of dig[].  To avoid loss of precision, the
+		 * input digits are aligned with the start of the digit pair array,
+		 * effectively shifting them up (multiplying by NBASE) if the input
+		 * has an odd number of NBASE digits.
+		 */
+		dig = (uint64 *) palloc(res_ndigitpairs * sizeof(uint64) +
+								var2ndigitpairs * sizeof(uint32));
+
+		/* convert var2 to base NBASE^2, shifting up if length is odd */
+		var2digitpairs = (uint32 *) (dig + res_ndigitpairs);
+
+		for (i2 = 0; i2 < var2ndigitpairs - 1; i2++)
+			var2digitpairs[i2] = var2digits[2 * i2] * NBASE + var2digits[2 * i2 + 1];
+
+		if (2 * i2 + 1 < var2ndigits)
+			var2digitpairs[i2] = var2digits[2 * i2] * NBASE + var2digits[2 * i2 + 1];
+		else
+			var2digitpairs[i2] = var2digits[2 * i2] * NBASE;
+
+		/*
+		 * As above, we start by multiplying var2 by the least significant
+		 * contributing digit pair from var1, storing the results at the end
+		 * of dig[], and filling the leading slots with zeros.
+		 *
+		 * Here, however, digit pair i1 of var1 and digit pair i2 of var2 are
+		 * multiplied and added to digit i1+i2+pair_offset of the accumulator
+		 * array.
+		 */
+		i1 = var1ndigitpairs - 1;
+		if (2 * i1 + 1 < var1ndigits)
+			var1digitpair = var1digits[2 * i1] * NBASE + var1digits[2 * i1 + 1];
+		else
+			var1digitpair = var1digits[2 * i1] * NBASE;
+		maxdig = var1digitpair;
+
+		i2limit = Min(var2ndigitpairs, res_ndigitpairs - i1 - pair_offset);
+		dig_i1_off = &dig[i1 + pair_offset];
+
+		memset(dig, 0, (i1 + pair_offset) * sizeof(uint64));
+		for (i2 = 0; i2 < i2limit; i2++)
+			dig_i1_off[i2] = (uint64) var1digitpair * var2digitpairs[i2];
+
+		/*
+		 * Next, multiply var2 by the remaining digit pairs of var1, adding
+		 * the results to dig[] at the appropriate offsets.
 		 */
+		for (i1 = i1 - 1; i1 >= 0; i1--)
 		{
-			int			i2limit = Min(var2ndigits, res_ndigits - i1 - 2);
-			int		   *dig_i1_2 = &dig[i1 + 2];
+			var1digitpair = var1digits[2 * i1] * NBASE + var1digits[2 * i1 + 1];
+			if (var1digitpair == 0)
+				continue;
+
+			/* Time to normalize? */
+			maxdig += var1digitpair;
+			if (maxdig > (PG_UINT64_MAX - PG_UINT64_MAX / NBASE_SQR) / (NBASE_SQR - 1))
+			{
+				/* Yes, do it (to base NBASE^2) */
+				carry = 0;
+				for (i = res_ndigitpairs - 1; i >= 0; i--)
+				{
+					newdig = dig[i] + carry;
+					if (newdig >= NBASE_SQR)
+						carry = div_mod_NBASE_SQR(newdig, &newdig);
+					else
+						carry = 0;
+					dig[i] = newdig;
+				}
+				Assert(carry == 0);
+				/* Reset maxdig to indicate new worst-case */
+				maxdig = 1 + var1digitpair;
+			}
+
+			/* Multiply and add */
+			i2limit = Min(var2ndigitpairs, res_ndigitpairs - i1 - pair_offset);
+			dig_i1_off = &dig[i1 + pair_offset];
 
 			for (i2 = 0; i2 < i2limit; i2++)
-				dig_i1_2[i2] += var1digit * var2digits[i2];
+				dig_i1_off[i2] += (uint64) var1digitpair * var2digitpairs[i2];
 		}
-	}
 
-	/*
-	 * Now we do a final carry propagation pass to normalize the result, which
-	 * we combine with storing the result digits into the output. Note that
-	 * this is still done at full precision w/guard digits.
-	 */
-	alloc_var(result, res_ndigits);
-	res_digits = result->digits;
-	carry = 0;
-	for (i = res_ndigits - 1; i >= 0; i--)
-	{
-		newdig = dig[i] + carry;
-		if (newdig >= NBASE)
+		/*
+		 * Now we do a final carry propagation pass to normalize back to base
+		 * NBASE^2, and construct the base-NBASE result digits.
+		 */
+		alloc_var(result, res_ndigits);
+		res_digits = result->digits;
+		carry = 0;
+		for (i = res_ndigitpairs - 1; i >= 0; i--)
 		{
-			carry = newdig / NBASE;
-			newdig -= carry * NBASE;
+			newdig = dig[i] + carry;
+			if (newdig >= NBASE_SQR)
+				carry = div_mod_NBASE_SQR(newdig, &newdig);
+			else
+				carry = 0;
+			res_digits[2 * i + 1] = (NumericDigit) ((uint32) newdig % NBASE);
+			res_digits[2 * i] = (NumericDigit) ((uint32) newdig / NBASE);
 		}
-		else
-			carry = 0;
-		res_digits[i] = newdig;
-	}
-	Assert(carry == 0);
+		Assert(carry == 0);
 
-	pfree(dig);
+		pfree(dig);
+
+		/*
+		 * Adjust the result weight, if the inputs were shifted up during base
+		 * conversion (if they had an odd number of NBASE digits).
+		 */
+		res_weight -= (var1ndigits & 1) + (var2ndigits & 1);
+	}
 
 	/*
 	 * Finally, round the result to the requested precision.
-- 
2.35.3
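For reference, the multiply-and-shift trick behind div_mod_NBASE_SQR() can be sketched as below. The constant and the use of unsigned __int128 are my own reconstruction, not the patch's actual code -- a real 32-bit implementation would compose the high multiply from 32x32->64 products. Since NBASE^2 = 2^8 * 5^8, shifting out the factor of 2^8 first leaves a 56-bit dividend, for which a 64-bit reciprocal multiplier exists (a full 64-bit dividend would need a 65-bit multiplier, which is why the pre-shift helps):

```c
#include <assert.h>
#include <stdint.h>

#define NBASE      10000
#define NBASE_SQR  ((uint64_t) NBASE * NBASE)	/* 10^8 */

/*
 * Divide by the constant NBASE^2 via multiply-and-shift, avoiding the
 * slow __udivdi3() call that 32-bit builds otherwise make for 64-bit
 * division.  With M = floor(2^83 / 10^8) + 1 = 96714065569170334,
 * it can be shown that ((n >> 8) * M) >> 75 == n / 10^8 for every
 * 64-bit n, because floor(floor(n / 2^8) / 5^8) == floor(n / 10^8).
 *
 * This sketch uses unsigned __int128 (a GCC/Clang extension) for the
 * wide multiply; the function name matches the patch but the body here
 * is a reconstruction.
 */
static uint64_t
div_mod_NBASE_SQR(uint64_t n, uint64_t *remainder)
{
	uint64_t	q = (uint64_t) (((unsigned __int128) (n >> 8) *
								UINT64_C(96714065569170334)) >> 75);

	*remainder = n - q * NBASE_SQR;
	return q;
}
```

With this, each step of the carry-propagation loop becomes a multiply, a shift, and a multiply-subtract, instead of a library-call division.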

From 7d531ced553d960dac44df90cce9f462d93f3813 Mon Sep 17 00:00:00 2001
From: Dean Rasheed <dean.a.rash...@gmail.com>
Date: Thu, 18 Jul 2024 17:38:59 +0100
Subject: [PATCH v3 1/2] Extend mul_var_short() to 5 and 6-digit inputs.

Commit ca481d3c9a introduced mul_var_short(), which is used by
mul_var() whenever the shorter input has 1-4 NBASE digits and the
exact product is requested. As speculated on in that commit, it can be
extended to work for more digits in the shorter input. This commit
extends it up to 6 NBASE digits (21-24 decimal digits), for which it
also gives a significant speedup.

To avoid excessive code bloat and duplication, refactor it a bit using
macros and exploiting the fact that some portions of the code are
shared between the different cases.
---
 src/backend/utils/adt/numeric.c | 173 ++++++++++++++++++++++----------
 1 file changed, 122 insertions(+), 51 deletions(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index d0f0923710..9b9b88662a 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -8720,10 +8720,10 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	}
 
 	/*
-	 * If var1 has 1-4 digits and the exact result was requested, delegate to
+	 * If var1 has 1-6 digits and the exact result was requested, delegate to
 	 * mul_var_short() which uses a faster direct multiplication algorithm.
 	 */
-	if (var1ndigits <= 4 && rscale == var1->dscale + var2->dscale)
+	if (var1ndigits <= 6 && rscale == var1->dscale + var2->dscale)
 	{
 		mul_var_short(var1, var2, result);
 		return;
@@ -8882,7 +8882,7 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 /*
  * mul_var_short() -
  *
- *	Special-case multiplication function used when var1 has 1-4 digits, var2
+ *	Special-case multiplication function used when var1 has 1-6 digits, var2
  *	has at least as many digits as var1, and the exact product var1 * var2 is
  *	requested.
  */
@@ -8904,7 +8904,7 @@ mul_var_short(const NumericVar *var1, const NumericVar *var2,
 
 	/* Check preconditions */
 	Assert(var1ndigits >= 1);
-	Assert(var1ndigits <= 4);
+	Assert(var1ndigits <= 6);
 	Assert(var2ndigits >= var1ndigits);
 
 	/*
@@ -8931,6 +8931,13 @@ mul_var_short(const NumericVar *var1, const NumericVar *var2,
 	 * carry up as we go.  The i'th result digit consists of the sum of the
 	 * products var1digits[i1] * var2digits[i2] for which i = i1 + i2 + 1.
 	 */
+#define PRODSUM1(v1,i1,v2,i2) ((v1)[i1] * (v2)[i2])
+#define PRODSUM2(v1,i1,v2,i2) (PRODSUM1(v1,i1,v2,i2) + (v1)[i1+1] * (v2)[i2-1])
+#define PRODSUM3(v1,i1,v2,i2) (PRODSUM2(v1,i1,v2,i2) + (v1)[i1+2] * (v2)[i2-2])
+#define PRODSUM4(v1,i1,v2,i2) (PRODSUM3(v1,i1,v2,i2) + (v1)[i1+3] * (v2)[i2-3])
+#define PRODSUM5(v1,i1,v2,i2) (PRODSUM4(v1,i1,v2,i2) + (v1)[i1+4] * (v2)[i2-4])
+#define PRODSUM6(v1,i1,v2,i2) (PRODSUM5(v1,i1,v2,i2) + (v1)[i1+5] * (v2)[i2-5])
+
 	switch (var1ndigits)
 	{
 		case 1:
@@ -8944,7 +8951,7 @@ mul_var_short(const NumericVar *var1, const NumericVar *var2,
 			carry = 0;
 			for (int i = res_ndigits - 2; i >= 0; i--)
 			{
-				term = (uint32) var1digits[0] * var2digits[i] + carry;
+				term = PRODSUM1(var1digits, 0, var2digits, i) + carry;
 				res_digits[i + 1] = (NumericDigit) (term % NBASE);
 				carry = term / NBASE;
 			}
@@ -8960,23 +8967,17 @@ mul_var_short(const NumericVar *var1, const NumericVar *var2,
 			 * ----------
 			 */
 			/* last result digit and carry */
-			term = (uint32) var1digits[1] * var2digits[res_ndigits - 3];
+			term = PRODSUM1(var1digits, 1, var2digits, var2ndigits - 1);
 			res_digits[res_ndigits - 1] = (NumericDigit) (term % NBASE);
 			carry = term / NBASE;
 
 			/* remaining digits, except for the first two */
-			for (int i = res_ndigits - 3; i >= 1; i--)
+			for (int i = var2ndigits - 1; i >= 1; i--)
 			{
-				term = (uint32) var1digits[0] * var2digits[i] +
-					(uint32) var1digits[1] * var2digits[i - 1] + carry;
+				term = PRODSUM2(var1digits, 0, var2digits, i) + carry;
 				res_digits[i + 1] = (NumericDigit) (term % NBASE);
 				carry = term / NBASE;
 			}
-
-			/* first two digits */
-			term = (uint32) var1digits[0] * var2digits[0] + carry;
-			res_digits[1] = (NumericDigit) (term % NBASE);
-			res_digits[0] = (NumericDigit) (term / NBASE);
 			break;
 
 		case 3:
@@ -8988,34 +8989,21 @@ mul_var_short(const NumericVar *var1, const NumericVar *var2,
 			 * ----------
 			 */
 			/* last two result digits */
-			term = (uint32) var1digits[2] * var2digits[res_ndigits - 4];
+			term = PRODSUM1(var1digits, 2, var2digits, var2ndigits - 1);
 			res_digits[res_ndigits - 1] = (NumericDigit) (term % NBASE);
 			carry = term / NBASE;
 
-			term = (uint32) var1digits[1] * var2digits[res_ndigits - 4] +
-				(uint32) var1digits[2] * var2digits[res_ndigits - 5] + carry;
+			term = PRODSUM2(var1digits, 1, var2digits, var2ndigits - 1) + carry;
 			res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
 			carry = term / NBASE;
 
 			/* remaining digits, except for the first three */
-			for (int i = res_ndigits - 4; i >= 2; i--)
+			for (int i = var2ndigits - 1; i >= 2; i--)
 			{
-				term = (uint32) var1digits[0] * var2digits[i] +
-					(uint32) var1digits[1] * var2digits[i - 1] +
-					(uint32) var1digits[2] * var2digits[i - 2] + carry;
+				term = PRODSUM3(var1digits, 0, var2digits, i) + carry;
 				res_digits[i + 1] = (NumericDigit) (term % NBASE);
 				carry = term / NBASE;
 			}
-
-			/* first three digits */
-			term = (uint32) var1digits[0] * var2digits[1] +
-				(uint32) var1digits[1] * var2digits[0] + carry;
-			res_digits[2] = (NumericDigit) (term % NBASE);
-			carry = term / NBASE;
-
-			term = (uint32) var1digits[0] * var2digits[0] + carry;
-			res_digits[1] = (NumericDigit) (term % NBASE);
-			res_digits[0] = (NumericDigit) (term / NBASE);
 			break;
 
 		case 4:
@@ -9027,45 +9015,128 @@ mul_var_short(const NumericVar *var1, const NumericVar *var2,
 			 * ----------
 			 */
 			/* last three result digits */
-			term = (uint32) var1digits[3] * var2digits[res_ndigits - 5];
+			term = PRODSUM1(var1digits, 3, var2digits, var2ndigits - 1);
 			res_digits[res_ndigits - 1] = (NumericDigit) (term % NBASE);
 			carry = term / NBASE;
 
-			term = (uint32) var1digits[2] * var2digits[res_ndigits - 5] +
-				(uint32) var1digits[3] * var2digits[res_ndigits - 6] + carry;
+			term = PRODSUM2(var1digits, 2, var2digits, var2ndigits - 1) + carry;
 			res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
 			carry = term / NBASE;
 
-			term = (uint32) var1digits[1] * var2digits[res_ndigits - 5] +
-				(uint32) var1digits[2] * var2digits[res_ndigits - 6] +
-				(uint32) var1digits[3] * var2digits[res_ndigits - 7] + carry;
+			term = PRODSUM3(var1digits, 1, var2digits, var2ndigits - 1) + carry;
 			res_digits[res_ndigits - 3] = (NumericDigit) (term % NBASE);
 			carry = term / NBASE;
 
 			/* remaining digits, except for the first four */
-			for (int i = res_ndigits - 5; i >= 3; i--)
+			for (int i = var2ndigits - 1; i >= 3; i--)
 			{
-				term = (uint32) var1digits[0] * var2digits[i] +
-					(uint32) var1digits[1] * var2digits[i - 1] +
-					(uint32) var1digits[2] * var2digits[i - 2] +
-					(uint32) var1digits[3] * var2digits[i - 3] + carry;
+				term = PRODSUM4(var1digits, 0, var2digits, i) + carry;
 				res_digits[i + 1] = (NumericDigit) (term % NBASE);
 				carry = term / NBASE;
 			}
+			break;
 
-			/* first four digits */
-			term = (uint32) var1digits[0] * var2digits[2] +
-				(uint32) var1digits[1] * var2digits[1] +
-				(uint32) var1digits[2] * var2digits[0] + carry;
-			res_digits[3] = (NumericDigit) (term % NBASE);
+		case 5:
+			/* ---------
+			 * 5-digit case:
+			 *		var1ndigits = 5
+			 *		var2ndigits >= 5
+			 *		res_ndigits = var2ndigits + 5
+			 * ----------
+			 */
+			/* last four result digits */
+			term = PRODSUM1(var1digits, 4, var2digits, var2ndigits - 1);
+			res_digits[res_ndigits - 1] = (NumericDigit) (term % NBASE);
 			carry = term / NBASE;
 
-			term = (uint32) var1digits[0] * var2digits[1] +
-				(uint32) var1digits[1] * var2digits[0] + carry;
-			res_digits[2] = (NumericDigit) (term % NBASE);
+			term = PRODSUM2(var1digits, 3, var2digits, var2ndigits - 1) + carry;
+			res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
+			carry = term / NBASE;
+
+			term = PRODSUM3(var1digits, 2, var2digits, var2ndigits - 1) + carry;
+			res_digits[res_ndigits - 3] = (NumericDigit) (term % NBASE);
 			carry = term / NBASE;
 
-			term = (uint32) var1digits[0] * var2digits[0] + carry;
+			term = PRODSUM4(var1digits, 1, var2digits, var2ndigits - 1) + carry;
+			res_digits[res_ndigits - 4] = (NumericDigit) (term % NBASE);
+			carry = term / NBASE;
+
+			/* remaining digits, except for the first five */
+			for (int i = var2ndigits - 1; i >= 4; i--)
+			{
+				term = PRODSUM5(var1digits, 0, var2digits, i) + carry;
+				res_digits[i + 1] = (NumericDigit) (term % NBASE);
+				carry = term / NBASE;
+			}
+			break;
+
+		case 6:
+			/* ---------
+			 * 6-digit case:
+			 *		var1ndigits = 6
+			 *		var2ndigits >= 6
+			 *		res_ndigits = var2ndigits + 6
+			 * ----------
+			 */
+			/* last five result digits */
+			term = PRODSUM1(var1digits, 5, var2digits, var2ndigits - 1);
+			res_digits[res_ndigits - 1] = (NumericDigit) (term % NBASE);
+			carry = term / NBASE;
+
+			term = PRODSUM2(var1digits, 4, var2digits, var2ndigits - 1) + carry;
+			res_digits[res_ndigits - 2] = (NumericDigit) (term % NBASE);
+			carry = term / NBASE;
+
+			term = PRODSUM3(var1digits, 3, var2digits, var2ndigits - 1) + carry;
+			res_digits[res_ndigits - 3] = (NumericDigit) (term % NBASE);
+			carry = term / NBASE;
+
+			term = PRODSUM4(var1digits, 2, var2digits, var2ndigits - 1) + carry;
+			res_digits[res_ndigits - 4] = (NumericDigit) (term % NBASE);
+			carry = term / NBASE;
+
+			term = PRODSUM5(var1digits, 1, var2digits, var2ndigits - 1) + carry;
+			res_digits[res_ndigits - 5] = (NumericDigit) (term % NBASE);
+			carry = term / NBASE;
+
+			/* remaining digits, except for the first six */
+			for (int i = var2ndigits - 1; i >= 5; i--)
+			{
+				term = PRODSUM6(var1digits, 0, var2digits, i) + carry;
+				res_digits[i + 1] = (NumericDigit) (term % NBASE);
+				carry = term / NBASE;
+			}
+			break;
+	}
+
+	/*
+	 * Finally, for var1ndigits > 1, compute the remaining var1ndigits most
+	 * significant result digits.
+	 */
+	switch (var1ndigits)
+	{
+		case 6:
+			term = PRODSUM5(var1digits, 0, var2digits, 4) + carry;
+			res_digits[5] = (NumericDigit) (term % NBASE);
+			carry = term / NBASE;
+			/* FALLTHROUGH */
+		case 5:
+			term = PRODSUM4(var1digits, 0, var2digits, 3) + carry;
+			res_digits[4] = (NumericDigit) (term % NBASE);
+			carry = term / NBASE;
+			/* FALLTHROUGH */
+		case 4:
+			term = PRODSUM3(var1digits, 0, var2digits, 2) + carry;
+			res_digits[3] = (NumericDigit) (term % NBASE);
+			carry = term / NBASE;
+			/* FALLTHROUGH */
+		case 3:
+			term = PRODSUM2(var1digits, 0, var2digits, 1) + carry;
+			res_digits[2] = (NumericDigit) (term % NBASE);
+			carry = term / NBASE;
+			/* FALLTHROUGH */
+		case 2:
+			term = PRODSUM1(var1digits, 0, var2digits, 0) + carry;
 			res_digits[1] = (NumericDigit) (term % NBASE);
 			res_digits[0] = (NumericDigit) (term / NBASE);
 			break;
-- 
2.35.3
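As a sanity check on the PRODSUM refactoring, here is a self-contained version of the 2-digit case. The helper name mul_short2 and the explicit uint32 casts are mine; the patch itself shares the leading-digits code between cases via the fallthrough switch, whereas this sketch folds that tail back in:

```c
#include <assert.h>
#include <stdint.h>

#define NBASE 10000
typedef int16_t NumericDigit;

/* Diagonal product-sum macros, in the style the patch introduces */
#define PRODSUM1(v1,i1,v2,i2) ((uint32_t) (v1)[i1] * (v2)[i2])
#define PRODSUM2(v1,i1,v2,i2) (PRODSUM1(v1,i1,v2,i2) + \
							   (uint32_t) (v1)[(i1)+1] * (v2)[(i2)-1])

/*
 * Multiply a 2-digit base-NBASE number by an n-digit one, mirroring
 * mul_var_short()'s case 2.  res must have room for var2ndigits + 2
 * digits; the i'th result digit is the sum of products
 * var1[i1] * var2[i2] with i = i1 + i2 + 1, plus carries.
 */
static void
mul_short2(const NumericDigit *var1, const NumericDigit *var2,
		   int var2ndigits, NumericDigit *res)
{
	int			res_ndigits = var2ndigits + 2;
	uint32_t	term,
				carry;

	/* last result digit and carry */
	term = PRODSUM1(var1, 1, var2, var2ndigits - 1);
	res[res_ndigits - 1] = (NumericDigit) (term % NBASE);
	carry = term / NBASE;

	/* middle digits: each is a 2-term diagonal sum plus carry */
	for (int i = var2ndigits - 1; i >= 1; i--)
	{
		term = PRODSUM2(var1, 0, var2, i) + carry;
		res[i + 1] = (NumericDigit) (term % NBASE);
		carry = term / NBASE;
	}

	/* first two digits */
	term = PRODSUM1(var1, 0, var2, 0) + carry;
	res[1] = (NumericDigit) (term % NBASE);
	res[0] = (NumericDigit) (term / NBASE);
}
```

For example, multiplying {12, 3456} (= 123456) by {7, 8901, 2345} (= 789012345) yields the five base-NBASE digits of 97408308064320.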
