On Mon, Jul 29, 2024, at 22:01, Dean Rasheed wrote:
> On Mon, 29 Jul 2024 at 18:57, Joel Jacobson <j...@compiler.org> wrote:
>>
>> Thanks to v3-0002, they are all still significantly faster when both patches
>> have been applied,
>> but I wonder if it is expected or not, that v3-0001 temporarily made them a
>> bit slower?
>>
>
> There's no obvious reason why 0001 would make those cases slower, but
> the fact that, together with 0002, it's a significant net win, and the
> gains for 5 and 6-digit inputs make it worthwhile, in my opinion.
Yes, I agree. I just thought it was noteworthy, not a problem per se.

> Something I did notice in my tests was that if ndigits was a small
> multiple of 8, the old code was disproportionately faster, which can
> be explained by the fact that the computation fits exactly into a
> whole number of XMM register operations, with no remaining digits to
> process. For example, these results from above:
>
>  ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
> ----------+----------+---------------+---------------+----------
>        15 |       15 | 3.7595882e+06 | 5.0751355e+06 |  +34.99%
>        16 |       16 | 4.3353435e+06 |  4.970363e+06 |  +14.65%
>        17 |       17 | 3.9258755e+06 |  4.935394e+06 |  +25.71%
>
>        23 |       23 | 2.7975982e+06 | 4.5065035e+06 |  +61.08%
>        24 |       24 | 3.2456168e+06 | 4.4578115e+06 |  +37.35%
>        25 |       25 | 2.9515055e+06 | 4.0208335e+06 |  +36.23%
>
>        31 |       31 |  2.169437e+06 | 3.7209152e+06 |  +71.52%
>        32 |       32 | 2.5022498e+06 | 3.6609378e+06 |  +46.31%
>        33 |       33 |   2.27133e+06 |  3.435459e+06 |  +51.25%
>
> (Note how 16x16 was much faster than 15x15, for example.)
>
> The patched code seems to do a better job at levelling out and coping
> with arbitrary-sized inputs, not just those that fit exactly into a
> whole number of loops using SSE2 operations.

That's nice.

> Something else I noticed was that the relative gains for large numbers
> of digits were much higher with clang than with gcc:
>
> gcc 13.3.0:
>
>     16383 |    16383 | 21.629467 | 73.58552 | +240.21%
>
> clang 15.0.7:
>
>     16383 |    16383 | 11.562384 | 73.00517 | +531.40%
>
> That seems to be because clang doesn't do a good job of generating
> efficient SSE2 code in the old case of 16-bit x 16-bit
> multiplications. Looking on godbolt.org, it generates
> overly-complicated code using PMULUDQ, which actually does 32-bit x
> 32-bit multiplications. Gcc, on the other hand, generates a much more
> compact loop, using PMULHW and PMULLW, which is much faster. With the
> patch, they both generate the same SSE2 code, so the results are
> pretty consistent.

Very nice.

I've now also had an initial look at the actual code of the patches:

* v3-0001

Looks pretty straightforward. Nice work with the PRODSUM macros; they
really improve readability a lot.

I like this simplification, using `var2ndigits` instead of `res_ndigits`:

- for (int i = res_ndigits - 3; i >= 1; i--)
+ for (int i = var2ndigits - 1; i >= 1; i--)

But I wonder, why doesn't `case 1:` follow the same pattern?

for (int i = res_ndigits - 2; i >= 0; i--)

* v3-0002

I think it's non-obvious whether the separate code paths for 32-bit and
64-bit, using `#if SIZEOF_DATUM < 8` to get *fast* 32-bit support,
outweigh the benefit of simpler code.

Earlier in this thread, you brought up the question of whether 32-bit
systems should be regarded as legacy. Unfortunately, we didn't get any
feedback, so I'm starting a separate thread, with the subject
"Is fast 32-bit code still important?", hoping to get more input to
help us make judgement calls.

/Joel
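
PS. For readers skimming the thread, here is a rough, self-contained
sketch of the idea I'm praising above: sum all of the digit products
that fall into one result column in a wide accumulator, and only then
split off the carry. This is my own illustration, not the code from
v3-0001 -- the helper name prodsum_column is made up, and the digits
are stored least-significant-first here, unlike in numeric.c -- but it
compiles and multiplies two base-NBASE digit arrays correctly:

#include <stdint.h>
#include <stdio.h>

#define NBASE 10000

/* Sum of all digit products that land in result column k. */
static uint64_t
prodsum_column(const int16_t *d1, int n1, const int16_t *d2, int n2, int k)
{
    uint64_t sum = 0;

    for (int i = 0; i < n1; i++)
    {
        int j = k - i;

        if (j >= 0 && j < n2)
            sum += (uint64_t) d1[i] * d2[j];
    }
    return sum;
}

/* Schoolbook multiplication; res must have room for n1 + n2 digits. */
static void
mul_digits(const int16_t *d1, int n1, const int16_t *d2, int n2, int16_t *res)
{
    uint64_t carry = 0;

    for (int k = 0; k < n1 + n2; k++)
    {
        uint64_t acc = carry + prodsum_column(d1, n1, d2, n2, k);

        res[k] = (int16_t) (acc % NBASE);
        carry = acc / NBASE;
    }
}

int
main(void)
{
    /* 12345678 = [5678, 1234] and 87654321 = [4321, 8765], LSD first */
    int16_t a[] = {5678, 1234};
    int16_t b[] = {4321, 8765};
    int16_t r[4] = {0};

    mul_digits(a, 2, b, 2, r);

    /* Expected: 12345678 * 87654321 = 1082152022374638 */
    printf("%d%04d%04d%04d\n", r[3], r[2], r[1], r[0]);
    return 0;
}

As I understand it, the PRODSUM macros in v3-0001 spell out this kind
of column sum directly for the small fixed-size cases, which is why
those special cases read so well.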
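
PPS. To make the trade-off I'm asking about more concrete, this is the
general shape of the split, again only an illustrative sketch and not
the code from v3-0002. Outside the backend I use UINTPTR_MAX as a
stand-in for the `#if SIZEOF_DATUM < 8` test, and the names prod_acc_t
and MAX_PRODUCTS_PER_ACC are made up:

#include <stdint.h>
#include <stdio.h>

#define NBASE 10000

/*
 * Two accumulator strategies, chosen at compile time.  On a 32-bit
 * build the accumulator overflows much sooner, so carries have to be
 * propagated far more often; on a 64-bit build they can be deferred.
 */
#if UINTPTR_MAX < UINT64_MAX        /* cf. #if SIZEOF_DATUM < 8 */
typedef uint32_t prod_acc_t;
/* (NBASE - 1)^2 = 99980001, so a uint32 holds at most 42 such products */
#define MAX_PRODUCTS_PER_ACC 42
#else
typedef uint64_t prod_acc_t;
/* a uint64 holds roughly 1.8e11 such products before overflowing */
#define MAX_PRODUCTS_PER_ACC 184000000000ULL
#endif

int
main(void)
{
    printf("accumulator: %zu bytes, carry needed after %llu products\n",
           sizeof(prod_acc_t),
           (unsigned long long) MAX_PRODUCTS_PER_ACC);
    return 0;
}

The question for v3-0002 is whether maintaining both halves of such an
#if is worth it, or whether a single 64-bit-oriented path, slower on
32-bit builds, would be acceptable.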