Re: Optimize mul_var() for var1ndigits >= 8

Dean Rasheed Mon, 29 Jul 2024 13:02:23 -0700

On Mon, 29 Jul 2024 at 18:57, Joel Jacobson <j...@compiler.org> wrote:
>
> Thanks to v3-0002, they are all still significantly faster when both
> patches have been applied, but I wonder if it is expected or not, that
> v3-0001 temporarily made them a bit slower?
>


There's no obvious reason why 0001 would make those cases slower, but
together with 0002 it's a significant net win, and that, plus the gains
for 5 and 6-digit inputs, makes it worthwhile, in my opinion.

Something I did notice in my tests was that if ndigits was a small
multiple of 8, the old code was disproportionately faster, which can
be explained by the fact that the computation fits exactly into a
whole number of XMM register operations (eight 16-bit digits per
128-bit register), with no remaining digits to process. For example,
these results from above:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
----------+----------+---------------+---------------+----------
       15 |       15 | 3.7595882e+06 | 5.0751355e+06 | +34.99%
       16 |       16 | 4.3353435e+06 |  4.970363e+06 | +14.65%
       17 |       17 | 3.9258755e+06 |  4.935394e+06 | +25.71%

       23 |       23 | 2.7975982e+06 | 4.5065035e+06 | +61.08%
       24 |       24 | 3.2456168e+06 | 4.4578115e+06 | +37.35%
       25 |       25 | 2.9515055e+06 | 4.0208335e+06 | +36.23%

       31 |       31 |  2.169437e+06 | 3.7209152e+06 | +71.52%
       32 |       32 | 2.5022498e+06 | 3.6609378e+06 | +46.31%
       33 |       33 |   2.27133e+06 |  3.435459e+06 | +51.25%

(Note how 16x16 was much faster than 15x15, for example.)

The patched code seems to do a better job at levelling out and coping
with arbitrary-sized inputs, not just those that fit exactly into a
whole number of loops using SSE2 operations.
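
To illustrate the "whole number of loops" effect, here is a minimal sketch
of a multiply-accumulate loop split into full 8-digit blocks plus a scalar
tail. It is not the code from numeric.c or from the patch; the names
LANES, scalar_tail() and mul_acc_blocked() are hypothetical, and it assumes
16-bit digits with 32-bit accumulators.

```c
#include <stdint.h>

#define LANES 8  /* assumption: eight 16-bit digits per 128-bit XMM register */

/* Number of leftover digits that a vectorized loop must handle one at a time. */
static int
scalar_tail(int ndigits)
{
    return ndigits % LANES;
}

/*
 * Illustrative multiply-accumulate: acc[i] += d * digits[i] for n digits,
 * written as full LANES-sized blocks followed by a scalar tail.  A compiler
 * may turn the inner block into SIMD instructions; the tail is processed
 * digit by digit.
 */
static void
mul_acc_blocked(int32_t *acc, int16_t d, const int16_t *digits, int n)
{
    int i = 0;

    /* Full blocks of LANES digits. */
    for (; i + LANES <= n; i += LANES)
        for (int j = 0; j < LANES; j++)
            acc[i + j] += (int32_t) d * digits[i + j];

    /* Between 0 and LANES-1 leftover digits. */
    for (; i < n; i++)
        acc[i] += (int32_t) d * digits[i];
}
```

With this layout, 16 digits leave no tail (scalar_tail(16) is 0), whereas
15 and 17 digits leave 7 and 1 leftover digits respectively, which is
consistent with 16x16 standing out in the PG17 column above.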

Something else I noticed was that the relative gains for large numbers
of digits were much higher with clang than with gcc:

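gcc 13.3.0:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
----------+----------+---------------+---------------+----------
    16383 |    16383 |     21.629467 |      73.58552 | +240.21%

clang 15.0.7:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
----------+----------+---------------+---------------+----------
    16383 |    16383 |     11.562384 |      73.00517 | +531.40%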

That seems to be because clang doesn't do a good job of generating
efficient SSE2 code for the 16-bit x 16-bit multiplications in the old
code. Looking on godbolt.org, it generates overly complicated code using
PMULUDQ, which actually does 32-bit x 32-bit multiplications. gcc, on
the other hand, generates a much more compact loop using PMULHW and
PMULLW, which is much faster. With the patch, they both generate the
same SSE2 code, so the results are pretty consistent.
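
For readers unfamiliar with those instructions, the following is a scalar
model of what a single 16-bit lane of PMULLW and PMULHW computes, and how
the two halves recombine into the full 32-bit product. It is only a sketch
of the instruction semantics, not compiler output; the function names are
hypothetical, and the digit range 0..9999 is an assumption about the
inputs.

```c
#include <stdint.h>

/* Scalar model of one PMULLW lane: low 16 bits of the signed 32-bit product. */
static uint16_t
pmullw_lane(int16_t a, int16_t b)
{
    return (uint16_t) ((uint32_t) ((int32_t) a * b) & 0xFFFF);
}

/* Scalar model of one PMULHW lane: high 16 bits of the signed 32-bit product. */
static uint16_t
pmulhw_lane(int16_t a, int16_t b)
{
    return (uint16_t) ((uint32_t) ((int32_t) a * b) >> 16);
}

/*
 * Recombine the two halves into the 32-bit product.  Inputs are assumed to
 * be digits in the range 0..9999, so the product always fits in an int32_t.
 */
static int32_t
mul16_via_hi_lo(int16_t a, int16_t b)
{
    uint32_t bits = ((uint32_t) pmulhw_lane(a, b) << 16) | pmullw_lane(a, b);

    return (int32_t) bits;
}
```

Each of the two instructions handles eight such lanes per XMM register,
whereas PMULUDQ produces only two 64-bit products per instruction, which
fits the observation that the PMULHW/PMULLW loop is the more compact one.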

Regards,
Dean
