On Mon, 29 Jul 2024 at 18:57, Joel Jacobson <j...@compiler.org> wrote: > > Thanks to v3-0002, they are all still significantly faster when both patches > have been applied, > but I wonder if it is expected or not, that v3-0001 temporarily made them a > bit slower? >
There's no obvious reason why 0001 would make those cases slower, but the fact that, together with 0002, it's a significant net win, and the gains for 5 and 6-digit inputs make it worthwhile, in my opinion. Something I did notice in my tests was that if ndigits was a small multiple of 8, the old code was disproportionately faster, which can be explained by the fact that the computation fits exactly into a whole number of XMM register operations, with no remaining digits to process. For example, these results from above: ndigits1 | ndigits2 | PG17 rate | patch rate | % change ----------+----------+---------------+---------------+---------- 15 | 15 | 3.7595882e+06 | 5.0751355e+06 | +34.99% 16 | 16 | 4.3353435e+06 | 4.970363e+06 | +14.65% 17 | 17 | 3.9258755e+06 | 4.935394e+06 | +25.71% 23 | 23 | 2.7975982e+06 | 4.5065035e+06 | +61.08% 24 | 24 | 3.2456168e+06 | 4.4578115e+06 | +37.35% 25 | 25 | 2.9515055e+06 | 4.0208335e+06 | +36.23% 31 | 31 | 2.169437e+06 | 3.7209152e+06 | +71.52% 32 | 32 | 2.5022498e+06 | 3.6609378e+06 | +46.31% 33 | 33 | 2.27133e+06 | 3.435459e+06 | +51.25% (Note how 16x16 was much faster than 15x15, for example.) The patched code seems to do a better job at levelling out and coping with arbitrary-sized inputs, not just those that fit exactly into a whole number of loops using SSE2 operations. Something else I noticed was that the relative gains for large numbers of digits were much higher with clang than with gcc: gcc 13.3.0: 16383 | 16383 | 21.629467 | 73.58552 | +240.21% clang 15.0.7: 16383 | 16383 | 11.562384 | 73.00517 | +531.40% That seems to be because clang doesn't do a good job of generating efficient SSE2 code in the old case of 16-bit x 16-bit multiplications. Looking on godbolt.org, it generates overly-complicated code using PMULUDQ, which actually does 32-bit x 32-bit multiplications. Gcc, on the other hand, generates a much more compact loop, using PMULHW and PMULLW, which is much faster. With the patch, they both generate the same SSE2 code, so the results are pretty consistent. Regards, Dean