On Mon, Jul 29, 2024, at 22:01, Dean Rasheed wrote:
> On Mon, 29 Jul 2024 at 18:57, Joel Jacobson <j...@compiler.org> wrote:
>>
>> Thanks to v3-0002, they are all still significantly faster when both patches
>> have been applied,
>> but I wonder if it is expected or not, that v3-0001 temporarily made them a
>> bit slower?
>>
>
> There's no obvious reason why 0001 would make those cases slower, but
> the fact that, together with 0002, it's a significant net win, and the
> gains for 5 and 6-digit inputs make it worthwhile, in my opinion.
Yes, I agree. I just thought it was noteworthy, not a problem per se.

> Something I did notice in my tests was that if ndigits was a small
> multiple of 8, the old code was disproportionately faster, which can
> be explained by the fact that the computation fits exactly into a
> whole number of XMM register operations, with no remaining digits to
> process. For example, these results from above:
>
>  ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
> ----------+----------+---------------+---------------+----------
>        15 |       15 | 3.7595882e+06 | 5.0751355e+06 |  +34.99%
>        16 |       16 | 4.3353435e+06 |  4.970363e+06 |  +14.65%
>        17 |       17 | 3.9258755e+06 |  4.935394e+06 |  +25.71%
>
>        23 |       23 | 2.7975982e+06 | 4.5065035e+06 |  +61.08%
>        24 |       24 | 3.2456168e+06 | 4.4578115e+06 |  +37.35%
>        25 |       25 | 2.9515055e+06 | 4.0208335e+06 |  +36.23%
>
>        31 |       31 |  2.169437e+06 | 3.7209152e+06 |  +71.52%
>        32 |       32 | 2.5022498e+06 | 3.6609378e+06 |  +46.31%
>        33 |       33 |   2.27133e+06 |  3.435459e+06 |  +51.25%
>
> (Note how 16x16 was much faster than 15x15, for example.)
>
> The patched code seems to do a better job at levelling out and coping
> with arbitrary-sized inputs, not just those that fit exactly into a
> whole number of loops using SSE2 operations.

That's nice.

> Something else I noticed was that the relative gains for large numbers
> of digits were much higher with clang than with gcc:
>
> gcc 13.3.0:
>
>     16383 |    16383 | 21.629467 | 73.58552 | +240.21%
>
> clang 15.0.7:
>
>     16383 |    16383 | 11.562384 | 73.00517 | +531.40%
>
> That seems to be because clang doesn't do a good job of generating
> efficient SSE2 code in the old case of 16-bit x 16-bit
> multiplications. Looking on godbolt.org, it generates
> overly-complicated code using PMULUDQ, which actually does 32-bit x
> 32-bit multiplications. Gcc, on the other hand, generates a much more
> compact loop, using PMULHW and PMULLW, which is much faster. With the
> patch, they both generate the same SSE2 code, so the results are
> pretty consistent.

Very nice.

I've now also had an initial look at the actual code of the patches:

* v3-0001

Looks pretty straightforward. Nice work with the PRODSUM macros; they
really improve readability a lot.

I like this simplification, using `var2ndigits` instead of `res_ndigits`:

- for (int i = res_ndigits - 3; i >= 1; i--)
+ for (int i = var2ndigits - 1; i >= 1; i--)

But I wonder, why doesn't `case 1:` follow the same pattern?

for (int i = res_ndigits - 2; i >= 0; i--)

* v3-0002

I think it's non-obvious whether the separate code paths for 32-bit and
64-bit, using `#if SIZEOF_DATUM < 8` to get *fast* 32-bit support,
outweigh the benefit of simpler code.

Earlier in this thread, you brought up the question of whether 32-bit
systems should be regarded as legacy. Unfortunately, we didn't get any
feedback, so I'm starting a separate thread, with the subject
"Is fast 32-bit code still important?", hoping to get more input to
help us make judgement calls.

/Joel
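
PS. For readers skimming the thread, here is a rough, self-contained
sketch of the idea I'm praising above: sum all of the digit products
that fall into one result column in a wide accumulator, and only then
split off the carry. This is my own illustration, not the code from
v3-0001 -- the helper name prodsum_column is made up, and the digits
are stored least-significant-first here, unlike in numeric.c -- but it
compiles and multiplies two base-NBASE digit arrays correctly:

#include <stdint.h>
#include <stdio.h>

#define NBASE 10000

/* Sum of all digit products that land in result column k. */
static uint64_t
prodsum_column(const int16_t *d1, int n1, const int16_t *d2, int n2, int k)
{
    uint64_t sum = 0;

    for (int i = 0; i < n1; i++)
    {
        int j = k - i;

        if (j >= 0 && j < n2)
            sum += (uint64_t) d1[i] * d2[j];
    }
    return sum;
}

/* Schoolbook multiplication; res must have room for n1 + n2 digits. */
static void
mul_digits(const int16_t *d1, int n1, const int16_t *d2, int n2, int16_t *res)
{
    uint64_t carry = 0;

    for (int k = 0; k < n1 + n2; k++)
    {
        uint64_t acc = carry + prodsum_column(d1, n1, d2, n2, k);

        res[k] = (int16_t) (acc % NBASE);
        carry = acc / NBASE;
    }
}

int
main(void)
{
    /* 12345678 = [5678, 1234] and 87654321 = [4321, 8765], LSD first */
    int16_t a[] = {5678, 1234};
    int16_t b[] = {4321, 8765};
    int16_t r[4] = {0};

    mul_digits(a, 2, b, 2, r);

    /* Expected: 12345678 * 87654321 = 1082152022374638 */
    printf("%d%04d%04d%04d\n", r[3], r[2], r[1], r[0]);
    return 0;
}

As I understand it, the PRODSUM macros in v3-0001 spell out this kind
of column sum directly for the small fixed-size cases, which is why
those special cases read so well.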
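
PPS. To make the trade-off I'm asking about more concrete, this is the
general shape of the split, again only an illustrative sketch and not
the code from v3-0002. Outside the backend I use UINTPTR_MAX as a
stand-in for the `#if SIZEOF_DATUM < 8` test, and the names prod_acc_t
and MAX_PRODUCTS_PER_ACC are made up:

#include <stdint.h>
#include <stdio.h>

#define NBASE 10000

/*
 * Two accumulator strategies, chosen at compile time.  On a 32-bit
 * build the accumulator overflows much sooner, so carries have to be
 * propagated far more often; on a 64-bit build they can be deferred.
 */
#if UINTPTR_MAX < UINT64_MAX        /* cf. #if SIZEOF_DATUM < 8 */
typedef uint32_t prod_acc_t;
/* (NBASE - 1)^2 = 99980001, so a uint32 holds at most 42 such products */
#define MAX_PRODUCTS_PER_ACC 42
#else
typedef uint64_t prod_acc_t;
/* a uint64 holds roughly 1.8e11 such products before overflowing */
#define MAX_PRODUCTS_PER_ACC 184000000000ULL
#endif

int
main(void)
{
    printf("accumulator: %zu bytes, carry needed after %llu products\n",
           sizeof(prod_acc_t),
           (unsigned long long) MAX_PRODUCTS_PER_ACC);
    return 0;
}

The question for v3-0002 is whether maintaining both halves of such an
#if is worth it, or whether a single 64-bit-oriented path, slower on
32-bit builds, would be acceptable.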