On Tue, 6 Feb 2024, Elena Ufimtseva wrote:

> Hello Alexander
>
> On Tue, Feb 6, 2024 at 12:50 PM Alexander Monakov <amona...@ispras.ru>
> wrote:
>
> > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
> > routines are invoked much more rarely in normal use when most buffers
> > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
> > frequency and voltage transition periods during which the CPU operates
> > at reduced performance, as described in
> > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
>
> I would like to point out that the frequency scaling is not currently an
> issue on AMD Zen4 Genoa CPUs, for example.
> And microcode architecture description here:
> https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf
> Although, the cpu frequency downscaling mentioned in the above document is
> only in relation to floating point operations.
> But from other online discussions I gather that the data path for the
> integer registers in Zen4 is also 256 bits and it allows to avoid
> frequency downscaling for FP and heavy instructions.
Yes, that's correct: in particular, on Zen 4, 512-bit vector loads occupy the
load ports for two consecutive cycles, so from a load-throughput perspective
there is no difference between 256-bit and 512-bit vectors. Generally AVX-512
still has benefits on Zen 4 since it is a richer instruction set (it also
reduces pressure on the CPU front-end and is more power-efficient), but since
the new AVX2 buffer_is_zero already saturates the load ports, I would expect
AVX-512 to exceed its performance only by a small margin, if at all, and
nowhere near 2x.

> And looking at the optimizations for AVX2 in your other patch, would
> unrolling the loop for AVX512 ops benefit from the speedup taken that the
> data path has the same width?

No, the 256-bit datapath on Zen 4 means it is easier to saturate with 512-bit
loads than with 256-bit loads, so an AVX-512 loop is roughly comparable to a
similar 256-bit AVX loop unrolled twice.

Aside: an AVX-512 variant needs a little more thought to use VPTERNLOG
properly (a rough sketch of what I mean is in the P.S. below).

> If the frequency downscaling is not observed on some of the CPUs, can
> AVX512 be maintained and used selectively for some
> of the CPUs?

Please note that a properly optimized buffer_is_zero is limited by load
throughput, not by ALUs. On Zen 4, AVX2 is sufficient to saturate L1 cache
load bandwidth in buffer_is_zero. For data outside of the L1 cache, the
benefits of AVX-512 diminish further. I don't have Zen 4 machines at hand to
check whether AVX-512 helps buffer_is_zero there for other reasons, such as
reaching higher turbo clocks or higher memory-level parallelism.

Finally, let's consider a somewhat broader perspective. Suppose buffer_is_zero
takes 50% of overall application runtime, and 9 out of 10 buffers are found to
be non-zero by the inline wrapper that samples three bytes. Then the
vectorized routine accounts for only about 5% of application time, and
speeding it up even by 20% shaves just 1% off overall execution time.

Alexander
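
P.S. To make the load-port argument a bit more concrete, here is a rough
sketch of the kind of load-bound AVX2 inner loop I have in mind. It is
illustrative only, not the code from the patch: the function name is made up,
and it assumes an aligned buffer whose length is a multiple of 64 bytes (a
real routine also needs head/tail handling and periodic checks so it can exit
early on non-zero data).

#include <immintrin.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sketch only: two 32-byte loads per iteration, OR-accumulated,
 * with a single test at the end, so the loop body is essentially just loads
 * plus cheap VPORs and is limited by load throughput. */
static bool buffer_is_zero_avx2_sketch(const void *buf, size_t len)
{
    /* Assumes buf is 32-byte aligned and len is a multiple of 64. */
    const __m256i *p = buf;
    const __m256i *end = (const __m256i *)((const char *)buf + len);
    __m256i acc0 = _mm256_setzero_si256();
    __m256i acc1 = _mm256_setzero_si256();

    for (; p < end; p += 2) {
        acc0 = _mm256_or_si256(acc0, _mm256_load_si256(p));
        acc1 = _mm256_or_si256(acc1, _mm256_load_si256(p + 1));
    }
    acc0 = _mm256_or_si256(acc0, acc1);
    return _mm256_testz_si256(acc0, acc0);
}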
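
And this is what I mean by using VPTERNLOG properly in an AVX-512 variant:
with immediate 0xFE, VPTERNLOGQ computes a | b | c, so each instruction folds
two fresh 64-byte loads into the accumulator instead of needing two separate
VPORs. Again just a sketch under the same simplifying assumptions (same
headers as above, aligned buffer, length a multiple of 128, no early exit):

static bool buffer_is_zero_avx512_sketch(const void *buf, size_t len)
{
    /* Assumes buf is 64-byte aligned and len is a multiple of 128. */
    const __m512i *p = buf;
    const __m512i *end = (const __m512i *)((const char *)buf + len);
    __m512i acc = _mm512_setzero_si512();

    for (; p < end; p += 2) {
        /* acc = acc | p[0] | p[1] in one VPTERNLOGQ (imm8 = 0xFE). */
        acc = _mm512_ternarylogic_epi64(acc, _mm512_load_si512(p),
                                        _mm512_load_si512(p + 1), 0xFE);
    }
    /* Mask is zero iff no qword of acc has any bit set. */
    return _mm512_test_epi64_mask(acc, acc) == 0;
}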