On Tue, 6 Feb 2024, Elena Ufimtseva wrote:

> Hello Alexander
> 
> On Tue, Feb 6, 2024 at 12:50 PM Alexander Monakov <amona...@ispras.ru>
> wrote:
> 
> > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
> > routines are invoked much more rarely in normal use when most buffers
> > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
> > frequency and voltage transition periods during which the CPU operates
> > at reduced performance, as described in
> > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
> 
> 
> I would like to point out that frequency scaling is not currently an
> issue on AMD Zen 4 (Genoa) CPUs, for example. The microarchitecture is
> described in this white paper:
> https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf
> Although the CPU frequency downscaling mentioned in that document is
> discussed only in relation to floating-point operations, from other
> online discussions I gather that the data path for the integer registers
> in Zen 4 is also 256 bits wide, which allows it to avoid frequency
> downscaling for FP and heavy instructions.

Yes, that's correct: in particular, on Zen 4, 512-bit vector loads occupy the
load ports for two consecutive cycles, so from a load-throughput perspective
there is no difference between 256-bit and 512-bit vectors. Generally AVX-512
still has benefits on Zen 4, since it is a richer instruction set (it also
reduces pressure on the CPU front-end and is more power-efficient), but as the
new AVX2 buffer_is_zero already saturates the load ports, I would expect
AVX-512 to exceed its performance only by a small margin, if at all, and
nowhere close to 2x.
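
For illustration, a load-bound AVX2 inner loop has roughly the following
shape (a minimal sketch, not the actual QEMU code; it assumes the length is
a nonzero multiple of 128 bytes and omits head/tail and early-exit handling):

#include <immintrin.h>
#include <stdbool.h>
#include <stddef.h>

static bool buffer_is_zero_avx2_sketch(const void *buf, size_t len)
{
    const __m256i *p = buf;
    const __m256i *end = (const __m256i *)((const char *)buf + len);
    __m256i acc = _mm256_setzero_si256();

    /* Four 32-byte loads per iteration; ORs accumulate into acc. */
    for (; p < end; p += 4) {
        __m256i v0 = _mm256_loadu_si256(p + 0);
        __m256i v1 = _mm256_loadu_si256(p + 1);
        __m256i v2 = _mm256_loadu_si256(p + 2);
        __m256i v3 = _mm256_loadu_si256(p + 3);
        acc = _mm256_or_si256(acc,
              _mm256_or_si256(_mm256_or_si256(v0, v1),
                              _mm256_or_si256(v2, v3)));
    }
    /* VPTEST: true iff all accumulated bytes are zero. */
    return _mm256_testz_si256(acc, acc);
}

The loop does four loads per two OR-reduction steps on the critical path, so
on Zen 4 the two load ports are the bottleneck, not the vector ALUs.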

> And looking at the optimizations for AVX2 in your other patch, would
> unrolling the loop for AVX-512 ops yield a similar speedup, given that the
> data path has the same width?

No; the 256-bit datapath on Zen 4 means that it is easier to saturate it with
512-bit loads than with 256-bit loads, so an AVX-512 loop is roughly
comparable to a similar AVX2 loop unrolled twice.

Aside: the AVX-512 variant needs a little more thought to use VPTERNLOG
properly.
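
To illustrate the idea, VPTERNLOG can fold the accumulator OR and one data OR
into a single instruction; a sketch only (not the submitted patch; head/tail
handling omitted, length assumed to be a nonzero multiple of 128 bytes):

#include <immintrin.h>
#include <stdbool.h>
#include <stddef.h>

static bool buffer_is_zero_avx512_sketch(const void *buf, size_t len)
{
    const __m512i *p = buf;
    const __m512i *end = (const __m512i *)((const char *)buf + len);
    __m512i acc = _mm512_setzero_si512();

    for (; p < end; p += 2) {
        __m512i v0 = _mm512_loadu_si512(p + 0);
        __m512i v1 = _mm512_loadu_si512(p + 1);
        /* acc | v0 | v1 in one VPTERNLOGQ (imm8 0xfe = a|b|c)
           instead of two VPORs. */
        acc = _mm512_ternarylogic_epi64(acc, v0, v1, 0xfe);
    }
    /* VPTESTMQ: mask is zero iff all accumulated bits are zero. */
    return _mm512_test_epi64_mask(acc, acc) == 0;
}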

> If the frequency downscaling is not observed on some of the CPUs, can
> AVX512 be maintained and used selectively for some
> of the CPUs?

Please note that a properly optimized buffer_is_zero is limited by load
throughput, not by the ALUs. On Zen 4, AVX2 is sufficient to saturate the L1
cache load bandwidth in buffer_is_zero, and for data outside of the L1 cache
the benefits of AVX-512 diminish further.

I don't have a Zen 4 based machine at hand to check whether AVX-512 is
beneficial there for buffer_is_zero for reasons such as reaching higher turbo
clocks or higher memory-level parallelism.

Finally, let's consider a somewhat broader perspective. Suppose
buffer_is_zero takes 50% of overall application runtime, and 9 out of 10
buffers are found to be non-zero by the inline wrapper that samples three
bytes. Then the vectorized routine accounts for roughly 5% of application
time, and speeding it up even by 20% shaves off only 1% of overall
execution time.
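
For concreteness, the inline wrapper has roughly the following shape (a
sketch only: the sampled byte positions and the small-length handling are
assumptions, and buffer_is_zero_simd stands for whichever out-of-line routine
is selected):

#include <stdbool.h>
#include <stddef.h>

bool buffer_is_zero_simd(const void *, size_t);  /* out-of-line SIMD scan */

static inline bool buffer_is_zero(const void *vbuf, size_t len)
{
    const unsigned char *buf = vbuf;

    if (len == 0) {
        return true;
    }
    /* Cheap early exit: the vast majority of non-zero buffers fail here. */
    if (buf[0] | buf[len / 2] | buf[len - 1]) {
        return false;
    }
    /* Only (mostly zero) buffers pay for the full SIMD scan. */
    return buffer_is_zero_simd(vbuf, len);
}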

Alexander
