https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #3)
> Just for completeness the codegen for parest sparse matrix multiply is:
>
>   0.31 │320:   kmovb       %k1,%k4
>   0.25 │       kmovb       %k1,%k5
>   0.28 │       vmovdqu32   (%rcx,%rax,1),%zmm0
>   0.32 │       vpmovzxdq   %ymm0,%zmm4
>   0.31 │       vextracti32x8 $0x1,%zmm0,%ymm0
>   0.48 │       vpmovzxdq   %ymm0,%zmm0
>  10.32 │       vgatherqpd  (%r14,%zmm4,8),%zmm2{%k4}
>   1.90 │       vfmadd231pd (%rdx,%rax,2),%zmm2,%zmm1
>  14.86 │       vgatherqpd  (%r14,%zmm0,8),%zmm5{%k5}
>   0.27 │       vfmadd231pd 0x40(%rdx,%rax,2),%zmm5,%zmm1
>   0.26 │       add         $0x40,%rax
>   0.23 │       cmp         %rax,%rdi
>        │     ↑ jne         320
>
> which looks OK to me.

The in-loop mask moves are odd, but yes.

So from the measurements we can conclude that the individual loads do not
behave the same as scalar loads with respect to prefetching (at least).
Of course we know the CPUs will still perform individual loads, even when
contiguous.  Maybe the gathers are simply exempt from influencing the
prefetcher at all.
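For reference, the loop being vectorized is essentially a CSR-style sparse
matrix-vector product; a minimal sketch follows, assuming the usual CSR
layout (identifiers like val, col_idx, row_start, src are illustrative, not
the actual parest/deal.II code):

  #include <cstddef>

  // Minimal CSR sparse matrix-vector product sketch.  The matrix values
  // (val) and column indices (col_idx) are read contiguously, but the
  // source vector is indexed indirectly through col_idx -- this indirect
  // load is what the vectorizer emits as vgatherqpd above.
  void spmv (const double *val, const unsigned *col_idx,
	     const std::size_t *row_start, std::size_t n_rows,
	     const double *src, double *dst)
  {
    for (std::size_t row = 0; row < n_rows; ++row)
      {
	double sum = 0.;
	for (std::size_t j = row_start[row]; j < row_start[row + 1]; ++j)
	  sum += val[j] * src[col_idx[j]];	// gathered load of src
	dst[row] = sum;
      }
  }

A scalar build performs the src[col_idx[j]] loads one element at a time,
which is where the prefetching behavior of gathers versus individual scalar
loads would show up.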