https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #3)
> Just for completeness the codegen for parest sparse matrix multiply is:
> 
>   0.31 │320:   kmovb         %k1,%k4
>   0.25 │       kmovb         %k1,%k5
>   0.28 │       vmovdqu32     (%rcx,%rax,1),%zmm0
>   0.32 │       vpmovzxdq     %ymm0,%zmm4
>   0.31 │       vextracti32x8 $0x1,%zmm0,%ymm0
>   0.48 │       vpmovzxdq     %ymm0,%zmm0
>  10.32 │       vgatherqpd    (%r14,%zmm4,8),%zmm2{%k4}
>   1.90 │       vfmadd231pd   (%rdx,%rax,2),%zmm2,%zmm1
>  14.86 │       vgatherqpd    (%r14,%zmm0,8),%zmm5{%k5}   
>   0.27 │       vfmadd231pd   0x40(%rdx,%rax,2),%zmm5,%zmm1    
>   0.26 │       add           $0x40,%rax
>   0.23 │       cmp           %rax,%rdi                   
>        │     ↑ jne           320                         
> 
> which looks OK to me.

The in-loop mask moves are odd, but yes.

So from the measurements we can conclude that the gathers do not behave
the same as a sequence of scalar loads, at least with respect to prefetching.
Of course we know the CPUs will still perform the individual element loads
internally, even when the indices are contiguous.  Maybe gathers are simply
exempt from influencing the hardware prefetcher at all.
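
For reference, a minimal C sketch of the kind of CSR-style inner loop this
codegen corresponds to (identifiers like rowstart/colnums/val/src/dst are
placeholders, not the actual deal.II names used by parest):

/* CSR sparse matrix-vector product, the rough shape of the vmult inner
   loop.  val[j] is the contiguous load feeding vfmadd231pd above, while
   src[colnums[j]] is the indirect access the vectorizer turns into
   vpmovzxdq + vgatherqpd.  */
void spmv (double *dst, const double *val, const unsigned int *colnums,
           const unsigned int *rowstart, const double *src,
           unsigned int n_rows)
{
  for (unsigned int row = 0; row < n_rows; ++row)
    {
      double s = 0.0;
      for (unsigned int j = rowstart[row]; j < rowstart[row + 1]; ++j)
        s += val[j] * src[colnums[j]];
      dst[row] = s;
    }
}

The gathered src accesses are the ones whose prefetch behavior is in
question here; the contiguous val stream should still be picked up by the
hardware prefetcher as usual.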
