https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #3)
> Just for completeness the codegen for parest sparse matrix multiply is:
>
>   0.31 │320:   kmovb       %k1,%k4
>   0.25 │       kmovb       %k1,%k5
>   0.28 │       vmovdqu32   (%rcx,%rax,1),%zmm0
>   0.32 │       vpmovzxdq   %ymm0,%zmm4
>   0.31 │       vextracti32x8 $0x1,%zmm0,%ymm0
>   0.48 │       vpmovzxdq   %ymm0,%zmm0
>  10.32 │       vgatherqpd  (%r14,%zmm4,8),%zmm2{%k4}
>   1.90 │       vfmadd231pd (%rdx,%rax,2),%zmm2,%zmm1
>  14.86 │       vgatherqpd  (%r14,%zmm0,8),%zmm5{%k5}
>   0.27 │       vfmadd231pd 0x40(%rdx,%rax,2),%zmm5,%zmm1
>   0.26 │       add         $0x40,%rax
>   0.23 │       cmp         %rax,%rdi
>        │     ↑ jne         320
>
> which looks OK to me.

The in-loop mask moves are odd, but yes.

So from the measurements we can conclude that the individual loads do not
behave the same as scalar loads with respect to prefetching (at least).
Of course we know the CPUs will still perform individual loads, even when
contiguous.  Maybe the gathers are simply exempt from influencing the
prefetcher at all.
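For reference, the loop being vectorized is essentially a CSR-style sparse
matrix-vector product; a minimal sketch follows, assuming the usual CSR
layout (identifiers like val, col_idx, row_start, src are illustrative, not
the actual parest/deal.II code):

  #include <cstddef>

  // Minimal CSR sparse matrix-vector product sketch.  The matrix values
  // (val) and column indices (col_idx) are read contiguously, but the
  // source vector is indexed indirectly through col_idx -- this indirect
  // load is what the vectorizer emits as vgatherqpd above.
  void spmv (const double *val, const unsigned *col_idx,
	     const std::size_t *row_start, std::size_t n_rows,
	     const double *src, double *dst)
  {
    for (std::size_t row = 0; row < n_rows; ++row)
      {
	double sum = 0.;
	for (std::size_t j = row_start[row]; j < row_start[row + 1]; ++j)
	  sum += val[j] * src[col_idx[j]];	// gathered load of src
	dst[row] = sum;
      }
  }

A scalar build performs the src[col_idx[j]] loads one element at a time,
which is where the prefetching behavior of gathers versus individual scalar
loads would show up.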