https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582
--- Comment #5 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> (In reply to Jan Hubicka from comment #3)
> > Just for completeness, the codegen for the parest sparse matrix multiply is:
> >
> >   0.31 │320:  kmovb       %k1,%k4
> >   0.25 │      kmovb       %k1,%k5
> >   0.28 │      vmovdqu32   (%rcx,%rax,1),%zmm0
> >   0.32 │      vpmovzxdq   %ymm0,%zmm4
> >   0.31 │      vextracti32x8 $0x1,%zmm0,%ymm0
> >   0.48 │      vpmovzxdq   %ymm0,%zmm0
> >  10.32 │      vgatherqpd  (%r14,%zmm4,8),%zmm2{%k4}
> >   1.90 │      vfmadd231pd (%rdx,%rax,2),%zmm2,%zmm1
> >  14.86 │      vgatherqpd  (%r14,%zmm0,8),%zmm5{%k5}
> >   0.27 │      vfmadd231pd 0x40(%rdx,%rax,2),%zmm5,%zmm1
> >   0.26 │      add         $0x40,%rax
> >   0.23 │      cmp         %rax,%rdi
> >        │    ↑ jne         320
> >
> > which looks OK to me.
>
> The in-loop mask moves are odd, but yes.

It's because vgatherqpd sets %k4 to 0 as it executes, so the mask needs to be reinitialized to -1 (copied from %k1) before each gather.