https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582
--- Comment #5 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> (In reply to Jan Hubicka from comment #3)
> > Just for completeness the codegen for parest sparse matrix multiply is:
> >
> > 0.31 │320: kmovb %k1,%k4
> > 0.25 │ kmovb %k1,%k5
> > 0.28 │ vmovdqu32 (%rcx,%rax,1),%zmm0
> > 0.32 │ vpmovzxdq %ymm0,%zmm4
> > 0.31 │ vextracti32x8 $0x1,%zmm0,%ymm0
> > 0.48 │ vpmovzxdq %ymm0,%zmm0
> > 10.32 │ vgatherqpd (%r14,%zmm4,8),%zmm2{%k4}
> > 1.90 │ vfmadd231pd (%rdx,%rax,2),%zmm2,%zmm1
> > 14.86 │ vgatherqpd (%r14,%zmm0,8),%zmm5{%k5}
> > 0.27 │ vfmadd231pd 0x40(%rdx,%rax,2),%zmm5,%zmm1
> > 0.26 │ add $0x40,%rax
> > 0.23 │ cmp %rax,%rdi
> > │ ↑ jne 320
> >
> > which looks OK to me.
>
> The in-loop mask moves are odd, but yes.
>
>
It's because vgatherqpd clears %k4 to 0 as it completes, so the mask needs to be
reinitialized to -1 (copied from %k1) before each gather.