https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116582
--- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Just for completeness the codegen for parest sparse matrix multiply is:

  0.31 │320:   kmovb          %k1,%k4
  0.25 │       kmovb          %k1,%k5
  0.28 │       vmovdqu32      (%rcx,%rax,1),%zmm0
  0.32 │       vpmovzxdq      %ymm0,%zmm4
  0.31 │       vextracti32x8  $0x1,%zmm0,%ymm0
  0.48 │       vpmovzxdq      %ymm0,%zmm0
 10.32 │       vgatherqpd     (%r14,%zmm4,8),%zmm2{%k4}
  1.90 │       vfmadd231pd    (%rdx,%rax,2),%zmm2,%zmm1
 14.86 │       vgatherqpd     (%r14,%zmm0,8),%zmm5{%k5}
  0.27 │       vfmadd231pd    0x40(%rdx,%rax,2),%zmm5,%zmm1
  0.26 │       add            $0x40,%rax
  0.23 │       cmp            %rax,%rdi
       │     ↑ jne            320

which looks OK to me.
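
For readers mapping the listing back to source, the loop appears to be the inner product of one sparse-matrix row with a dense vector: each iteration loads 16 32-bit column indices, zero-extends them to 64 bits, gathers the matching doubles, and accumulates with two FMAs. A minimal C sketch of such a loop follows. It is not the parest source; the function and parameter names (csr_row_dot, val, col, src, n) are hypothetical, and the unsigned 32-bit index type is inferred from the vpmovzxdq zero-extension. Vectorizing the sum this way reorders the floating-point additions, so the build presumably allowed reassociation (for example -Ofast or -ffast-math).

```c
#include <stddef.h>

/* Hypothetical sketch, not the actual parest code: dot product of one
 * sparse row (n nonzeros, values in val[], column indices in col[])
 * with the dense vector src[]. */
double csr_row_dot(const double *val, const unsigned int *col,
                   const double *src, size_t n)
{
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        sum += val[j] * src[col[j]];   /* src[col[j]] is the indexed load that becomes a gather */
    return sum;
}
```

In the profile above, most of the samples in this loop land on the two vgatherqpd instructions, which correspond to the src[col[j]] loads in the sketch.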