https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'm noting that for skylake cost we have

  _28 * _33 1 times scalar_stmt costs 16 in prologue
  _28 * _33 1 times vector_stmt costs 16 in body

but the load/store costs are just 12, compared to znver2 this tips the bias
over to allow vectorization while for znver2 I currently see no vectorization.
For generic I also see vectorization.

Note that costing currently assumes that the cost model niter check is
performed first and short-cuts all the versioning conditions.  But since
we emit

  _248 = (unsigned int) mk_113;
  _247 = _248 + 4294967295;
  _246 = _247 > 2;
  _245 = stride.4_74 != 0;
  _244 = _245 & _246;
  ...
  _183 = _184 | _211;
  _182 = _183 & _244;
  if (_182 != 0)
    goto <bb 27>; [80.00%]
  else
    goto <bb 28>; [20.00%]

on GIMPLE how things are expanded depends on some luck and with the standalone
testcase and -Ofast with generic tuning we emit the > 2 cost model check
quite late:

        addq    $1, %rdi
        imulq   %r13, %rdi
        leaq    (%rax,%rdi), %rcx
        movq    32(%rsp), %rax
        leaq    (%rax,%rcx), %rsi
        movq    (%rsp), %rax
        leaq    0(,%rsi,8), %rdx
        addq    %rax, %rcx
        leaq    0(,%rcx,8), %rax
        addq    %r13, %rcx
        salq    $3, %rcx
        cmpq    %rcx, %rdx
        setg    %cl
        addq    %r13, %rsi
        salq    $3, %rsi
        cmpq    %rsi, %rax
        setg    %sil
        orb     %cl, %sil
        je      .L8
        movl    -100(%rsp), %esi
        leal    -1(%rsi), %ecx
        cmpl    $2, %ecx         <-----
        movl    112(%rsp), %ecx
        seta    %sil
        testl   %ecx, %ecx
        setg    %cl
        testb   %cl, %sil
        je      .L8

let me try to hack^Wfix this.
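
For illustration only, here is a rough C sketch of the ordering issue described above. It is not GCC code. The function names, the argument types, and the simplified overlap test standing in for the versioning conditions (_183) are all assumptions made for the example. Only the arithmetic of the niter check (_244 .. _248) is taken from the GIMPLE dump.

```c
#include <stdbool.h>

/* Counts evaluations of the (hypothetical) versioning check.  */
static int versioning_evals;

/* Hypothetical stand-in for the versioning conditions (_183 in the dump):
   two byte ranges of equal size must not overlap.  */
static bool versioning_ok(long a_addr, long b_addr, long bytes)
{
    versioning_evals++;
    return (a_addr + bytes <= b_addr) | (b_addr + bytes <= a_addr);
}

/* Cost model niter check, following the dump:
     _247 = (unsigned int) mk + 4294967295;   i.e. mk - 1 with wraparound
     _246 = _247 > 2;
     _245 = stride != 0;
     _244 = _245 & _246;
   Argument types are assumed.  */
static bool niter_check(int mk, int stride)
{
    unsigned int n = (unsigned int) mk + 4294967295u;
    return (stride != 0) & (n > 2);
}

/* Shape of the emitted guard: everything is combined with non-short-circuit
   '&', so the versioning conditions are computed even when the niter check
   alone would already reject the vector path.  */
static bool guard_as_emitted(int mk, int stride,
                             long a_addr, long b_addr, long bytes)
{
    bool v = versioning_ok(a_addr, b_addr, bytes);
    bool c = niter_check(mk, stride);
    return v & c;
}

/* Shape the costing assumes: the niter check runs first and short-cuts
   the versioning conditions.  */
static bool guard_niter_first(int mk, int stride,
                              long a_addr, long b_addr, long bytes)
{
    if (!niter_check(mk, stride))
        return false;
    return versioning_ok(a_addr, b_addr, bytes);
}
```

Both guards return the same result for the same inputs. The difference is only in how much work is done before a short loop (for example mk = 2) falls through to the scalar path.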