https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118145
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
   Target Milestone|---     |14.3

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, I warned that having reduc_* patterns for two-lane vectors could have
this side-effect.  On x86 specifically, load and store costs are high
compared to the operation cost, so saving one scalar load buys a fairly
large "buffer" to spend on slow vector ops.

On trunk with SSE4 we use ptest + sete instead of moving to a GPR:

_Z13canEncodeZeroPKh:
.LFB0:
        .cfi_startproc
        movdqu  (%rdi), %xmm0
        ptest   %xmm0, %xmm0
        sete    %al

That might be superior - but it also shows costing is difficult.  The
vectorizer itself does not consider that the reduction result is only used
by a comparison.  For the plus it shows the saved scalar load makes up for
the cost of the extra stmts (and we don't consider code size or dependence
chain length).

t.ii:6:12: note: Cost model analysis:
_4 + _5 1 times scalar_stmt costs 4 in body
MEM <unsigned long> [(char * {ref-all})buffer_3(D)] 1 times scalar_load costs 12 in body
MEM <unsigned long> [(char * {ref-all})buffer_3(D) + 8B] 1 times scalar_load costs 12 in body
MEM <unsigned long> [(char * {ref-all})buffer_3(D)] 1 times unaligned_load (misalign -1) costs 12 in body
_4 + _5 1 times vector_stmt costs 4 in body
_4 + _5 1 times vec_perm costs 4 in body
_4 + _5 1 times vec_to_scalar costs 4 in body
_4 + _5 0 times scalar_stmt costs 0 in body
t.ii:6:12: note: Cost model analysis for part in loop 0:
  Vector cost: 24
  Scalar cost: 28
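For reference, the mangled name _Z13canEncodeZeroPKh demangles to
canEncodeZero(unsigned char const*), and the cost dump shows two unsigned
long scalar loads (at buffer and buffer + 8B) feeding a single addition
whose result is only compared.  A hypothetical reconstruction of the
testcase consistent with that dump (the body is a guess, not the actual
t.ii source) would be:

```cpp
#include <cstring>

// Hypothetical sketch of the testcase, inferred from the cost dump above:
// two 8-byte scalar loads and one addition whose result only feeds a
// zero comparison.  Not the actual t.ii source from the bug.
bool canEncodeZero(const unsigned char *buffer) {
    unsigned long lo, hi;
    std::memcpy(&lo, buffer, sizeof lo);      // scalar_load, cost 12
    std::memcpy(&hi, buffer + 8, sizeof hi);  // scalar_load, cost 12
    return lo + hi == 0;                      // _4 + _5, scalar_stmt cost 4
}
```

The cost comparison then falls out of the dump: scalar is two loads plus
the add (12 + 12 + 4 = 28), while the vectorized form is one unaligned
load, a vector add, a vec_perm, and a vec_to_scalar (12 + 4 + 4 + 4 = 24),
so the single saved scalar load pays for three extra vector stmts.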