https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118145

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |14.3

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, I warned that having reduc_* patterns for two-lane vectors could have this
side-effect.  On x86 specifically, the load and store costs are high relative
to the operation cost, so saving one scalar load buys a fairly large "buffer"
to spend on slow vector ops.

On trunk with SSE4 we are using ptest + sete instead of moving to a GPR:

_Z13canEncodeZeroPKh:
.LFB0:
        .cfi_startproc
        movdqu  (%rdi), %xmm0
        ptest   %xmm0, %xmm0
        sete    %al

that might be superior - but it also shows that costing is difficult.  The
vectorizer itself does not consider that the reduction result is used only by
a comparison.

For the plus reduction it shows that the saved scalar load makes up for the
extra statement cost (and we don't consider code size or dependence-chain
length).

t.ii:6:12: note: Cost model analysis:
_4 + _5 1 times scalar_stmt costs 4 in body
MEM <unsigned long> [(char * {ref-all})buffer_3(D)] 1 times scalar_load costs 12 in body
MEM <unsigned long> [(char * {ref-all})buffer_3(D) + 8B] 1 times scalar_load costs 12 in body
MEM <unsigned long> [(char * {ref-all})buffer_3(D)] 1 times unaligned_load (misalign -1) costs 12 in body
_4 + _5 1 times vector_stmt costs 4 in body
_4 + _5 1 times vec_perm costs 4 in body
_4 + _5 1 times vec_to_scalar costs 4 in body
_4 + _5 0 times scalar_stmt costs 0 in body
t.ii:6:12: note: Cost model analysis for part in loop 0:
  Vector cost: 24
  Scalar cost: 28
