https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88839

            Bug ID: 88839
           Summary: [SVE] Poor implementation of blend-like permutes
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---

Compiling this code with -O3 -msve-vector-bits=256:

typedef int v8si __attribute__((vector_size(32)));
v8si
f (v8si x, v8si y, v8si sel)
{
  return __builtin_shuffle (x, y, (v8si) { 0, 9, 2, 11, 4, 13, 6, 15 });
}

produces an inefficient TBL-based sequence.

In these blend-like cases, where index I of the output comes from index I of
one of the inputs, we should be able to use a SEL with an appropriate predicate
constant.  The preferred implementation of the above would be:

        ptrue    p0.d, vl4        // { 1, 0, 1, 0, ... } when used as p0.s
        sel      res, p0, x, y
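
As a rough illustration of the "blend-like" condition described above, the
check could be modelled in plain C as follows.  This is only a sketch:
blend_like_p and its from_first output array are hypothetical names, not
existing GCC functions, and a real implementation would work on the
backend's permute-index representation rather than a plain int array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of GCC.  Return true if SEL describes a
   blend of two NELTS-element vectors, i.e. if every element I of SEL is
   either I (taken from the first input) or I + NELTS (taken from the
   second input).

   On success, FROM_FIRST[I] is 1 where the first input supplies element I
   and 0 where the second input does.  This is the per-element predicate
   that a select-style instruction would need.  On failure, the contents
   of FROM_FIRST are unspecified.  */
static bool
blend_like_p (const int *sel, size_t nelts, unsigned char *from_first)
{
  for (size_t i = 0; i < nelts; ++i)
    {
      if (sel[i] == (int) i)
        from_first[i] = 1;
      else if (sel[i] == (int) (i + nelts))
        from_first[i] = 0;
      else
        /* Element I comes from a different lane, so this is a general
           permute rather than a blend.  */
        return false;
    }
  return true;
}
```

For the selector in the testcase, { 0, 9, 2, 11, 4, 13, 6, 15 }, this model
accepts the permute and marks the even-numbered elements as coming from the
first input, which matches the { 1, 0, 1, 0, ... } predicate shown above.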

This will also be useful for the default VL-agnostic mode when implementing
support for 2-operation SLP.
