https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88839
Bug ID: 88839
Summary: [SVE] Poor implementation of blend-like permutes
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rsandifo at gcc dot gnu.org
Target Milestone: ---

Compiling this code with -O3 -msve-vector-bits=256:

    typedef int v8si __attribute__((vector_size(32)));

    v8si
    f (v8si x, v8si y, v8si sel)
    {
      return __builtin_shuffle (x, y, (v8si) { 0, 9, 2, 11, 4, 13, 6, 15 });
    }

produces an inefficient TBL-based sequence.  In these blend-like cases, where index I of the output comes from index I of one of the inputs, we should be able to use a SEL with an appropriate predicate constant.  The preferred implementation of the above would be:

    ptrue   p0.d, vl4     // { 1, 0, 1, 0, ... } when used as p0.s
    sel     res, p0, x, y // active lanes from x, inactive lanes from y

This will also be useful for the default VL-agnostic mode when implementing support for 2-operation SLP.
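
For illustration only, here is a scalar sketch of the blend-like test described above: lane I of the result must come from lane I of one of the two inputs, and the predicate records which input that is. The names blend_predicate and select_lanes are hypothetical helpers invented for this note; they are not GCC internals, and the code only models the lane semantics rather than any actual expander.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not a GCC function).  SEL holds N indices in the
   range [0, 2*N), using the __builtin_shuffle convention that indices
   0..N-1 name lanes of the first input and N..2*N-1 name lanes of the
   second.  Returns true if the permute is blend-like, in which case
   PRED[I] is true when lane I comes from the first input and false when
   it comes from the second.  */
bool
blend_predicate (const int *sel, size_t n, bool *pred)
{
  for (size_t i = 0; i < n; i++)
    {
      if (sel[i] == (int) i)
        pred[i] = true;
      else if (sel[i] == (int) (i + n))
        pred[i] = false;
      else
        return false;   /* Lane I would move data across lanes.  */
    }
  return true;
}

/* Scalar model of a predicated select: lanes whose predicate element is
   true take X, the remaining lanes take Y.  */
void
select_lanes (const bool *pred, const int *x, const int *y,
              int *res, size_t n)
{
  for (size_t i = 0; i < n; i++)
    res[i] = pred[i] ? x[i] : y[i];
}
```

For the selector { 0, 9, 2, 11, 4, 13, 6, 15 } with N = 8, blend_predicate accepts the permute and sets the predicate on the even lanes only, which is the { 1, 0, 1, 0, ... } pattern mentioned in the report.  A selector such as a full reversal is rejected, since it is not a blend.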