On Tue, Dec 15, 2015 at 11:35:45AM +0000, Wilco Dijkstra wrote:
> Add support for vector permute cost since various permutes can expand into a complex
> sequence of instructions. This fixes major performance regressions due to recent changes
> in the SLP vectorizer (which now vectorizes more aggressively and emits many complex
> permutes).
>
> Set the cost to > 1 for all microarchitectures so that the number of permutes is usually zero
> and regressions disappear. An example of the kind of code that might be emitted for
> VEC_PERM_EXPR {0, 3} where registers happen to be in the wrong order:
>
>         adrp    x4, .LC16
>         ldr     q5, [x4, #:lo12:.LC16]
>         eor     v1.16b, v1.16b, v0.16b
>         eor     v0.16b, v1.16b, v0.16b
>         eor     v1.16b, v1.16b, v0.16b
>         tbl     v0.16b, {v0.16b - v1.16b}, v5.16b
>
> Regress passes. This fixes regressions that were introduced recently, so OK for commit?
>
>
> ChangeLog:
> 2015-12-15  Wilco Dijkstra  <wdijk...@arm.com>
>
>         * gcc/config/aarch64/aarch64.c (generic_vector_cost):
>         Set vec_permute_cost.
>         (cortexa57_vector_cost): Likewise.
>         (exynosm1_vector_cost): Likewise.
>         (xgene1_vector_cost): Likewise.
>         (aarch64_builtin_vectorization_cost): Use vec_permute_cost.
>         * gcc/config/aarch64/aarch64-protos.h (cpu_vector_cost):
>         Add vec_permute_cost entry.
>
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 10754c88c0973d8ef3c847195b727f02b193bbd8..2584f16d345b3d015d577dd28c08a73ee3e0b0fb 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -314,6 +314,7 @@ static const struct cpu_vector_cost generic_vector_cost =
>    1, /* scalar_load_cost */
>    1, /* scalar_store_cost */
>    1, /* vec_stmt_cost */
> +  2, /* vec_permute_cost */
>    1, /* vec_to_scalar_cost */
>    1, /* scalar_to_vec_cost */
>    1, /* vec_align_load_cost */

Is there any reasoning behind making this 2? Do we now miss vectorization
for some of the cheaper permutes? Across the cost models/pipeline
descriptions that have been contributed to GCC I think that this is a
sensible change to the generic costs, but I just want to check there was
some reasoning/experimentation behind the number you picked.

As permutes can have such wildly different costs, this all seems like a
good candidate for some future, much more involved hook from the vectorizer
to the back-end specifying the candidate permute operation and requesting a
cost (part of the bigger gimple costs framework?).

Thanks,
James
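
To make the idea of a per-permute cost query a little more concrete, here
is a rough, self-contained sketch in C. It is not GCC code and does not
reflect any existing hook: the struct, the enum, and both function names
are hypothetical, and the classification rules (identity is free, a
broadcast or interleave-low is one instruction, anything else is a general
table lookup) are assumptions made purely for illustration. Only the field
names vec_stmt_cost and vec_permute_cost are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the back end's cost table.  Only the field
   names come from the patch; the struct itself is hypothetical.  */
struct vector_cost_sketch
{
  int vec_stmt_cost;     /* Cost of an ordinary vector statement.  */
  int vec_permute_cost;  /* Flat cost charged for a general permute.  */
};

/* Hypothetical classes of a two-input permute selector.  */
enum permute_kind_sketch
{
  PERM_IDENTITY,     /* Selects the first input unchanged.  */
  PERM_SINGLE_INSN,  /* Assumed to map to a single instruction.  */
  PERM_GENERAL       /* Anything else; assumed to need a table lookup.  */
};

/* Classify selector SEL of NELT elements.  Indices 0..NELT-1 refer to the
   first input, NELT..2*NELT-1 to the second.  */
static enum permute_kind_sketch
classify_permute_sketch (const unsigned *sel, unsigned nelt)
{
  assert (nelt > 0);

  bool identity = true, broadcast = true, interleave_lo = true;
  for (unsigned i = 0; i < nelt; i++)
    {
      identity &= (sel[i] == i);
      broadcast &= (sel[i] == sel[0]);
      /* Interleave-low pattern: 0, nelt, 1, nelt + 1, ...  */
      interleave_lo &= (sel[i] == i / 2 + (i % 2 ? nelt : 0));
    }

  if (identity)
    return PERM_IDENTITY;
  if (broadcast || interleave_lo)
    return PERM_SINGLE_INSN;
  return PERM_GENERAL;
}

/* Return an assumed cost for the permute described by SEL.  */
static int
permute_cost_sketch (const struct vector_cost_sketch *costs,
                     const unsigned *sel, unsigned nelt)
{
  switch (classify_permute_sketch (sel, nelt))
    {
    case PERM_IDENTITY:
      return 0;
    case PERM_SINGLE_INSN:
      return costs->vec_stmt_cost;
    default:
      return costs->vec_permute_cost;
    }
}
```

With a table of { 1, 2 }, the {0, 3} selector from the quoted example falls
into the general class in this sketch and is charged vec_permute_cost,
while an interleave-low selector such as {0, 4, 1, 5} is charged only
vec_stmt_cost. A real hook would need the back end's actual knowledge of
which selectors expand to a single instruction.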