On Tue, Dec 15, 2015 at 11:35:45AM +0000, Wilco Dijkstra wrote:
> 
> Add support for vector permute cost since various permutes can expand
> into a complex sequence of instructions.  This fixes major performance
> regressions due to recent changes in the SLP vectorizer (which now
> vectorizes more aggressively and emits many complex permutes).
> 
> Set the cost to > 1 for all microarchitectures so that the number of
> permutes is usually zero and regressions disappear.  An example of the
> kind of code that might be emitted for VEC_PERM_EXPR {0, 3} where
> registers happen to be in the wrong order:
> 
>         adrp    x4, .LC16
>         ldr     q5, [x4, #:lo12:.LC16]
>         eor     v1.16b, v1.16b, v0.16b
>         eor     v0.16b, v1.16b, v0.16b
>         eor     v1.16b, v1.16b, v0.16b
>         tbl     v0.16b, {v0.16b - v1.16b}, v5.16b
> 
> Regress passes. This fixes regressions that were introduced recently,
> so OK for commit?
> 
> 
> ChangeLog:
> 2015-12-15  Wilco Dijkstra  <wdijk...@arm.com>
> 
>       * gcc/config/aarch64/aarch64.c (generic_vector_cost):
>       Set vec_permute_cost.
>       (cortexa57_vector_cost): Likewise.
>       (exynosm1_vector_cost): Likewise.
>       (xgene1_vector_cost): Likewise.
>       (aarch64_builtin_vectorization_cost): Use vec_permute_cost.
>       * gcc/config/aarch64/aarch64-protos.h (cpu_vector_cost):
>       Add vec_permute_cost entry.
> 
> 
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 10754c88c0973d8ef3c847195b727f02b193bbd8..2584f16d345b3d015d577dd28c08a73ee3e0b0fb 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -314,6 +314,7 @@ static const struct cpu_vector_cost generic_vector_cost =
>    1, /* scalar_load_cost  */
>    1, /* scalar_store_cost  */
>    1, /* vec_stmt_cost  */
> +  2, /* vec_permute_cost  */
>    1, /* vec_to_scalar_cost  */
>    1, /* scalar_to_vec_cost  */
>    1, /* vec_align_load_cost  */

Is there any reasoning behind making this 2? Do we now miss vectorization
for some of the cheaper permutes? Looking across the cost models and
pipeline descriptions that have been contributed to GCC, I think this is
a sensible change to the generic costs, but I want to check that some
reasoning or experimentation went into the number you picked.
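
To make the concern concrete, here is a deliberately simplified sketch of
how a per-permute cost can flip a vectorization decision. It is not the
vectorizer's actual cost model: the struct is a hypothetical cut-down
stand-in for cpu_vector_cost, and toy_worth_vectorizing and the statement
counts are made up for illustration.

```c
/* Hypothetical, cut-down stand-in for the cost table touched by the
   patch; only the fields this sketch needs.  */
struct toy_vector_cost
{
  int scalar_stmt_cost;
  int vec_stmt_cost;
  int vec_permute_cost;
};

/* Toy decision rule (an assumption, not GCC's logic): vectorize only
   if the summed vector cost is strictly below the summed scalar cost.  */
static int
toy_worth_vectorizing (const struct toy_vector_cost *c, int scalar_stmts,
                       int vec_stmts, int vec_permutes)
{
  int scalar_cost = scalar_stmts * c->scalar_stmt_cost;
  int vector_cost = vec_stmts * c->vec_stmt_cost
                    + vec_permutes * c->vec_permute_cost;
  return vector_cost < scalar_cost;
}
```

With four scalar statements against one vector statement plus two
permutes, a permute cost of 1 gives a vector cost of 3 and the toy rule
vectorizes; a permute cost of 2 gives 5 and it does not, even if both
permutes would have been single instructions.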

As permutes can have such wildly different costs, this seems like a good
candidate for a future, more involved hook from the vectorizer to the
back end, one that specifies the candidate permute operation and requests
a cost for it (part of the bigger gimple costs framework?).
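
As a rough sketch of what such a hook might look like, the function below
takes the permute selector and returns a cost. Everything here is
hypothetical: no such hook exists in the patch, toy_permute_cost and the
cost values are invented for illustration, and the classification only
recognizes a couple of patterns. It assumes a valid selector whose
entries index the concatenation of the two inputs, as in VEC_PERM_EXPR.

```c
#include <stddef.h>

/* Hypothetical costs, chosen only for illustration.  */
enum
{
  TOY_COST_NONE = 0,        /* No permute instruction needed.  */
  TOY_COST_SINGLE_INSN = 1, /* One EXT- or ZIP1-style instruction.  */
  TOY_COST_GENERIC_TBL = 3  /* Index load plus a table lookup.  */
};

/* SEL has NELT entries, each indexing the 2 * NELT elements of the
   two input vectors laid end to end.  */
static int
toy_permute_cost (const unsigned *sel, size_t nelt)
{
  int contiguous = 1; /* sel[i] == sel[0] + i: a window, as EXT extracts.  */
  int zip_lo = 1;     /* {0, nelt, 1, nelt + 1, ...}: ZIP1-style interleave.  */

  for (size_t i = 0; i < nelt; i++)
    {
      if (sel[i] != sel[0] + i)
        contiguous = 0;
      if (sel[i] != i / 2 + (i % 2 ? nelt : 0))
        zip_lo = 0;
    }

  /* A window starting at 0 or NELT is just one of the inputs.  */
  if (contiguous && (sel[0] == 0 || sel[0] == nelt))
    return TOY_COST_NONE;
  if (contiguous || zip_lo)
    return TOY_COST_SINGLE_INSN;
  return TOY_COST_GENERIC_TBL;
}
```

Under this toy classification, the two-element selector {0, 3} from the
example above matches neither pattern and gets the generic cost, while
{1, 2} and {0, 2} each get the single-instruction cost.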

Thanks,
James
