On Mon, May 13, 2024 at 4:29 AM liuhongt <hongtao....@intel.com> wrote:
>
> As the testcase in the PR shows, cunrolli at -O3 may prevent vectorization of the
> innermost loop and increase register pressure.
> The patch removes the 1/3 reduction of unr_insns for the innermost loop for UL_ALL.
> ul != UL_ALL is needed since complete unrolling of some small loops at -O2 relies
> on the reduction.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> No big impact for SPEC2017.
> Ok for trunk?

This removes the 1/3 reduction when unrolling a loop nest (the case I was
concerned about).  Unrolling of a nest is done by iterating in
tree_unroll_loops_completely, so the to-be-unrolled loop appears innermost
there.  So I think you need a new parameter on tree_unroll_loops_completely_1
indicating whether we're in the first iteration (or whether to assume
innermost loops will "simplify").

A few comments below.

> gcc/ChangeLog:
>
>         PR tree-optimization/112325
>         * tree-ssa-loop-ivcanon.cc (estimated_unrolled_size): Add 2
>         new parameters: loop and ul, and remove unr_insns reduction
>         for innermost loop.
>         (try_unroll_loop_completely): Pass loop and ul to
>         estimated_unrolled_size.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/tree-ssa/pr112325.c: New test.
>         * gcc.dg/vect/pr69783.c: Add extra option --param
>         max-completely-peeled-insns=300.
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/pr112325.c | 57 ++++++++++++++++++++++++
>  gcc/testsuite/gcc.dg/vect/pr69783.c      |  2 +-
>  gcc/tree-ssa-loop-ivcanon.cc             | 16 +++++--
>  3 files changed, 71 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> new file mode 100644
> index 00000000000..14208b3e7f8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr112325.c
> @@ -0,0 +1,57 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-cunrolli-details" } */
> +
> +typedef unsigned short ggml_fp16_t;
> +static float table_f32_f16[1 << 16];
> +
> +inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
> +    unsigned short s;
> +    __builtin_memcpy(&s, &f, sizeof(unsigned short));
> +    return table_f32_f16[s];
> +}
> +
> +typedef struct {
> +    ggml_fp16_t d;
> +    ggml_fp16_t m;
> +    unsigned char qh[4];
> +    unsigned char qs[32 / 2];
> +} block_q5_1;
> +
> +typedef struct {
> +    float d;
> +    float s;
> +    char qs[32];
> +} block_q8_1;
> +
> +void ggml_vec_dot_q5_1_q8_1(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
> +    const int qk = 32;
> +    const int nb = n / qk;
> +
> +    const block_q5_1 * restrict x = vx;
> +    const block_q8_1 * restrict y = vy;
> +
> +    float sumf = 0.0;
> +
> +    for (int i = 0; i < nb; i++) {
> +        unsigned qh;
> +        __builtin_memcpy(&qh, x[i].qh, sizeof(qh));
> +
> +        int sumi = 0;
> +
> +        for (int j = 0; j < qk/2; ++j) {
> +            const unsigned char xh_0 = ((qh >> (j + 0)) << 4) & 0x10;
> +            const unsigned char xh_1 = ((qh >> (j + 12)) ) & 0x10;
> +
> +            const int x0 = (x[i].qs[j] & 0xF) | xh_0;
> +            const int x1 = (x[i].qs[j] >> 4) | xh_1;
> +
> +            sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
> +        }
> +
> +        sumf += (ggml_lookup_fp16_to_fp32(x[i].d)*y[i].d)*sumi + ggml_lookup_fp16_to_fp32(x[i].m)*y[i].s;
> +    }
> +
> +    *s = sumf;
> +}
> +
> +/* { dg-final { scan-tree-dump {(?n)Not unrolling loop [1-9] \(--param max-completely-peel-times limit reached} "cunrolli"} } */
> diff --git a/gcc/testsuite/gcc.dg/vect/pr69783.c b/gcc/testsuite/gcc.dg/vect/pr69783.c
> index 5df95d0ce4e..a1f75514d72 100644
> --- a/gcc/testsuite/gcc.dg/vect/pr69783.c
> +++ b/gcc/testsuite/gcc.dg/vect/pr69783.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-require-effective-target vect_float } */
> -/* { dg-additional-options "-Ofast -funroll-loops" } */
> +/* { dg-additional-options "-Ofast -funroll-loops --param max-completely-peeled-insns=300" } */

If we rely on unrolling of a loop, can you put #pragma unroll [N]
before the respective loop instead?

>  #define NXX 516
>  #define NYY 516
> diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
> index bf017137260..5e0eca647a1 100644
> --- a/gcc/tree-ssa-loop-ivcanon.cc
> +++ b/gcc/tree-ssa-loop-ivcanon.cc
> @@ -444,7 +444,9 @@ tree_estimate_loop_size (class loop *loop, edge exit, edge edge_to_cancel,
>
>  static unsigned HOST_WIDE_INT
>  estimated_unrolled_size (struct loop_size *size,
> -                        unsigned HOST_WIDE_INT nunroll)
> +                        unsigned HOST_WIDE_INT nunroll,
> +                        enum unroll_level ul,
> +                        class loop* loop)
>  {
>    HOST_WIDE_INT unr_insns = ((nunroll)
>                              * (HOST_WIDE_INT) (size->overall
> @@ -453,7 +455,15 @@ estimated_unrolled_size (struct loop_size *size,
>      unr_insns = 0;
> +  unr_insns += size->last_iteration - size->last_iteration_eliminated_by_peeling;
>
> -  unr_insns = unr_insns * 2 / 3;
> +  /* For the innermost loop, the loop body is not likely to be simplified
> +     by as much as 1/3, and unrolling may increase register pressure a lot.
> +     UL != UL_ALL is needed to unroll small loops at O2.  */
> +  class loop *loop_father = loop_outer (loop);
> +  if (loop->inner || !loop_father

Do we ever get here for !loop_father?  We shouldn't.

> +      || loop_father->latch == EXIT_BLOCK_PTR_FOR_FN (cfun)

This means you exempt all loops that are direct children of the loop
tree root.  That doesn't make much sense.

> +      || ul != UL_ALL)

This is also quite odd - we're being more optimistic for UL_NO_GROWTH
than for UL_ALL?  This doesn't make much sense.

Overall I think this means that dropping the optimistic estimate doesn't work so well?

If we need some extra leeway for UL_NO_GROWTH for what we expect
to unroll, it might be better to add something like --param
nogrowth-completely-peeled-insns
specifying a fixed surplus size?  Or we need to look at what the problem
is with the regressing testcases, or the one you are trying to fix.

I did experiment with better estimating the cleanup done at some point
(see attached),
but didn't get to finishing that (and as said, since we're running VN on the
result we'd ideally do that as part of the estimation somehow).

Richard.

> +    unr_insns = unr_insns * 2 / 3;
> +
>    if (unr_insns <= 0)
>      unr_insns = 1;
>
> @@ -837,7 +847,7 @@ try_unroll_loop_completely (class loop *loop,
>
>           unsigned HOST_WIDE_INT ninsns = size.overall;
>           unsigned HOST_WIDE_INT unr_insns
> -           = estimated_unrolled_size (&size, n_unroll);
> +           = estimated_unrolled_size (&size, n_unroll, ul, loop);
>           if (dump_file && (dump_flags & TDF_DETAILS))
>             {
>               fprintf (dump_file, "  Loop size: %d\n", (int) ninsns);
> --
> 2.31.1
>

Attachment: p
Description: Binary data
