Tamar Christina <tamar.christ...@arm.com> writes:
>> > +
>> > +  /* No probe leave.  */
>> > +  ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_end_lab);
>> > +  return "";
>> 
>> With the CFA stuff and constant load, I think this works out as:
>> 
>> ---------------------------------------------
>> # 12 insns
>>      mov     r15, base
>>      mov     adjustment, N
>> 1:
>>      cmp     adjustment, guard_size
>>      b.lt    2f
>>      sub     base, base, guard_size
>>      str     xzr, [base, limit]
>>      sub     adjustment, adjustment, guard_size
>>      b       1b
>> 2:
>>      sub     base, base, adjustment
>>      cmp     adjustment, limit
>>      b.le    3f
>>      str     xzr, [base, limit]
>> 3:
>> ---------------------------------------------
>> 
>> What do you think about something like:
>> 
>> ---------------------------------------------
>> # 10 insns
>>      mov     adjustment, N
>>      sub     r15, base, adjustment
>>      subs    adjustment, adjustment, min_probe_threshold
>>      b.lo    2f
>> 1:
>>      add     base, x15, adjustment
>>      str     xzr, [base, 0]
>>      subs    adjustment, adjustment, 16
>>      and     adjustment, adjustment, ~(guard_size-1)
>>      b.hs    1b
>> 2:
>>      mov     base, r15
>> ---------------------------------------------
>> 
>> or (with different trade-offs):
>> 
>> ---------------------------------------------
>> # 11 insns
>>      mov     adjustment, N
>>      sub     r15, base, adjustment
>>      subs    adjustment, adjustment, min_probe_threshold
>>      b.lo    2f
>>      # Might be 0, leading to a double probe
>>      and     r14, adjustment, guard_size-1
>> 1:
>>      add     base, x15, adjustment
>>      str     xzr, [base, 0]
>>      subs    adjustment, adjustment, r14
>>      mov     r14, guard_size
>>      b.hs    1b
>> 2:
>>      mov     base, r15
>> ---------------------------------------------
>> 
>> or (longer, but with a simpler loop):
>> 
>> ---------------------------------------------
>> # 12 insns
>>      mov     adjustment, N
>>      sub     r15, base, adjustment
>>      subs    adjustment, adjustment, min_probe_threshold
>>      b.lo    2f
>>      str     xzr, [base, -16]!
>>      sub     adjustment, adjustment, 32
>>      and     adjustment, adjustment, -(guard_size-1)
>> 1:
>>      add     base, x15, adjustment
>>      str     xzr, [base, 0]
>>      subs    adjustment, adjustment, guard_size
>>      b.hs    1b
>> 2:
>>      mov     base, r15
>> ---------------------------------------------
>> 
>> with the CFA based on r15+offset?
>> 
>> These loops probe more often than necessary in some cases,
>> but they only need a single branch in the common case that
>> ADJUSTMENT <= MIN_PROBE_THRESHOLD.
>
> I haven't changed the loop yet because I'm a bit on the edge about
> whether the implementation difficulties would outweigh the benefits.
> We are planning on doing something smarter for SVE so optimizing these
> loops only to replace them later may not be time well spent now.
>
> The problem is that to support both 4KB and 64KB pages, instructions such
> as subs would require different immediates and shifts. Granted we technically
> only support these two so I could hardcode the values, but that would mean
> these functions are less general than the rest.

Because of the min_probe_threshold?  You could conservatively clamp it
to the next lowest value that's in range, which we could do without
having to hard-code specific values.  I think it would be better
to do that even with the current code, since hard-coding 2048 with:

  /* Test if ADJUSTMENT < RESIDUAL_PROBE_GUARD, in principle any power of two
     larger than 1024B would work, but we need one that works for all supported
     guard-sizes.  What we actually want to check is guard-size - 1KB, but this
     immediate won't fit inside a cmp without requiring a tempory, so instead we
     just accept a smaller immediate that doesn't, we may probe a bit more often
     but that doesn't matter much on the long run.  */

seems a bit of a hack.

> If you think it would be worthwhile, I'd be happy to use one of these
> loops instead.

Yeah, I still think we should do this unless we can commit to doing
the optimised version by a specific date, and that date is soon enough
that the optimisation could reasonably be backported to GCC 8.

> @@ -4830,11 +4929,11 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, 
> rtx temp2,
>       }
>      }
> 
> -  HOST_WIDE_INT size;
> +  HOST_WIDE_INT size = 0;
>    /* If SIZE is not large enough to require probing, just adjust the stack 
> and
>       exit.  */
> -  if (!poly_size.is_constant (&size)
> -      || known_lt (poly_size, min_probe_threshold)
> +  if ((poly_size.is_constant (&size)
> +       && known_lt (poly_size, min_probe_threshold))
>        || !flag_stack_clash_protection)
>      {
>        aarch64_sub_sp (temp1, temp2, poly_size, frame_related_p);

I still think we should remove this poly_size.is_constant, and instead:

> @@ -4842,9 +4941,64 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx 
> temp2,
>      }
 
>    if (dump_file)
> -    fprintf (dump_file,
> -          "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC " bytes"
> -          ", probing will be required.\n", size);
> +    {
> +      if (poly_size.is_constant ())
> +     fprintf (dump_file,
> +              "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC
> +              " bytes, probing will be required.\n", size);
> +      else
> +     {
> +       fprintf (dump_file, "Stack clash SVE prologue: ");
> +       print_dec (poly_size, dump_file);
> +       fprintf (dump_file, " bytes, dynamic probing will be required.\n");
> +     }
> +    }
> +
> +  /* Handle the SVE non-constant case first.  */
> +  if (!poly_size.is_constant ())

...use is_constant (&size) here, and put the dump messages for the
constant and non-constant cases in their respective constant and
non-constant blocks.  That way each use of "size" is directly protected
by an is_constant call, and there's no need to initialise size to 0.

The non-constant case doesn't have the new special handling of
final_adjustment_p, so I think the !is_constant block should assert
!final_adjustment_p.

Thanks,
Richard

Reply via email to