Tamar Christina <tamar.christ...@arm.com> writes: >> > + >> > + /* No probe leave. */ >> > + ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_end_lab); >> > + return ""; >> >> With the CFA stuff and constant load, I think this works out as: >> >> --------------------------------------------- >> # 12 insns >> mov r15, base >> mov adjustment, N >> 1: >> cmp adjustment, guard_size >> b.lt 2f >> sub base, base, guard_size >> str xzr, [base, limit] >> sub adjustment, adjustment, guard_size >> b 1b >> 2: >> sub base, base, adjustment >> cmp adjustment, limit >> b.le 3f >> str xzr, [base, limit] >> 3: >> --------------------------------------------- >> >> What do you think about something like: >> >> --------------------------------------------- >> # 10 insns >> mov adjustment, N >> sub r15, base, adjustment >> subs adjustment, adjustment, min_probe_threshold >> b.lo 2f >> 1: >> add base, x15, adjustment >> str xzr, [base, 0] >> subs adjustment, adjustment, 16 >> and adjustment, adjustment, ~(guard_size-1) >> b.hs 1b >> 2: >> mov base, r15 >> --------------------------------------------- >> >> or (with different trade-offs): >> >> --------------------------------------------- >> # 11 insns >> mov adjustment, N >> sub r15, base, adjustment >> subs adjustment, adjustment, min_probe_threshold >> b.lo 2f >> # Might be 0, leading to a double probe >> and r14, adjustment, guard_size-1 >> 1: >> add base, x15, adjustment >> str xzr, [base, 0] >> subs adjustment, adjustment, r14 >> mov r14, guard_size >> b.hs 1b >> 2: >> mov base, r15 >> --------------------------------------------- >> >> or (longer, but with a simpler loop): >> >> --------------------------------------------- >> # 12 insns >> mov adjustment, N >> sub r15, base, adjustment >> subs adjustment, adjustment, min_probe_threshold >> b.lo 2f >> str xzr, [base, -16]! >> sub adjustment, adjustment, 32 >> and adjustment, adjustment, -(guard_size-1) >> 1: >> add base, x15, adjustment >> str xzr, [base, 0] >> subs adjustment, adjustment, guard_size >> b.hs 1b >> 2: >> mov base, r15 >> --------------------------------------------- >> >> with the CFA based on r15+offset? >> >> These loops probe more often than necessary in some cases, >> but they only need a single branch in the common case that >> ADJUSTMENT <= MIN_PROBE_THRESHOLD. > > I haven't changed the loop yet because I'm a bit on the edge about > whether the implementation difficulties would outweigh the benefits. > We are planning on doing something smarter for SVE so optimizing these > loops only to replace them later may not be time well spent now. > > The problem is that to support both 4KB and 64KB pages, instructions such > as subs would require different immediates and shifts. Granted we technically > only support these two so I could hardcode the values, but that would mean > these functions are less general than the rest.
Because of the min_probe_threshold? You could conservatively clamp it to the next lowest value that's in range, which we could do without having to hard-code specific values. I think it would be better to do that even with the current code, since hard-coding 2048 with: /* Test if ADJUSTMENT < RESIDUAL_PROBE_GUARD, in principle any power of two larger than 1024B would work, but we need one that works for all supported guard-sizes. What we actually want to check is guard-size - 1KB, but this immediate won't fit inside a cmp without requiring a tempory, so instead we just accept a smaller immediate that doesn't, we may probe a bit more often but that doesn't matter much on the long run. */ seems a bit of a hack. > If you think it would be worthwhile, I'd be happy to use one of these > loops instead. Yeah, I still think we should do this unless we can commit to doing the optimised version by a specific date, and that date is soon enough that the optimisation could reasonably be backported to GCC 8. > @@ -4830,11 +4929,11 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, > rtx temp2, > } > } > > - HOST_WIDE_INT size; > + HOST_WIDE_INT size = 0; > /* If SIZE is not large enough to require probing, just adjust the stack > and > exit. */ > - if (!poly_size.is_constant (&size) > - || known_lt (poly_size, min_probe_threshold) > + if ((poly_size.is_constant (&size) > + && known_lt (poly_size, min_probe_threshold)) > || !flag_stack_clash_protection) > { > aarch64_sub_sp (temp1, temp2, poly_size, frame_related_p); I still think we should remove this poly_size.is_constant, and instead: > @@ -4842,9 +4941,64 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx > temp2, > } > if (dump_file) > - fprintf (dump_file, > - "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC " bytes" > - ", probing will be required.\n", size); > + { > + if (poly_size.is_constant ()) > + fprintf (dump_file, > + "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC > + " bytes, probing will be required.\n", size); > + else > + { > + fprintf (dump_file, "Stack clash SVE prologue: "); > + print_dec (poly_size, dump_file); > + fprintf (dump_file, " bytes, dynamic probing will be required.\n"); > + } > + } > + > + /* Handle the SVE non-constant case first. */ > + if (!poly_size.is_constant ()) ...use is_constant (&size) here, and put the dump messages for the constant and non-constant cases in their respective constant and non-constant blocks. That way each use of "size" is directly protected by an is_constant call, and there's no need to initialise size to 0. The non-constant case doesn't have the new special handling of final_adjustment_p, so I think the !is_constant block should assert !final_adjustment_p. Thanks, Richard