On 07/19/2017 11:53 AM, Wilco Dijkstra wrote:
> Hi Jeff,
> 
> There is an issue with your AArch64 patch, it fails to apply properly and does
> so silently using 'patch'. I also noticed some odd control characters in the 
> other
> patches, but they didn't appear to fail (or at least everything builds).
I think everything was done on top of r250005 (July 5).  So it's
possible there's been conflicts introduced since then.  I'm not updating
regularly because that would invalidate the tests (which take an insane
mount of time to run).  I'll certainly re-sync and retest before any
kind of commit is made.  As for control characters, that's really
strange since the patch itself is an attachment.


> 
> Anyway with -Ofast -static the overall codesize increase is ~0.7% on SPEC2017,
> with several above 1% and one over 3%... So -O2 and dynamic linking should
> get you well above 1% on average. This is a bit too much static overhead to
> enable by default. The current version ICEs with a 64KB probe size so I can't
> see whether that has lower overhead. 
WRT the faults with large probe sizes -- that's not a huge surprise.
The probe size is not currently configurable.  If ARM wants to change
the probe interval, be my guest.  But you'll probably need to adjust the
prologue sequences somewhat.

Given the revamp in dynamic space, I'd like to ask you to retest for V3.

> 
> I briefly looked at the generated sequences, the biggest issue is that they
> don't work together and thus do not protect against jumping the stack guard.
> For example both alloca and outgoing arguments can allocate 4KB without
> probing, then call a function which allocates another 3KB on top of that
> (ie. max probe distance is 7KB...).
Examples please?  We should be probing the outgoing args at the probe
interval  once the total static frame is greater than 3k.  The dynamic
space should be probed by generic code.


Let's start with this (smallish frame, small outgoing args, small
dynamic area):

int
foo()
{
  char outgoing1[3 * 1024];
  char *outgoing2 = __builtin_alloca (4 * 1024);

  bar (outgoing1, outgoing2);
}



So yes, the loop for the dynamic space sucks for this case.  I wasn't
going to try too hard to optimize the constant dynamic space case as it
didn't happen on the primary target.  But it's really not that hard.
Done.  So now we have loopless inline probes as an option for dynamic
allocations.  That results in:

foo:
        // Initial probe since the static frame size is > 3k.
        str     xzr, [sp, -8]

        // Allocate the static frame
        sub     sp, sp, #3088

        // save fp/lr
        stp     x29, x30, [sp]

        // Set up frame pointer
        add     x29, sp, 0

        // Allocate 4k chunk of dynamic space
        sub     sp, sp, #4096

        // Argument setup for the call
        add     x0, x29, 16

        // Probe the just allocated space
        str     xzr, [sp, 4088]

        // Allocate the residual dynamic space
        sub     sp, sp, #16

        // More call argument setup
        mov     x1, sp

        // Probe the residual space.
        str     xzr, [sp, 8]


        bl      bar
        add     sp, x29, 0
        ldp     x29, x30, [sp]
        add     sp, sp, 3088
        ret

So I'm not sure what you want changed here.




If we change the dynamic allocation to 16k (large enough to still force
a loop in the dynamic area) we get:

foo:
        // Initial probe since static frame is > 3k
        str     xzr, [sp, -8]

        // Allocate the static frame
        sub     sp, sp, #3088

        // Save lr/fp
        stp     x29, x30, [sp]

        // Set up frame pointer
        add     x29, sp, 0

        // Compute end of loop into x0
        sub     x0, sp, #65536

        // Sigh.  I wish we didn't need to use
        // the temporary here.  It's minor though
        mov     x1, sp

        // we could rotate the loop here...
        cmp     x1, x0
        beq     .L3
.L6:

        // Allocate 4k chunk
        sub     sp, sp, #4096

        // Probe just allocated chunk
        str     xzr, [sp, 4088]

        // Test for end of loop
        mov     x1, sp
        cmp     x1, x0
        bne     .L6
.L3:
        // Allocate and probe residual
        sub     sp, sp, #16
        mov     x1, sp
        add     x0, x29, 16
        str     xzr, [sp, 8]


        bl      bar
        add     sp, x29, 0
        ldp     x29, x30, [sp]
        add     sp, sp, 3088
        ret


The most obvious inefficiency here is the loop could be rotated when its
size is known.  So adding loop rotation when we know it will iterate at
least once results in:

foo:
        // Initial probe
        str     xzr, [sp, -8]

        // Allocate 3088 bytes statically, no additional probing needed
        // for the statically allocated space
        sub     sp, sp, #3088

        // Save fp/lr
        stp     x29, x30, [sp]

        // Setup frame pointer
        add     x29, sp, 0

        // End of loop computation
        sub     x0, sp, #65536
.L2:
        // Allocate 4k
        sub     sp, sp, #4096

        // Compare for end of loop
        mov     x1, sp
        cmp     x1, x0

        //  Probe just allocated chunk
        str     xzr, [sp, 4088]

        // Conditional jump back to start of loop
        bne     .L2

        // Allocate residual
        sub     sp, sp, #16

        // Setup outgoing arguments
        mov     x1, sp
        add     x0, x29, 16

        // Probe residual
        str     xzr, [sp, 8]

        bl      bar
        add     sp, x29, 0
        ldp     x29, x30, [sp]
        add     sp, sp, 3088
        ret


The non-constant case will not have changed, but I'm not sure that
there's really anything in that case to do better.


If I make a huge outgoing argument list things still look sane.


foo:
        // Compute the rounded size of final adjust
        mov     x17, -12288
        movk    x17, 0xfffe, lsl 16

        // Initial probe because total size is large enough to require
        // probing
        str     xzr, [sp, -8]

        // initial adjustment
        sub     sp, sp, #3088

        // Save fp/lr & setup frame pointer
        stp     x29, x30, [sp]
        add     x29, sp, 0

        // Compute ending address for prologue probe loop
        add     x17, sp, x17

        // Prologue probing loop
.LPSRL0:
        // Allocate 4k chunk
        sub     sp, sp, 4096

        // Probe the just allocated chunk
        str     xzr, [sp, 4088]

        // Compare to final address and branch back if
        // not done
        cmp     sp, x17
        b.ne    .LPSRL0


        // Residual allocation
        sub     sp, sp, #2144

        // Argument preparation
        mov     w1, 1

        // Probe the residual allocation
        str     xzr, [sp, 2136]


        // Now allocate and probe inline for the dynamic area
        sub     sp, sp, #4096
        str     xzr, [sp, 4088]
        sub     sp, sp, #4096
        str     xzr, [sp, 4088]
        sub     sp, sp, #4096
        str     xzr, [sp, 4088]
        sub     sp, sp, #4096
        str     xzr, [sp, 4088]

        // Allocate residual for dynamic area
        sub     sp, sp, #16

        // Setup first arg
        add     x0, sp, 65536

        // Probe residual
        str     xzr, [sp, 8]

        // Now ~10000 stores into the outgoing args area and
        // other argument setup
        [ ... ]
        add     x0, x29, 16
        bl      bar

        
Which looks reasonable to my eye.

I need to retest with the improvements to the dynamic space obviously.


> There are also too many probes emitted, for example a function with a 7KB
> frame emits 2 explicit probes when it should emit just 1 (with a 4KB probe 
> size
> we can adjust the stack by 3KB then probe, then by another 4KB, then save
> callee-saves and adjust by up to 1KB for outgoing arguments without inserting
> more probes).
What we should be doing, per your request is emit an initial probe if we
know the function is going to require probing of any form.  Then we emit
probes at 4k intervals.  At least that's how I understood your
simplification.  So for a 7k stack that's two probes -- one at *sp at
the start of the prologue then the second after the first 4k is allocated.





> 
> I don't understand what last_probe_offset is for, it should always be zero
> after saving the callee-saves (though currently it is not set correctly). And 
> it
> has to be set to a fixed value to limit the maximum outgoing args when doing
> final_adjust.
It's supposed to catch the case where neither the initial adjustment nor
final adjustment in and of themselves require probes, but the
combination would.

My understanding was that you didn't want that level of trackign around
the callee saves.  It's also not clear if that will work in the presence
of separate shrink wrapping.

I have no idea what you mean by setting to a fixed value for limit the
maximum outgoing args.  We don't limit the maximum outgoing args -- we
probe that space just like any other as we cross 4k boundaries.

We're clearly not communicating well.  Example would probably help.

> 
> Finally emitting inline loops generates a large amount of code. Although we
> can easily reduce the overhead, especially for alloca, I think using helper
> functions seems best.
I did just revamp the dynamic bits to handle a small number of probes
inline without a loop and to rotate the loop when we know it's going to
be executed.  I'm not planning to use helper functions here.  If you
really think they're the way to go, please submit follow-up code.


I really don't see any issues with the aarch64 specific bits.  If you
can provide examples I will take a look.

jeff

Reply via email to