On 07/19/2017 11:53 AM, Wilco Dijkstra wrote: > Hi Jeff, > > There is an issue with your AArch64 patch, it fails to apply properly and does > so silently using 'patch'. I also noticed some odd control characters in the > other > patches, but they didn't appear to fail (or at least everything builds). I think everything was done on top of r250005 (July 5). So it's possible there's been conflicts introduced since then. I'm not updating regularly because that would invalidate the tests (which take an insane mount of time to run). I'll certainly re-sync and retest before any kind of commit is made. As for control characters, that's really strange since the patch itself is an attachment.
> > Anyway with -Ofast -static the overall codesize increase is ~0.7% on SPEC2017, > with several above 1% and one over 3%... So -O2 and dynamic linking should > get you well above 1% on average. This is a bit too much static overhead to > enable by default. The current version ICEs with a 64KB probe size so I can't > see whether that has lower overhead. WRT the faults with large probe sizes -- that's not a huge surprise. The probe size is not currently configurable. If ARM wants to change the probe interval, be my guest. But you'll probably need to adjust the prologue sequences somewhat. Given the revamp in dynamic space, I'd like to ask you to retest for V3. > > I briefly looked at the generated sequences, the biggest issue is that they > don't work together and thus do not protect against jumping the stack guard. > For example both alloca and outgoing arguments can allocate 4KB without > probing, then call a function which allocates another 3KB on top of that > (ie. max probe distance is 7KB...). Examples please? We should be probing the outgoing args at the probe interval once the total static frame is greater than 3k. The dynamic space should be probed by generic code. Let's start with this (smallish frame, small outgoing args, small dynamic area): int foo() { char outgoing1[3 * 1024]; char *outgoing2 = __builtin_alloca (4 * 1024); bar (outgoing1, outgoing2); } So yes, the loop for the dynamic space sucks for this case. I wasn't going to try too hard to optimize the constant dynamic space case as it didn't happen on the primary target. But it's really not that hard. Done. So now we have loopless inline probes as an option for dynamic allocations. That results in: foo: // Initial probe since the static frame size is > 3k. str xzr, [sp, -8] // Allocate the static frame sub sp, sp, #3088 // save fp/lr stp x29, x30, [sp] // Set up frame pointer add x29, sp, 0 // Allocate 4k chunk of dynamic space sub sp, sp, #4096 // Argument setup for the call add x0, x29, 16 // Probe the just allocated space str xzr, [sp, 4088] // Allocate the residual dynamic space sub sp, sp, #16 // More call argument setup mov x1, sp // Probe the residual space. str xzr, [sp, 8] bl bar add sp, x29, 0 ldp x29, x30, [sp] add sp, sp, 3088 ret So I'm not sure what you want changed here. If we change the dynamic allocation to 16k (large enough to still force a loop in the dynamic area) we get: foo: // Initial probe since static frame is > 3k str xzr, [sp, -8] // Allocate the static frame sub sp, sp, #3088 // Save lr/fp stp x29, x30, [sp] // Set up frame pointer add x29, sp, 0 // Compute end of loop into x0 sub x0, sp, #65536 // Sigh. I wish we didn't need to use // the temporary here. It's minor though mov x1, sp // we could rotate the loop here... cmp x1, x0 beq .L3 .L6: // Allocate 4k chunk sub sp, sp, #4096 // Probe just allocated chunk str xzr, [sp, 4088] // Test for end of loop mov x1, sp cmp x1, x0 bne .L6 .L3: // Allocate and probe residual sub sp, sp, #16 mov x1, sp add x0, x29, 16 str xzr, [sp, 8] bl bar add sp, x29, 0 ldp x29, x30, [sp] add sp, sp, 3088 ret The most obvious inefficiency here is the loop could be rotated when its size is known. So adding loop rotation when we know it will iterate at least once results in: foo: // Initial probe str xzr, [sp, -8] // Allocate 3088 bytes statically, no additional probing needed // for the statically allocated space sub sp, sp, #3088 // Save fp/lr stp x29, x30, [sp] // Setup frame pointer add x29, sp, 0 // End of loop computation sub x0, sp, #65536 .L2: // Allocate 4k sub sp, sp, #4096 // Compare for end of loop mov x1, sp cmp x1, x0 // Probe just allocated chunk str xzr, [sp, 4088] // Conditional jump back to start of loop bne .L2 // Allocate residual sub sp, sp, #16 // Setup outgoing arguments mov x1, sp add x0, x29, 16 // Probe residual str xzr, [sp, 8] bl bar add sp, x29, 0 ldp x29, x30, [sp] add sp, sp, 3088 ret The non-constant case will not have changed, but I'm not sure that there's really anything in that case to do better. If I make a huge outgoing argument list things still look sane. foo: // Compute the rounded size of final adjust mov x17, -12288 movk x17, 0xfffe, lsl 16 // Initial probe because total size is large enough to require // probing str xzr, [sp, -8] // initial adjustment sub sp, sp, #3088 // Save fp/lr & setup frame pointer stp x29, x30, [sp] add x29, sp, 0 // Compute ending address for prologue probe loop add x17, sp, x17 // Prologue probing loop .LPSRL0: // Allocate 4k chunk sub sp, sp, 4096 // Probe the just allocated chunk str xzr, [sp, 4088] // Compare to final address and branch back if // not done cmp sp, x17 b.ne .LPSRL0 // Residual allocation sub sp, sp, #2144 // Argument preparation mov w1, 1 // Probe the residual allocation str xzr, [sp, 2136] // Now allocate and probe inline for the dynamic area sub sp, sp, #4096 str xzr, [sp, 4088] sub sp, sp, #4096 str xzr, [sp, 4088] sub sp, sp, #4096 str xzr, [sp, 4088] sub sp, sp, #4096 str xzr, [sp, 4088] // Allocate residual for dynamic area sub sp, sp, #16 // Setup first arg add x0, sp, 65536 // Probe residual str xzr, [sp, 8] // Now ~10000 stores into the outgoing args area and // other argument setup [ ... ] add x0, x29, 16 bl bar Which looks reasonable to my eye. I need to retest with the improvements to the dynamic space obviously. > There are also too many probes emitted, for example a function with a 7KB > frame emits 2 explicit probes when it should emit just 1 (with a 4KB probe > size > we can adjust the stack by 3KB then probe, then by another 4KB, then save > callee-saves and adjust by up to 1KB for outgoing arguments without inserting > more probes). What we should be doing, per your request is emit an initial probe if we know the function is going to require probing of any form. Then we emit probes at 4k intervals. At least that's how I understood your simplification. So for a 7k stack that's two probes -- one at *sp at the start of the prologue then the second after the first 4k is allocated. > > I don't understand what last_probe_offset is for, it should always be zero > after saving the callee-saves (though currently it is not set correctly). And > it > has to be set to a fixed value to limit the maximum outgoing args when doing > final_adjust. It's supposed to catch the case where neither the initial adjustment nor final adjustment in and of themselves require probes, but the combination would. My understanding was that you didn't want that level of trackign around the callee saves. It's also not clear if that will work in the presence of separate shrink wrapping. I have no idea what you mean by setting to a fixed value for limit the maximum outgoing args. We don't limit the maximum outgoing args -- we probe that space just like any other as we cross 4k boundaries. We're clearly not communicating well. Example would probably help. > > Finally emitting inline loops generates a large amount of code. Although we > can easily reduce the overhead, especially for alloca, I think using helper > functions seems best. I did just revamp the dynamic bits to handle a small number of probes inline without a loop and to rotate the loop when we know it's going to be executed. I'm not planning to use helper functions here. If you really think they're the way to go, please submit follow-up code. I really don't see any issues with the aarch64 specific bits. If you can provide examples I will take a look. jeff