On 10/13/2017 02:26 PM, Wilco Dijkstra wrote: > Hi, > > To continue the review of the AArch64 frame code I tried a few examples > to figure out what it does now. For initial_adjust <= 63*1024 and > final_adjust < > 1024 there are no probes inserted as expected, ie. the vast majority of > functions are unaffected. So that works perfectly. Right.
> > For larger frames the first oddity is that there are now 2 separate params > controlling how probes are generated: > > stack-clash-protection-guard-size (default 12, but set to 16 on AArch64) > stack-clash-protection-probe-interval (default 12) > > I don't see how this makes sense. These values are closely related, so if > one is different from the other, probing becomes ineffective/incorrect. > For example we generate code that trivially bypasses the guard despite > all the probing: My hope would be that we simply don't ever use the params. They were done as much for *you* to experiment with as anything. I'd happy just delete them as there's essentially no guard rails to ensure their values are sane. > > --param=stack-clash-protection-probe-interval=13 > --param=stack-clash-protection-guard-size=12 > > So if there is a good reason to continue with 2 separate values, we must > force probe interval <= guard size! The param code really isn't designed to enforce values that are inter-dependent. It has a min, max & default values. No more, no less. If you set up something inconsistent with the params, it's simply not going to work. > > Also on AArch64 --param=stack-clash-protection-probe-interval=16 causes > crashes due to the offsets used in the probes - we don't need large offsets > as we want to probe close to the bottom of the stack. Not a surprise. While I tried to handle larger intervals, I certainly didn't test them. Given the ISA I wouldn't expect an interval > 12 to be useful or necessarily even work correctly. > > Functions with a large stack emit like alloca a lot of code, here I used > --param=stack-clash-protection-probe-interval=15: > > int f1(int x) > { > char arr[128*1024]; > return arr[x]; > } > > f1: > mov x16, 64512 > sub sp, sp, x16 > .cfi_def_cfa_offset 64512 > mov x16, -32768 > add sp, sp, x16 > .cfi_def_cfa_offset -1024 > str xzr, [sp, 32760] > add sp, sp, x16 > .cfi_def_cfa_offset -66560 > str xzr, [sp, 32760] > sub sp, sp, #1024 > .cfi_def_cfa_offset -65536 > str xzr, [sp, 1016] > ldrb w0, [sp, w0, sxtw] > .cfi_def_cfa_offset 131072 > add sp, sp, 131072 > .cfi_def_cfa_offset 0 > ret > > Note the cfa offsets are wrong. Yes. They definitely look wrong. There's a clear logic error in setting up the ADJUST_CFA note when the probing interval is larger than 2**12. That should be easily fixed. Let me poke at it. > > There is an odd mix of a big initial adjustment, then some probes+adjustments > and > then a final adjustment and probe for the remainder. I can't see the point of > having > both an initial and remainder adjustment. I would expect this: > > sub sp, sp, 65536 > str xzr, [sp, 1024] > sub sp, sp, 65536 > str xzr, [sp, 1024] > ldrb w0, [sp, w0, sxtw] > add sp, sp, 131072 > ret I'm really not able to justify spending further time optimizing the aarch64 implementation. I've done the best I can. You can take the work as-is or improve it, but I really can't justify further time investment on that architecture. > > > int f2(int x) > { > char arr[128*1024]; > return arr[x]; > } > > f2: > mov x16, 64512 > sub sp, sp, x16 > mov x16, -65536 > movk x16, 0xfffd, lsl 16 > add x16, sp, x16 > .LPSRL0: > sub sp, sp, 4096 > str xzr, [sp, 4088] > cmp sp, x16 > b.ne .LPSRL0 > sub sp, sp, #1024 > str xzr, [sp, 1016] > ldrb w0, [sp, w0, sxtw] > add sp, sp, 262144 > ret > > The cfa entries are OK for this case. There is a mix of positive/negative > offsets which > makes things confusing. Again there are 3 kinds of adjustments when for this > size we > only need the loop. > > Reusing the existing gen_probe_stack_range code appears a bad idea since > it ignores the probe interval and just defaults to 4KB. I don't see why it > should be > any more complex than this: > > sub x16, sp, 262144 // only need temporary if > 1MB > .LPSRL0: > sub sp, sp, 65536 > str xzr, [sp, 1024] > cmp sp, x16 > b.ne .LPSRL0 > ldrb w0, [sp, w0, sxtw] > add sp, sp, 262144 > ret > > Probe insertion if final adjustment >= 1024 also generates a lot of redundant > code - although this is more a theoretical issue given this is so rare. Again, if ARM wants this optimized, then ARM's engineers are going to have to take the lead here. I've invested all I can reasonably invest in terms of trying optimize the probing for this target. jeff