On Tue, Apr 29, 2014 at 7:50 AM, Yury Gribov <y.gri...@samsung.com> wrote: > Hi all, > > I've recently noticed that GCC generates suboptimal code for Asan on ARM > targets. E.g. for a 4-byte memory access check > > (shadow_val != 0) & (last_byte >= shadow_val) > > we get the following sequence: > > mov r2, r0, lsr #3 > and r3, r0, #7 > add r3, r3, #3 > add r2, r2, #536870912 > ldrb r2, [r2] @ zero_extendqisi2 > sxtb r2, r2 > cmp r3, r2 > movlt r3, #0 > movge r3, #1 > cmp r2, #0 > moveq r3, #0 > cmp r3, #0 > bne .L5 > ldr r0, [r0] > > Obviously a shorter code is possible: > > mov r3, r0, lsr #3 > and r1, r0, #7 > add r1, r1, #4 > add r3, r3, #536870912 > ldrb r3, [r3] @ zero_extendqisi2 > sxtb r3, r3 > cmp r3, #0 > cmpne r1, r3 > bgt .L5 > ldr r0, [r0] > > A 30% improvement looked quite important given that Asan usually increases > code-size by 1.5-2x so I decided to investigate this. It turned out that ARM > backend already has full support for dominated comparisons (cmp-cmpne-bgt > sequence above) and can generate efficient code if we provide it with a > slightly more explicit gimple sequence: > > (shadow_val != 0) & (last_byte + 1 > shadow_val) > > Ideally backend should be able perform this transform itself. But I'm not > sure this is possible: it needs to know that last_range + 1 can not overflow > and this info is not available in RTL (because we don't have VRP pass > there). > > I have attached a simple patch which changes Asan pass to generate the > ARM-friendly code. I've only bootstrapped/regtested on x64 but I can perform > additional tests on ARM if the patch make sense. As far as I can tell it > does not worsen sanitized code on other platforms (x86/x64) while > significantly improving ARM (15% less code for bzip). > > The patch is certainly not ideal: > * it makes target-specific changes in machine-independent code > * it does not help with 1-byte accesses (forwprop pass thinks that it's > always beneficial to convert x + 1 > y to x >= y so it reverts my change) > * it only improves Asan code whereas it would be great if ARM backend could > improve generic RTL code > but it achieves significant improvement on ARM without hurting other > platforms. > > So my questions are: > * is this kind of target-specific tweaking acceptable in middle-end? > * if not - what would be a better option?
I'd say the place is obviously not going to work well (see your forwprop comment). If this kind of transform is profitable I suggest you look at doing it at RTL expansion time. I suppose the conditional is serialized by fold or ifcombine already so we are expanding from cond1_2 = shadow_val_1 != 0; cond2_3 = last_byte_4 >= shadow_val_1; cond_5 = cond1_2 & cond2_3; if (cond_5 != 0) ... which means cond1_2 and cond2_3 are TERable so you can lookup their definiitions and expand them more efficiently. Richard. > -Y