On Tue, May 27, 2014 at 10:10 AM, Bernd Edlinger <bernd.edlin...@hotmail.de> wrote:
> Hi,
>
> I have been informed that the following check-in may cause a slight
> performance regression on aarch64 and arm on trunk and gcc-4_9-branch:
> See https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=205899
>
> This can be seen with the following test example:
>
> test.c:
> typedef struct
> {
>   int *x;
>   int *y;
>   int *z;
>   int *m;
>   int *n;
>   int *o;
> }A;
>
> typedef struct
> {
>   int x;
>   int y;
>   int z;
>   int m;
>   int n;
>   int o;
> }B;
>
> A a[6];
> B *b;
>
> void test(int i)
> {
>   A *a1 = &a[i];
>   B *b1 = &b[i];
>   a1->x = &b1->x;
>   a1->y = &b1->y;
>   a1->z = &b1->z;
>   a1->m = &b1->m;
>   a1->n = &b1->n;
>   a1->o = &b1->o;
> }
>
> When compiled with aarch64-elf-gcc -O3 -S we get the following
> assembler code:
>
> test:
>         adrp    x1, b
>         sxtw    x3, w0
>         mov     w5, 24
>         adrp    x2, a
>         add     x2, x2, :lo12:a
>         ldr     x4, [x1, #:lo12:b]
>         lsl     x1, x3, 2
>         sub     x1, x1, x3
>         lsl     x1, x1, 4
>         smaddl  x0, w0, w5, x4
>         add     x3, x2, x1
>         add     x7, x0, 8
>         add     x6, x0, 16
>         add     x8, x0, 4
>         add     x5, x0, 12
>         add     x4, x0, 20
>         str     x0, [x2, x1]
>         mov     x0, x3
>         mov     x1, x3
>         str     x8, [x3, 8]
>         str     x7, [x0, 16]!
>         str     x6, [x1, 32]!
>         str     x5, [x0, 8]
>         str     x4, [x1, 8]
>         ret
>
> Note the two mov instructions, and that two extra registers are used
> to store the values.  The mov instructions were not there before the
> patch.
>
> After some investigation I found out how this happened:
>
> The difference is first visible with -fdump-rtl-expand-slim:
>
>     1: NOTE_INSN_DELETED
>     4: NOTE_INSN_BASIC_BLOCK 2
>     2: r83:SI=x0:SI
>     3: NOTE_INSN_FUNCTION_BEG
>     6: r74:DI=sign_extend(r83:SI)
>     7: r85:DI=high(`b')
>     8: r84:DI=r85:DI+low(`b')
>       REG_EQUAL `b'
>     9: r87:SI=0x18
>    10: r86:DI=sign_extend(r83:SI)*sign_extend(r87:SI)
>    11: r88:DI=[r84:DI]
>    12: r76:DI=r88:DI+r86:DI
>    13: r90:DI=high(`a')
>    14: r89:DI=r90:DI+low(`a')
>       REG_EQUAL `a'
>    15: r91:DI=sign_extend(r83:SI)
>    16: r92:DI=r91:DI
>    17: r93:DI=r92:DI<<0x2
>    18: r94:DI=r93:DI-r91:DI
>       REG_EQUAL r91:DI*0x3
>    19: r95:DI=r94:DI<<0x4
>    20: r94:DI=r95:DI
>       REG_EQUAL r91:DI*0x30
>    21: r96:DI=r89:DI+r94:DI
>    22: [r96:DI]=r76:DI
>    23: r98:DI=high(`a')
>    24: r97:DI=r98:DI+low(`a')
>       REG_EQUAL `a'
>    25: r99:DI=sign_extend(r83:SI)
>    26: r100:DI=r99:DI
>    27: r101:DI=r100:DI<<0x2
>    28: r102:DI=r101:DI-r99:DI
>       REG_EQUAL r99:DI*0x3
>    29: r103:DI=r102:DI<<0x4
>    30: r102:DI=r103:DI
>       REG_EQUAL r99:DI*0x30
>    31: r104:DI=r97:DI+r102:DI
>    32: r105:DI=r76:DI+0x4
>    33: [r104:DI+0x8]=r105:DI
>    34: r107:DI=high(`a')
>    35: r106:DI=r107:DI+low(`a')
>       REG_EQUAL `a'
>    36: r108:DI=sign_extend(r83:SI)
>    37: r109:DI=r108:DI
>    38: r110:DI=r109:DI<<0x2
>    39: r111:DI=r110:DI-r108:DI
>       REG_EQUAL r108:DI*0x3
>    40: r112:DI=r111:DI<<0x4
>    41: r111:DI=r112:DI
>       REG_EQUAL r108:DI*0x30
>    42: r113:DI=r106:DI+r111:DI
>    43: r114:DI=r113:DI+0x10
>    44: r115:DI=r76:DI+0x8
>    45: [r114:DI]=r115:DI
>    46: r117:DI=high(`a')
>    47: r116:DI=r117:DI+low(`a')
>       REG_EQUAL `a'
>    48: r118:DI=sign_extend(r83:SI)
>    49: r119:DI=r118:DI
>    50: r120:DI=r119:DI<<0x2
>    51: r121:DI=r120:DI-r118:DI
>       REG_EQUAL r118:DI*0x3
>    52: r122:DI=r121:DI<<0x4
>    53: r121:DI=r122:DI
>       REG_EQUAL r118:DI*0x30
>    54: r123:DI=r116:DI+r121:DI
>    55: r124:DI=r123:DI+0x10
>    56: r125:DI=r76:DI+0xc
>    57: [r124:DI+0x8]=r125:DI
>    58: r127:DI=high(`a')
>    59: r126:DI=r127:DI+low(`a')
>       REG_EQUAL `a'
>    60: r128:DI=sign_extend(r83:SI)
>    61: r129:DI=r128:DI
>    62: r130:DI=r129:DI<<0x2
>    63: r131:DI=r130:DI-r128:DI
>       REG_EQUAL r128:DI*0x3
>    64: r132:DI=r131:DI<<0x4
>    65: r131:DI=r132:DI
>       REG_EQUAL r128:DI*0x30
>    66: r133:DI=r126:DI+r131:DI
>    67: r134:DI=r133:DI+0x20
>    68: r135:DI=r76:DI+0x10
>    69: [r134:DI]=r135:DI
>    70: r137:DI=high(`a')
>    71: r136:DI=r137:DI+low(`a')
>       REG_EQUAL `a'
>    72: r138:DI=sign_extend(r83:SI)
>    73: r139:DI=r138:DI
>    74: r140:DI=r139:DI<<0x2
>    75: r141:DI=r140:DI-r138:DI
>       REG_EQUAL r138:DI*0x3
>    76: r142:DI=r141:DI<<0x4
>    77: r141:DI=r142:DI
>       REG_EQUAL r138:DI*0x30
>    78: r143:DI=r136:DI+r141:DI
>    79: r144:DI=r143:DI+0x20
>    80: r145:DI=r76:DI+0x14
>    81: [r144:DI+0x8]=r145:DI
>
> Note the offset +0x8 on instructions 33, 57, 81 but not on 22, 45, 69.
> This artefact was not there before the patch.
>
> Well, this causes little ripples in the later rtl passes.
> There is one pass that does a pretty good job at compensating for
> this effect: forward-propagate.  In simple cases the resulting code
> does not look any different, because fwprop folds all offsets together,
> and the result looks as before.  However, in this example fwprop could
> not remove some "dead" statements, and that made the auto-inc-dec pass
> insert the mov statements.
>
> So, what caused this?
>
> You probably remember this statement in expr.c, which we did not fully
> understand, but which we did not want to remove because it could have
> subtle effects on the code quality:
>
>       /* The check for a constant address in TO_RTX not having VOIDmode
>          is probably no longer necessary.  */
>       if (MEM_P (to_rtx)
>           && GET_MODE (to_rtx) == BLKmode
>           && GET_MODE (XEXP (to_rtx, 0)) != VOIDmode
>           && bitsize > 0
>           && (bitpos % bitsize) == 0
>           && (bitsize % GET_MODE_ALIGNMENT (mode1)) == 0
>           && MEM_ALIGN (to_rtx) == GET_MODE_ALIGNMENT (mode1))
>         {
>           to_rtx = adjust_address (to_rtx, mode1, bitpos / BITS_PER_UNIT);
>           bitregion_start = 0;
>           if (bitregion_end >= (unsigned HOST_WIDE_INT) bitpos)
>             bitregion_end -= bitpos;
>           bitpos = 0;
>         }
>
> Now I see that my patch had an influence on the trigger condition for
> this statement:
>
>        if (MEM_P (to_rtx))
>          {
> -          if (volatilep && flag_strict_volatile_bitfields > 0)
> +          if (mode1 != VOIDmode)
>              to_rtx = adjust_address (to_rtx, mode1, 0);
>            else if (GET_MODE (to_rtx) == VOIDmode)
>
> As it turns out, this made GET_MODE (to_rtx) be unequal to BLKmode
> in most cases.
>
> In this example, get_inner_reference returns offset != NULL, and
> bitpos is either 0 or 64, for every second structure member.
>
> When the adjust_address here is executed, this bitpos is folded into
> the to_rtx, and the generated rtl is much more uniform.
>
> So the condition for the if-statement has to be rewritten, to munge
> the bitpos into the to_rtx, together with the offset, if store_field
> can be expected to emit a single move instruction; that means
> primarily that the alignment is OK.  It should not depend on
> volatility, and a larger alignment should also be no reason to bail
> out here.
>
> Sorry for this lengthy explanation!
>
> So I'd like to rewrite the condition here to something reasonable.
> See the attached patch.
>
> With this patch applied the resulting code is much better again:
>
> test:
>         adrp    x1, b
>         sxtw    x2, w0
>         adrp    x3, a
>         mov     w5, 24
>         add     x3, x3, :lo12:a
>         ldr     x4, [x1, #:lo12:b]
>         lsl     x1, x2, 2
>         sub     x1, x1, x2
>         lsl     x1, x1, 4
>         add     x2, x3, x1
>         smaddl  x0, w0, w5, x4
>         str     x0, [x3, x1]
>         add     x8, x0, 4
>         add     x7, x0, 8
>         add     x6, x0, 12
>         add     x5, x0, 16
>         add     x4, x0, 20
>         str     x8, [x2, 8]
>         str     x7, [x2, 16]
>         str     x6, [x2, 24]
>         str     x5, [x2, 32]
>         str     x4, [x2, 40]
>         ret
>
> Boot-strapped and regression tested on x86_64 and arm-linux-gnueabihf.
> Ok for trunk?
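If I read the description correctly, the rewritten block would amount to
something roughly like the following sketch (this is only my reading of
it; the attached patch is not inlined above, so the exact guard may
differ):

      /* Sketch: fold the constant byte offset into TO_RTX whenever the
         access is suitably aligned for a single move in MODE1,
         independent of volatility, and accept larger-than-needed
         alignment.  */
      if (MEM_P (to_rtx)
          && bitsize > 0
          && (bitpos % bitsize) == 0
          && (bitsize % GET_MODE_ALIGNMENT (mode1)) == 0
          && MEM_ALIGN (to_rtx) >= GET_MODE_ALIGNMENT (mode1))
        {
          to_rtx = adjust_address (to_rtx, mode1, bitpos / BITS_PER_UNIT);
          bitregion_start = 0;
          if (bitregion_end >= (unsigned HOST_WIDE_INT) bitpos)
            bitregion_end -= bitpos;
          bitpos = 0;
        }

That is, both mode checks are dropped and the alignment test is relaxed
from == to >=.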
But the comment previously read

  /* A constant address in TO_RTX can have VOIDmode, we must not try
     to call force_reg for that case.  Avoid that case.  */

and you are removing that check.  So I guess you want to retain

  && GET_MODE (XEXP (to_rtx, 0)) != VOIDmode

or investigate why we don't need to avoid this case anymore.  In fact,
that's probably the only change necessary: just drop the == BLKmode
check?  Your lengthy description doesn't reveal whether you investigated
that.  It seems to me it tries to avoid adjusting a MEM of a symbol_ref,
for example?  Maybe also for optimization reasons?  Adding a new comment
before the block reflecting your analysis above would be nice as well.

Thanks,
Richard.

>
> PS: I wonder if this patch should also go into the 4.9 branch, after a
> while of course.
> What do you think?
>
>
> Thanks
> Bernd.
>
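For concreteness, the minimal variant suggested above (keep the
VOIDmode-address guard and drop only the == BLKmode test) would
presumably be just the following; this is an illustration of the
suggestion, not a patch posted in the thread:

   if (MEM_P (to_rtx)
-      && GET_MODE (to_rtx) == BLKmode
       && GET_MODE (XEXP (to_rtx, 0)) != VOIDmode
       && bitsize > 0
       && (bitpos % bitsize) == 0
       && (bitsize % GET_MODE_ALIGNMENT (mode1)) == 0
       && MEM_ALIGN (to_rtx) == GET_MODE_ALIGNMENT (mode1))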