On Tue, May 27, 2014 at 10:10 AM, Bernd Edlinger
<[email protected]> wrote:
> Hi,
>
> I have been informed that the following check-in may cause a slight
> performance regression on aarch64 and arm, on trunk and on the
> gcc-4_9-branch:
> See https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=205899
>
> This can be seen with the following test example:
>
> test.c:
> typedef struct
> {
>   int *x;
>   int *y;
>   int *z;
>   int *m;
>   int *n;
>   int *o;
> } A;
>
> typedef struct
> {
>   int x;
>   int y;
>   int z;
>   int m;
>   int n;
>   int o;
> } B;
>
> A a[6];
> B *b;
>
> void test(int i)
> {
>   A *a1 = &a[i];
>   B *b1 = &b[i];
>   a1->x = &b1->x;
>   a1->y = &b1->y;
>   a1->z = &b1->z;
>   a1->m = &b1->m;
>   a1->n = &b1->n;
>   a1->o = &b1->o;
> }
>
> When compiled with aarch64-elf-gcc -O3 -S, we get the following
> assembler code:
>
> test:
> adrp x1, b
> sxtw x3, w0
> mov w5, 24
> adrp x2, a
> add x2, x2, :lo12:a
> ldr x4, [x1, #:lo12:b]
> lsl x1, x3, 2
> sub x1, x1, x3
> lsl x1, x1, 4
> smaddl x0, w0, w5, x4
> add x3, x2, x1
> add x7, x0, 8
> add x6, x0, 16
> add x8, x0, 4
> add x5, x0, 12
> add x4, x0, 20
> str x0, [x2, x1]
> mov x0, x3
> mov x1, x3
> str x8, [x3, 8]
> str x7, [x0, 16]!
> str x6, [x1, 32]!
> str x5, [x0, 8]
> str x4, [x1, 8]
> ret
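>
> (For reference, worked out by hand from the source above: on aarch64,
> sizeof (B) is 24 and sizeof (A) is 48, so the smaddl with the constant
> 24 in w5 computes &b[i], and the lsl/sub/lsl sequence computes i * 48
> for &a[i], i.e. ((i << 2) - i) << 4 = 3 * i * 16.)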
>
> Note the two mov instructions, and that two extra registers are tied up
> as base addresses for the stores. The mov instructions were not there
> before the patch.
>
> After some investigation I found out how this happened:
>
> The difference is first visible with -fdump-rtl-expand-slim:
> 1: NOTE_INSN_DELETED
> 4: NOTE_INSN_BASIC_BLOCK 2
> 2: r83:SI=x0:SI
> 3: NOTE_INSN_FUNCTION_BEG
> 6: r74:DI=sign_extend(r83:SI)
> 7: r85:DI=high(`b')
> 8: r84:DI=r85:DI+low(`b')
> REG_EQUAL `b'
> 9: r87:SI=0x18
> 10: r86:DI=sign_extend(r83:SI)*sign_extend(r87:SI)
> 11: r88:DI=[r84:DI]
> 12: r76:DI=r88:DI+r86:DI
> 13: r90:DI=high(`a')
> 14: r89:DI=r90:DI+low(`a')
> REG_EQUAL `a'
> 15: r91:DI=sign_extend(r83:SI)
> 16: r92:DI=r91:DI
> 17: r93:DI=r92:DI<<0x2
> 18: r94:DI=r93:DI-r91:DI
> REG_EQUAL r91:DI*0x3
> 19: r95:DI=r94:DI<<0x4
> 20: r94:DI=r95:DI
> REG_EQUAL r91:DI*0x30
> 21: r96:DI=r89:DI+r94:DI
> 22: [r96:DI]=r76:DI
> 23: r98:DI=high(`a')
> 24: r97:DI=r98:DI+low(`a')
> REG_EQUAL `a'
> 25: r99:DI=sign_extend(r83:SI)
> 26: r100:DI=r99:DI
> 27: r101:DI=r100:DI<<0x2
> 28: r102:DI=r101:DI-r99:DI
> REG_EQUAL r99:DI*0x3
> 29: r103:DI=r102:DI<<0x4
> 30: r102:DI=r103:DI
> REG_EQUAL r99:DI*0x30
> 31: r104:DI=r97:DI+r102:DI
> 32: r105:DI=r76:DI+0x4
> 33: [r104:DI+0x8]=r105:DI
> 34: r107:DI=high(`a')
> 35: r106:DI=r107:DI+low(`a')
> REG_EQUAL `a'
> 36: r108:DI=sign_extend(r83:SI)
> 37: r109:DI=r108:DI
> 38: r110:DI=r109:DI<<0x2
> 39: r111:DI=r110:DI-r108:DI
> REG_EQUAL r108:DI*0x3
> 40: r112:DI=r111:DI<<0x4
> 41: r111:DI=r112:DI
> REG_EQUAL r108:DI*0x30
> 42: r113:DI=r106:DI+r111:DI
> 43: r114:DI=r113:DI+0x10
> 44: r115:DI=r76:DI+0x8
> 45: [r114:DI]=r115:DI
> 46: r117:DI=high(`a')
> 47: r116:DI=r117:DI+low(`a')
> REG_EQUAL `a'
> 48: r118:DI=sign_extend(r83:SI)
> 49: r119:DI=r118:DI
> 50: r120:DI=r119:DI<<0x2
> 51: r121:DI=r120:DI-r118:DI
> REG_EQUAL r118:DI*0x3
> 52: r122:DI=r121:DI<<0x4
> 53: r121:DI=r122:DI
> REG_EQUAL r118:DI*0x30
> 54: r123:DI=r116:DI+r121:DI
> 55: r124:DI=r123:DI+0x10
> 56: r125:DI=r76:DI+0xc
> 57: [r124:DI+0x8]=r125:DI
> 58: r127:DI=high(`a')
> 59: r126:DI=r127:DI+low(`a')
> REG_EQUAL `a'
> 60: r128:DI=sign_extend(r83:SI)
> 61: r129:DI=r128:DI
> 62: r130:DI=r129:DI<<0x2
> 63: r131:DI=r130:DI-r128:DI
> REG_EQUAL r128:DI*0x3
> 64: r132:DI=r131:DI<<0x4
> 65: r131:DI=r132:DI
> REG_EQUAL r128:DI*0x30
> 66: r133:DI=r126:DI+r131:DI
> 67: r134:DI=r133:DI+0x20
> 68: r135:DI=r76:DI+0x10
> 69: [r134:DI]=r135:DI
> 70: r137:DI=high(`a')
> 71: r136:DI=r137:DI+low(`a')
> REG_EQUAL `a'
> 72: r138:DI=sign_extend(r83:SI)
> 73: r139:DI=r138:DI
> 74: r140:DI=r139:DI<<0x2
> 75: r141:DI=r140:DI-r138:DI
> REG_EQUAL r138:DI*0x3
> 76: r142:DI=r141:DI<<0x4
> 77: r141:DI=r142:DI
> REG_EQUAL r138:DI*0x30
> 78: r143:DI=r136:DI+r141:DI
> 79: r144:DI=r143:DI+0x20
> 80: r145:DI=r76:DI+0x14
> 81: [r144:DI+0x8]=r145:DI
>
> Note the +0x8 offset on insns 33, 57 and 81, but not on insns 22, 45 and 69.
> This artefact was not there before the patch.
>
> Well, this causes small ripples in the later RTL passes.
> There is one pass that does a pretty good job of compensating for
> this effect: forward propagation. In simple cases the resulting code
> does not look any different, because fwprop folds all the offsets
> together and the result looks as before. However, in this example
> fwprop could not remove some "dead" instructions, and that made the
> auto-inc-dec pass insert the mov instructions.
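>
> To connect this to the assembler output above: the pre-indexed stores
> write the incremented address back into their base register, e.g.
>
>         mov     x0, x3
>         str     x7, [x0, 16]!    // x0 = x3 + 16 after this store
>         str     x5, [x0, 8]
>
> so each such chain needs its own writable copy of the base address,
> which is where the two extra mov instructions come from.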
>
> So, what caused this?
>
> You probably remember this statement in expr.c, which we did not fully
> understand, but did not want to remove because it could have subtle
> effects on code quality:
>
> /* The check for a constant address in TO_RTX not having VOIDmode
>    is probably no longer necessary. */
> if (MEM_P (to_rtx)
>     && GET_MODE (to_rtx) == BLKmode
>     && GET_MODE (XEXP (to_rtx, 0)) != VOIDmode
>     && bitsize > 0
>     && (bitpos % bitsize) == 0
>     && (bitsize % GET_MODE_ALIGNMENT (mode1)) == 0
>     && MEM_ALIGN (to_rtx) == GET_MODE_ALIGNMENT (mode1))
>   {
>     to_rtx = adjust_address (to_rtx, mode1, bitpos / BITS_PER_UNIT);
>     bitregion_start = 0;
>     if (bitregion_end >= (unsigned HOST_WIDE_INT) bitpos)
>       bitregion_end -= bitpos;
>     bitpos = 0;
>   }
>
> Now I see that my patch had an influence on the trigger condition of
> this statement:
>
>   if (MEM_P (to_rtx))
>     {
> -     if (volatilep && flag_strict_volatile_bitfields > 0)
> +     if (mode1 != VOIDmode)
>         to_rtx = adjust_address (to_rtx, mode1, 0);
>       else if (GET_MODE (to_rtx) == VOIDmode)
>
>
> As it turns out, this made GET_MODE (to_rtx) differ from BLKmode in
> most cases, so the block above no longer triggers.
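>
> Concretely (my reading of the hunk above, not taken from a dump): before
> the patch, to_rtx would typically still be e.g. (mem:BLK (reg X)) at this
> point, so the BLKmode test matched; now adjust_address (to_rtx, mode1, 0)
> rebuilds it as e.g. (mem:DI (reg X)), the test no longer matches, and the
> bitpos is left to be handled later.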
>
> In this example, get_inner_reference returns offset != NULL, and
> bitpos alternates between 0 and 64 from one structure member to the next.
>
> When the adjust_address in that block is executed, the bitpos is folded
> into to_rtx, and the generated RTL is much more uniform.
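>
> As a worked example (reconstructed by hand, see insns 54-57 above): for
> a1->m, sizeof (A) is 48 and the field sits at byte 24, which is split as
> 16 bytes of offset plus 64 bits of bitpos, so get_inner_reference gives
> offset = i * 48 + 16 and bitpos = 64. When the block above is taken,
> adjust_address folds the remaining 8 bytes into the address of to_rtx,
> so all six stores expand with the same shape instead of every second one
> carrying a stray +0x8.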
>
> So the condition of the if-statement has to be rewritten so that the
> bitpos is folded into to_rtx, together with the offset, whenever
> store_field can be expected to emit a single move instruction;
> that primarily means the alignment is OK.
> It should not depend on volatility, and a larger alignment should
> not be a reason to bail out here either.
>
> Sorry for this lengthy explanation!
>
> So I'd like to re-write the condition here, to something reasonable.
> See the attached patch:
>
> With this patch applied the resulting code is much better again:
>
> test:
> adrp x1, b
> sxtw x2, w0
> adrp x3, a
> mov w5, 24
> add x3, x3, :lo12:a
> ldr x4, [x1, #:lo12:b]
> lsl x1, x2, 2
> sub x1, x1, x2
> lsl x1, x1, 4
> add x2, x3, x1
> smaddl x0, w0, w5, x4
> str x0, [x3, x1]
> add x8, x0, 4
> add x7, x0, 8
> add x6, x0, 12
> add x5, x0, 16
> add x4, x0, 20
> str x8, [x2, 8]
> str x7, [x2, 16]
> str x6, [x2, 24]
> str x5, [x2, 32]
> str x4, [x2, 40]
> ret
>
>
> Bootstrapped and regression-tested on x86_64 and arm-linux-gnueabihf.
> Ok for trunk?
But the comment previously read

  /* A constant address in TO_RTX can have VOIDmode, we must not try
     to call force_reg for that case. Avoid that case. */

and you are removing that check. So I guess you want to retain

  && GET_MODE (XEXP (to_rtx, 0)) != VOIDmode

or investigate why we no longer need to avoid this case. In fact,
that's probably the only change necessary, just drop the == BLKmode
check? Your lengthy description doesn't reveal whether you investigated
that. It seems to me it tries to avoid adjusting a MEM (symbol-ref), for
example? Maybe also for optimization reasons?
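Something like the following (completely untested, just to illustrate that
minimal variant) would be to keep the body of the block as it is and use

  if (MEM_P (to_rtx)
      && GET_MODE (XEXP (to_rtx, 0)) != VOIDmode
      && bitsize > 0
      && (bitpos % bitsize) == 0
      && (bitsize % GET_MODE_ALIGNMENT (mode1)) == 0
      && MEM_ALIGN (to_rtx) == GET_MODE_ALIGNMENT (mode1))
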
Adding a new comment before the block reflecting your analysis above
would be nice as well.
Thanks,
Richard.
>
> PS: I wonder if this patch should also go into the 4.9 branch, after a
> while, of course.
> What do you think?
>
>
> Thanks
> Bernd.
>