On Tue, May 27, 2014 at 10:10 AM, Bernd Edlinger
<[email protected]> wrote:
> Hi,
>
> I have been informed that the following check-in may cause a slight
> performance regression on aarch64 and arm, on trunk and on the
> gcc-4_9-branch:
> See https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=205899
>
> This can be seen with the following test example:
>
> test.c:
> typedef struct
> {
>   int *x;
>   int *y;
>   int *z;
>   int *m;
>   int *n;
>   int *o;
> } A;
>
> typedef struct
> {
>   int x;
>   int y;
>   int z;
>   int m;
>   int n;
>   int o;
> } B;
>
> A a[6];
> B *b;
>
> void test(int i)
> {
>   A *a1 = &a[i];
>   B *b1 = &b[i];
>   a1->x = &b1->x;
>   a1->y = &b1->y;
>   a1->z = &b1->z;
>   a1->m = &b1->m;
>   a1->n = &b1->n;
>   a1->o = &b1->o;
> }
>
> When compiled with aarch64-elf-gcc -O3 -S, we get the following
> assembler code:
>
> test:
> adrp x1, b
> sxtw x3, w0
> mov w5, 24
> adrp x2, a
> add x2, x2, :lo12:a
> ldr x4, [x1, #:lo12:b]
> lsl x1, x3, 2
> sub x1, x1, x3
> lsl x1, x1, 4
> smaddl x0, w0, w5, x4
> add x3, x2, x1
> add x7, x0, 8
> add x6, x0, 16
> add x8, x0, 4
> add x5, x0, 12
> add x4, x0, 20
> str x0, [x2, x1]
> mov x0, x3
> mov x1, x3
> str x8, [x3, 8]
> str x7, [x0, 16]!
> str x6, [x1, 32]!
> str x5, [x0, 8]
> str x4, [x1, 8]
> ret
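>
> (For reference, worked out by hand from the source above: on aarch64,
> sizeof (B) is 24 and sizeof (A) is 48, so the smaddl with the constant
> 24 in w5 computes &b[i], and the lsl/sub/lsl sequence computes i * 48
> for &a[i], i.e. ((i << 2) - i) << 4 = 3 * i * 16.)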
>
> Note the two mov instructions, and that two extra registers are tied up
> as base addresses for the stores. The mov instructions were not there
> before the patch.
>
> After some investigation I found out how this happened:
>
> The difference is first visible with -fdump-rtl-expand-slim:
> 1: NOTE_INSN_DELETED
> 4: NOTE_INSN_BASIC_BLOCK 2
> 2: r83:SI=x0:SI
> 3: NOTE_INSN_FUNCTION_BEG
> 6: r74:DI=sign_extend(r83:SI)
> 7: r85:DI=high(`b')
> 8: r84:DI=r85:DI+low(`b')
> REG_EQUAL `b'
> 9: r87:SI=0x18
> 10: r86:DI=sign_extend(r83:SI)*sign_extend(r87:SI)
> 11: r88:DI=[r84:DI]
> 12: r76:DI=r88:DI+r86:DI
> 13: r90:DI=high(`a')
> 14: r89:DI=r90:DI+low(`a')
> REG_EQUAL `a'
> 15: r91:DI=sign_extend(r83:SI)
> 16: r92:DI=r91:DI
> 17: r93:DI=r92:DI<<0x2
> 18: r94:DI=r93:DI-r91:DI
> REG_EQUAL r91:DI*0x3
> 19: r95:DI=r94:DI<<0x4
> 20: r94:DI=r95:DI
> REG_EQUAL r91:DI*0x30
> 21: r96:DI=r89:DI+r94:DI
> 22: [r96:DI]=r76:DI
> 23: r98:DI=high(`a')
> 24: r97:DI=r98:DI+low(`a')
> REG_EQUAL `a'
> 25: r99:DI=sign_extend(r83:SI)
> 26: r100:DI=r99:DI
> 27: r101:DI=r100:DI<<0x2
> 28: r102:DI=r101:DI-r99:DI
> REG_EQUAL r99:DI*0x3
> 29: r103:DI=r102:DI<<0x4
> 30: r102:DI=r103:DI
> REG_EQUAL r99:DI*0x30
> 31: r104:DI=r97:DI+r102:DI
> 32: r105:DI=r76:DI+0x4
> 33: [r104:DI+0x8]=r105:DI
> 34: r107:DI=high(`a')
> 35: r106:DI=r107:DI+low(`a')
> REG_EQUAL `a'
> 36: r108:DI=sign_extend(r83:SI)
> 37: r109:DI=r108:DI
> 38: r110:DI=r109:DI<<0x2
> 39: r111:DI=r110:DI-r108:DI
> REG_EQUAL r108:DI*0x3
> 40: r112:DI=r111:DI<<0x4
> 41: r111:DI=r112:DI
> REG_EQUAL r108:DI*0x30
> 42: r113:DI=r106:DI+r111:DI
> 43: r114:DI=r113:DI+0x10
> 44: r115:DI=r76:DI+0x8
> 45: [r114:DI]=r115:DI
> 46: r117:DI=high(`a')
> 47: r116:DI=r117:DI+low(`a')
> REG_EQUAL `a'
> 48: r118:DI=sign_extend(r83:SI)
> 49: r119:DI=r118:DI
> 50: r120:DI=r119:DI<<0x2
> 51: r121:DI=r120:DI-r118:DI
> REG_EQUAL r118:DI*0x3
> 52: r122:DI=r121:DI<<0x4
> 53: r121:DI=r122:DI
> REG_EQUAL r118:DI*0x30
> 54: r123:DI=r116:DI+r121:DI
> 55: r124:DI=r123:DI+0x10
> 56: r125:DI=r76:DI+0xc
> 57: [r124:DI+0x8]=r125:DI
> 58: r127:DI=high(`a')
> 59: r126:DI=r127:DI+low(`a')
> REG_EQUAL `a'
> 60: r128:DI=sign_extend(r83:SI)
> 61: r129:DI=r128:DI
> 62: r130:DI=r129:DI<<0x2
> 63: r131:DI=r130:DI-r128:DI
> REG_EQUAL r128:DI*0x3
> 64: r132:DI=r131:DI<<0x4
> 65: r131:DI=r132:DI
> REG_EQUAL r128:DI*0x30
> 66: r133:DI=r126:DI+r131:DI
> 67: r134:DI=r133:DI+0x20
> 68: r135:DI=r76:DI+0x10
> 69: [r134:DI]=r135:DI
> 70: r137:DI=high(`a')
> 71: r136:DI=r137:DI+low(`a')
> REG_EQUAL `a'
> 72: r138:DI=sign_extend(r83:SI)
> 73: r139:DI=r138:DI
> 74: r140:DI=r139:DI<<0x2
> 75: r141:DI=r140:DI-r138:DI
> REG_EQUAL r138:DI*0x3
> 76: r142:DI=r141:DI<<0x4
> 77: r141:DI=r142:DI
> REG_EQUAL r138:DI*0x30
> 78: r143:DI=r136:DI+r141:DI
> 79: r144:DI=r143:DI+0x20
> 80: r145:DI=r76:DI+0x14
> 81: [r144:DI+0x8]=r145:DI
>
> Note the +0x8 offset on insns 33, 57 and 81, but not on insns 22, 45 and 69.
> This artefact was not there before the patch.
>
> Well, this causes small ripples in the later RTL passes.
> There is one pass that does a pretty good job of compensating for
> this effect: forward propagation. In simple cases the resulting code
> does not look any different, because fwprop folds all the offsets
> together and the result looks as before. However, in this example
> fwprop could not remove some "dead" instructions, and that made the
> auto-inc-dec pass insert the mov instructions.
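>
> To connect this to the assembler output above: the pre-indexed stores
> write the incremented address back into their base register, e.g.
>
>         mov     x0, x3
>         str     x7, [x0, 16]!    // x0 = x3 + 16 after this store
>         str     x5, [x0, 8]
>
> so each such chain needs its own writable copy of the base address,
> which is where the two extra mov instructions come from.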
>
> So, what caused this?
>
> You probably remember this statement in expr.c, which we did not fully
> understand, but did not want to remove because it could have subtle
> effects on code quality:
>
> /* The check for a constant address in TO_RTX not having VOIDmode
>    is probably no longer necessary. */
> if (MEM_P (to_rtx)
>     && GET_MODE (to_rtx) == BLKmode
>     && GET_MODE (XEXP (to_rtx, 0)) != VOIDmode
>     && bitsize > 0
>     && (bitpos % bitsize) == 0
>     && (bitsize % GET_MODE_ALIGNMENT (mode1)) == 0
>     && MEM_ALIGN (to_rtx) == GET_MODE_ALIGNMENT (mode1))
>   {
>     to_rtx = adjust_address (to_rtx, mode1, bitpos / BITS_PER_UNIT);
>     bitregion_start = 0;
>     if (bitregion_end >= (unsigned HOST_WIDE_INT) bitpos)
>       bitregion_end -= bitpos;
>     bitpos = 0;
>   }
>
> Now I see that my patch had an influence on the trigger condition of
> this statement:
>
>   if (MEM_P (to_rtx))
>     {
> -     if (volatilep && flag_strict_volatile_bitfields > 0)
> +     if (mode1 != VOIDmode)
>         to_rtx = adjust_address (to_rtx, mode1, 0);
>       else if (GET_MODE (to_rtx) == VOIDmode)
>
>
> As it turns out, this made GET_MODE (to_rtx) differ from BLKmode in
> most cases, so the block above no longer triggers.
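>
> Concretely (my reading of the hunk above, not taken from a dump): before
> the patch, to_rtx would typically still be e.g. (mem:BLK (reg X)) at this
> point, so the BLKmode test matched; now adjust_address (to_rtx, mode1, 0)
> rebuilds it as e.g. (mem:DI (reg X)), the test no longer matches, and the
> bitpos is left to be handled later.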
>
> In this example, get_inner_reference returns offset != NULL, and
> bitpos alternates between 0 and 64 from one structure member to the next.
>
> When the adjust_address in that block is executed, the bitpos is folded
> into to_rtx, and the generated RTL is much more uniform.
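>
> As a worked example (reconstructed by hand, see insns 54-57 above): for
> a1->m, sizeof (A) is 48 and the field sits at byte 24, which is split as
> 16 bytes of offset plus 64 bits of bitpos, so get_inner_reference gives
> offset = i * 48 + 16 and bitpos = 64. When the block above is taken,
> adjust_address folds the remaining 8 bytes into the address of to_rtx,
> so all six stores expand with the same shape instead of every second one
> carrying a stray +0x8.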
>
> So the condition of the if-statement has to be rewritten so that the
> bitpos is folded into to_rtx, together with the offset, whenever
> store_field can be expected to emit a single move instruction;
> that primarily means the alignment is OK.
> It should not depend on volatility, and a larger alignment should
> not be a reason to bail out here either.
>
> Sorry for this lengthy explanation!
>
> So I'd like to re-write the condition here, to something reasonable.
> See the attached patch:
>
> With this patch applied the resulting code is much better again:
>
> test:
> adrp x1, b
> sxtw x2, w0
> adrp x3, a
> mov w5, 24
> add x3, x3, :lo12:a
> ldr x4, [x1, #:lo12:b]
> lsl x1, x2, 2
> sub x1, x1, x2
> lsl x1, x1, 4
> add x2, x3, x1
> smaddl x0, w0, w5, x4
> str x0, [x3, x1]
> add x8, x0, 4
> add x7, x0, 8
> add x6, x0, 12
> add x5, x0, 16
> add x4, x0, 20
> str x8, [x2, 8]
> str x7, [x2, 16]
> str x6, [x2, 24]
> str x5, [x2, 32]
> str x4, [x2, 40]
> ret
>
>
> Bootstrapped and regression-tested on x86_64 and arm-linux-gnueabihf.
> Ok for trunk?
But the comment previously read

  /* A constant address in TO_RTX can have VOIDmode, we must not try
     to call force_reg for that case. Avoid that case. */

and you are removing that check. So I guess you want to retain

  && GET_MODE (XEXP (to_rtx, 0)) != VOIDmode

or investigate why we no longer need to avoid this case. In fact,
that's probably the only change necessary, just drop the == BLKmode
check? Your lengthy description doesn't reveal whether you investigated
that. It seems to me it tries to avoid adjusting a MEM (symbol-ref), for
example? Maybe also for optimization reasons?
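Something like the following (completely untested, just to illustrate that
minimal variant) would be to keep the body of the block as it is and use

  if (MEM_P (to_rtx)
      && GET_MODE (XEXP (to_rtx, 0)) != VOIDmode
      && bitsize > 0
      && (bitpos % bitsize) == 0
      && (bitsize % GET_MODE_ALIGNMENT (mode1)) == 0
      && MEM_ALIGN (to_rtx) == GET_MODE_ALIGNMENT (mode1))
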
Adding a new comment before the block reflecting your analysis above
would be nice as well.
Thanks,
Richard.
>
> PS: I wonder if this patch should also go into the 4.9 branch, after a
> while, of course.
> What do you think?
>
>
> Thanks
> Bernd.
>