Hi,

I have been informed that the following check-in may cause a slight performance regression on aarch64 and arm, on trunk and the gcc-4_9-branch: see https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=205899
This can be seen with the following test example:
test.c:
typedef struct
{
  int *x;
  int *y;
  int *z;
  int *m;
  int *n;
  int *o;
} A;

typedef struct
{
  int x;
  int y;
  int z;
  int m;
  int n;
  int o;
} B;

A a[6];
B *b;

void test (int i)
{
  A *a1 = &a[i];
  B *b1 = &b[i];
  a1->x = &b1->x;
  a1->y = &b1->y;
  a1->z = &b1->z;
  a1->m = &b1->m;
  a1->n = &b1->n;
  a1->o = &b1->o;
}
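As a side note (this is only my illustration, not part of the test case): assuming an LP64 target like aarch64-elf, each pointer member of A is 8 bytes and each int member of B is 4 bytes, so sizeof (A) is 48 and sizeof (B) is 24. These are the constants behind the multiply-by-24 (mov w5, 24 feeding smaddl) and the multiply-by-48 shift sequence in the dumps below. A quick way to check those layout assumptions:

#include <stdio.h>
#include <stddef.h>

typedef struct { int *x, *y, *z, *m, *n, *o; } A;  /* six 8-byte pointers */
typedef struct { int x, y, z, m, n, o; } B;        /* six 4-byte ints     */

int main (void)
{
  /* Expected on LP64: 48, 24 and 8.  */
  printf ("sizeof (A)      = %zu\n", sizeof (A));
  printf ("sizeof (B)      = %zu\n", sizeof (B));
  printf ("offsetof (A, y) = %zu\n", offsetof (A, y));
  return 0;
}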
When compiled with aarch64-elf-gcc -O3 -S, we get the following
assembler code:
test:
adrp x1, b
sxtw x3, w0
mov w5, 24
adrp x2, a
add x2, x2, :lo12:a
ldr x4, [x1, #:lo12:b]
lsl x1, x3, 2
sub x1, x1, x3
lsl x1, x1, 4
smaddl x0, w0, w5, x4
add x3, x2, x1
add x7, x0, 8
add x6, x0, 16
add x8, x0, 4
add x5, x0, 12
add x4, x0, 20
str x0, [x2, x1]
mov x0, x3
mov x1, x3
str x8, [x3, 8]
str x7, [x0, 16]!
str x6, [x1, 32]!
str x5, [x0, 8]
str x4, [x1, 8]
ret
Note the two mov instructions, and that two extra registers are used
to store the values. The mov instructions were not there before the patch.
After some investigation I found out how this happened:
The difference is first visible with -fdump-rtl-expand-slim:
1: NOTE_INSN_DELETED
4: NOTE_INSN_BASIC_BLOCK 2
2: r83:SI=x0:SI
3: NOTE_INSN_FUNCTION_BEG
6: r74:DI=sign_extend(r83:SI)
7: r85:DI=high(`b')
8: r84:DI=r85:DI+low(`b')
REG_EQUAL `b'
9: r87:SI=0x18
10: r86:DI=sign_extend(r83:SI)*sign_extend(r87:SI)
11: r88:DI=[r84:DI]
12: r76:DI=r88:DI+r86:DI
13: r90:DI=high(`a')
14: r89:DI=r90:DI+low(`a')
REG_EQUAL `a'
15: r91:DI=sign_extend(r83:SI)
16: r92:DI=r91:DI
17: r93:DI=r92:DI<<0x2
18: r94:DI=r93:DI-r91:DI
REG_EQUAL r91:DI*0x3
19: r95:DI=r94:DI<<0x4
20: r94:DI=r95:DI
REG_EQUAL r91:DI*0x30
21: r96:DI=r89:DI+r94:DI
22: [r96:DI]=r76:DI
23: r98:DI=high(`a')
24: r97:DI=r98:DI+low(`a')
REG_EQUAL `a'
25: r99:DI=sign_extend(r83:SI)
26: r100:DI=r99:DI
27: r101:DI=r100:DI<<0x2
28: r102:DI=r101:DI-r99:DI
REG_EQUAL r99:DI*0x3
29: r103:DI=r102:DI<<0x4
30: r102:DI=r103:DI
REG_EQUAL r99:DI*0x30
31: r104:DI=r97:DI+r102:DI
32: r105:DI=r76:DI+0x4
33: [r104:DI+0x8]=r105:DI
34: r107:DI=high(`a')
35: r106:DI=r107:DI+low(`a')
REG_EQUAL `a'
36: r108:DI=sign_extend(r83:SI)
37: r109:DI=r108:DI
38: r110:DI=r109:DI<<0x2
39: r111:DI=r110:DI-r108:DI
REG_EQUAL r108:DI*0x3
40: r112:DI=r111:DI<<0x4
41: r111:DI=r112:DI
REG_EQUAL r108:DI*0x30
42: r113:DI=r106:DI+r111:DI
43: r114:DI=r113:DI+0x10
44: r115:DI=r76:DI+0x8
45: [r114:DI]=r115:DI
46: r117:DI=high(`a')
47: r116:DI=r117:DI+low(`a')
REG_EQUAL `a'
48: r118:DI=sign_extend(r83:SI)
49: r119:DI=r118:DI
50: r120:DI=r119:DI<<0x2
51: r121:DI=r120:DI-r118:DI
REG_EQUAL r118:DI*0x3
52: r122:DI=r121:DI<<0x4
53: r121:DI=r122:DI
REG_EQUAL r118:DI*0x30
54: r123:DI=r116:DI+r121:DI
55: r124:DI=r123:DI+0x10
56: r125:DI=r76:DI+0xc
57: [r124:DI+0x8]=r125:DI
58: r127:DI=high(`a')
59: r126:DI=r127:DI+low(`a')
REG_EQUAL `a'
60: r128:DI=sign_extend(r83:SI)
61: r129:DI=r128:DI
62: r130:DI=r129:DI<<0x2
63: r131:DI=r130:DI-r128:DI
REG_EQUAL r128:DI*0x3
64: r132:DI=r131:DI<<0x4
65: r131:DI=r132:DI
REG_EQUAL r128:DI*0x30
66: r133:DI=r126:DI+r131:DI
67: r134:DI=r133:DI+0x20
68: r135:DI=r76:DI+0x10
69: [r134:DI]=r135:DI
70: r137:DI=high(`a')
71: r136:DI=r137:DI+low(`a')
REG_EQUAL `a'
72: r138:DI=sign_extend(r83:SI)
73: r139:DI=r138:DI
74: r140:DI=r139:DI<<0x2
75: r141:DI=r140:DI-r138:DI
REG_EQUAL r138:DI*0x3
76: r142:DI=r141:DI<<0x4
77: r141:DI=r142:DI
REG_EQUAL r138:DI*0x30
78: r143:DI=r136:DI+r141:DI
79: r144:DI=r143:DI+0x20
80: r145:DI=r76:DI+0x14
81: [r144:DI+0x8]=r145:DI
Note the +0x8 offset on insns 33, 57 and 81, but not on insns 22, 45 and 69.
This artefact was not there before the patch.
Well, this causes small ripples in the later RTL passes.
There is one pass that does a pretty good job of compensating
for this effect: forward propagation. In simple cases the resulting code
does not look any different, because fwprop folds all the offsets together
and the result looks as before. In this example, however, fwprop could
not remove some "dead" statements, and that made the auto-inc-dec pass
insert the mov instructions.
So, what caused this?
You probably remember this statement in expr.c, which we did not fully
understand, but which we did not want to remove because it could have
subtle effects on code quality:
  /* The check for a constant address in TO_RTX not having VOIDmode
     is probably no longer necessary.  */
  if (MEM_P (to_rtx)
      && GET_MODE (to_rtx) == BLKmode
      && GET_MODE (XEXP (to_rtx, 0)) != VOIDmode
      && bitsize > 0
      && (bitpos % bitsize) == 0
      && (bitsize % GET_MODE_ALIGNMENT (mode1)) == 0
      && MEM_ALIGN (to_rtx) == GET_MODE_ALIGNMENT (mode1))
    {
      to_rtx = adjust_address (to_rtx, mode1, bitpos / BITS_PER_UNIT);
      bitregion_start = 0;
      if (bitregion_end >= (unsigned HOST_WIDE_INT) bitpos)
        bitregion_end -= bitpos;
      bitpos = 0;
    }
Now I see that my patch had an influence on the trigger condition of this
statement:
   if (MEM_P (to_rtx))
     {
-      if (volatilep && flag_strict_volatile_bitfields > 0)
+      if (mode1 != VOIDmode)
         to_rtx = adjust_address (to_rtx, mode1, 0);
       else if (GET_MODE (to_rtx) == VOIDmode)
As it turns out, this change made GET_MODE (to_rtx) different from BLKmode
in most cases, so the first block quoted above no longer triggers.
In this example, get_inner_reference returns offset != NULL, and
bitpos is either 0 or 64, alternating with every second structure member.
When the adjust_address (to_rtx, mode1, bitpos / BITS_PER_UNIT) in that
block is executed, the bitpos is folded into the to_rtx, and the generated
RTL is much more uniform.
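To put concrete numbers on that (my own stand-alone illustration, not GCC code): for a1->y, the second member, bitpos is 64, and folding it means converting it to a byte offset of bitpos / BITS_PER_UNIT = 8 that is absorbed into the address, so store_field then only sees bitpos == 0:

#include <stdio.h>

#define BITS_PER_UNIT 8  /* byte size in bits, as on the targets discussed here */

int main (void)
{
  long bitpos = 64;                       /* bit position reported for a1->y      */
  long bytepos = bitpos / BITS_PER_UNIT;  /* 8: this gets folded into the address */
  bitpos = 0;                             /* store_field then sees bitpos == 0    */
  printf ("bytepos = %ld, bitpos = %ld\n", bytepos, bitpos);
  return 0;
}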
So the condition of that if-statement has to be rewritten so that the
bitpos is munged into the to_rtx, together with the offset, whenever
store_field can be expected to emit a single move instruction;
that primarily means the alignment is OK.
It should not depend on volatileness, and a larger alignment
should also be no reason to bail out here.
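To make the direction concrete, here is a rough sketch of a condition along those lines. This is only a sketch to illustrate the idea, not the attached patch itself, and the fragment of course only makes sense inside expand_assignment:

  /* Sketch only: fold a non-zero, sufficiently aligned bitpos into the
     to_rtx when an offset is used at the same time, independent of
     volatileness and also when MEM_ALIGN is larger than required.  */
  if (mode1 != VOIDmode
      && bitpos != 0
      && bitsize > 0
      && (bitpos % BITS_PER_UNIT) == 0
      && (bitpos % bitsize) == 0
      && (bitsize % GET_MODE_ALIGNMENT (mode1)) == 0
      && MEM_ALIGN (to_rtx) >= GET_MODE_ALIGNMENT (mode1))
    {
      to_rtx = adjust_address (to_rtx, mode1, bitpos / BITS_PER_UNIT);
      bitregion_start = 0;
      if (bitregion_end >= (unsigned HOST_WIDE_INT) bitpos)
        bitregion_end -= bitpos;
      bitpos = 0;
    }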
Sorry for this lengthy explanation!
So I'd like to rewrite the condition here to something reasonable.
See the attached patch.
With this patch applied, the resulting code is much better again:
test:
adrp x1, b
sxtw x2, w0
adrp x3, a
mov w5, 24
add x3, x3, :lo12:a
ldr x4, [x1, #:lo12:b]
lsl x1, x2, 2
sub x1, x1, x2
lsl x1, x1, 4
add x2, x3, x1
smaddl x0, w0, w5, x4
str x0, [x3, x1]
add x8, x0, 4
add x7, x0, 8
add x6, x0, 12
add x5, x0, 16
add x4, x0, 20
str x8, [x2, 8]
str x7, [x2, 16]
str x6, [x2, 24]
str x5, [x2, 32]
str x4, [x2, 40]
ret
Boot-strapped and regression-tested on x86_64 and arm-linux-gnueabihf.
OK for trunk?
PS: I wonder if this patch should also go into the 4.9 branch, after a while of
course.
What do you think?
Thanks
Bernd.
patch-expr.diff
2014-05-27  Bernd Edlinger  <[email protected]>

        * expr.c (expand_assignment): Fold the bitpos in the to_rtx if
        sufficiently aligned and an offset is used at the same time.
        (expand_expr_real_1): Likewise.
