I tried an extra fwprop pass and got some very interesting results!
First, a caveat: I just cut and pasted the extra pass into the pass list, without
worrying about the details:
NEXT_PASS (pass_rtl_fwprop);
NEXT_PASS (pass_local_alloc);
To show the effects, here are the assembler listings (which are easier to read
than RTL).
(1) Just the splitters, with the normal -O3 passes and no attempt to remove OR Rx,0
(without the splitters the code is almost 2x bigger):
/* prologue: function */
/* frame size = 0 */
0000 FC01 movw r30,r24
.LM2:
0002 9181 ldd r25,Z+1
0004 80E0 ldi r24,lo8(0)
.LVL1:
0006 60E0 ldi r22,lo8(0)
0008 2281 ldd r18,Z+2
000a 30E0 ldi r19,lo8(0)
000c 6060 ori r22,lo8(0)
000e 762F mov r23,r22
0010 822B or r24,r18
0012 932B or r25,r19
0014 2481 ldd r18,Z+4
0016 622B or r22,r18
0018 7060 ori r23,lo8(0)
001a 8060 ori r24,lo8(0)
001c 9060 ori r25,lo8(0)
001e 2381 ldd r18,Z+3
0020 40E0 ldi r20,lo8(0)
0022 6060 ori r22,lo8(0)
0024 722B or r23,r18
0026 832B or r24,r19
0028 942B or r25,r20
/* epilogue start */
.LM3:
002a 0895 ret
(2) Same code, but now with the extra fwprop pass:
/* prologue: function */
/* frame size = 0 */
0000 FC01 movw r30,r24
.LM2:
0002 4181 ldd r20,Z+1
0004 942F mov r25,r20
0006 70E0 ldi r23,lo8(0)
0008 3281 ldd r19,Z+2
000a 832F mov r24,r19
.LVL1:
000c 9060 ori r25,lo8(0)
000e 2481 ldd r18,Z+4
0010 622F mov r22,r18
0012 8060 ori r24,lo8(0)
0014 2381 ldd r18,Z+3
0016 722B or r23,r18
/* epilogue start */
.LM3:
0018 0895 ret
Much better! But note that OR Rx,0 is still being created (there were none
before the fwprop pass). As there are still obvious propagation opportunities, I
suspect that these are being added by local-alloc's propagation after an
imperfect fwprop.
(4) Now with fwprop and a NOP splitter for OR Rx,0:
/* prologue: function */
/* frame size = 0 */
0000 FC01 movw r30,r24
.LM2:
0002 4181 ldd r20,Z+1
0004 942F mov r25,r20
0006 70E0 ldi r23,lo8(0)
0008 3281 ldd r19,Z+2
000a 832F mov r24,r19
.LVL1:
000c 2481 ldd r18,Z+4
000e 622F mov r22,r18
0010 2381 ldd r18,Z+3
0012 722B or r23,r18
/* epilogue start */
.LM3:
0014 0895 ret
No difference apart from the OR Rx,0 removal (as I expected).
(5) And, just for the hell of it, two passes of fwprop before local-alloc, with
no NOP splitter:
NEXT_PASS (pass_rtl_fwprop);
NEXT_PASS (pass_rtl_fwprop);
NEXT_PASS (pass_local_alloc);
/* prologue: function */
/* frame size = 0 */
0000 FC01 movw r30,r24
.LM2:
0002 9181 ldd r25,Z+1
0004 8281 ldd r24,Z+2
.LVL1:
0006 6481 ldd r22,Z+4
0008 7381 ldd r23,Z+3
/* epilogue start */
.LM3:
000a 0895 ret
This is optimal. TADA!
This would indicate that simplify-rtx inside fwprop is removing OR Rx,0, but
fwprop is not picking up the additional forward-propagation opportunities that
this reveals. This would seem to be an avoidable limitation.
Andy