Hi,

On Wed, 2014-03-26 at 08:58 +0100, Christian Bruel wrote:
> This patches adds a few instructions to the inlined builtin_strlen to
> unroll the remaining bytes for word-at-a-time loop. This enables to have
> 2 distinct execution paths (no fall-thru in the byte-at-a-time loop),
> allowing block alignment assignation. This partially improves the
> problem reported with by Oleg. in [Bug target/0539] New: [SH] builtin
> string functions ignore loop and label alignment

Actually, my original concern was the (mis)alignment of the 4 byte inner
loop.  AFAIR it's better for the SH pipeline if the first insn of a loop
is 4 byte aligned.

>
> whereas the test now expands (-O2 -m4) as
>         mov     r4,r0
>         tst     #3,r0
>         mov     r4,r2
>         bf/s    .L12
>         mov     r4,r3
>         mov     #0,r2
> .L4:
>         mov.l   @r4+,r1
>         cmp/str r2,r1
>         bf      .L4
>         add     #-4,r4
>         mov.b   @r4,r1
>         tst     r1,r1
>         bt      .L2
>         add     #1,r4
>         mov.b   @r4,r1
>         tst     r1,r1
>         bt      .L2
>         add     #1,r4
>         mov.b   @r4,r1
>         tst     r1,r1
>         mov     #-1,r1
>         negc    r1,r1
>         add     r1,r4
> .L2:
>         mov     r4,r0
>         rts
>         sub     r3,r0
>         .align 1
> .L12:
>         mov.b   @r4+,r1
>         tst     r1,r1
>         bf/s    .L12
>         mov     r2,r3
>         add     #1,r3
>         mov     r4,r0
>         rts
>         sub     r3,r0
>
>
> Best tuning compared to the "compact" version I got on is ~1% for c++
> regular expression benchmark, but well, code looks best this way.

I haven't done any measurements but doesn't this introduce some
performance regressions here and there due to the increased code size?
Maybe the byte unrolling should not be done at -O2 but at -O3?

Moreover, post-inc addressing on the bytes could be used.  Ideally we'd
get something like this:

        mov     r4,r0
        tst     #3,r0
        bf/s    .L12
        mov     r4,r3
        mov     #0,r2
.L4:
        mov.l   @r4+,r1
        cmp/str r2,r1
        bf      .L4
        add     #-4,r4
        mov.b   @r4+,r1
        tst     r1,r1
        bt      .L2
        mov.b   @r4+,r1
        tst     r1,r1
        bt      .L2
        mov.b   @r4+,r1
        tst     r1,r1
        mov     #-1,r1
        subc    r1,r4
        sett
.L2:
        mov     r4,r0
        rts
        subc    r3,r0
        .align 1
.L12:
        mov.b   @r4+,r1
        tst     r1,r1
        bf      .L12
        mov     r4,r0
        rts
        subc    r3,r0

I'll have a look at the missed 'subc' cases.

Cheers,
Oleg
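
For reference, the two execution paths discussed above can be written out in C roughly as follows. This is only a sketch of the control flow, not what the SH backend emits: the names sketch_strlen and has_zero_byte are made up for illustration, the zero-byte bit trick stands in for cmp/str against a zero register, and the code assumes that reading the whole aligned 32-bit word containing the terminating NUL is safe (which strict ISO C does not guarantee).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if any byte of w is zero.
   Stands in for SH "cmp/str" with the other operand being 0. */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Hypothetical name; mirrors the two paths of the inlined strlen. */
size_t sketch_strlen(const char *s)
{
    const char *p = s;
    uint32_t w;

    if (((uintptr_t)p & 3u) != 0) {
        /* Misaligned start: byte-at-a-time loop (the .L12 path). */
        while (*p != '\0')
            p++;
        return (size_t)(p - s);
    }

    /* Aligned start: word-at-a-time loop (the .L4 path). */
    do {
        memcpy(&w, p, sizeof w);   /* assumes the whole word is readable */
        p += 4;
    } while (!has_zero_byte(w));

    /* Unrolled tail: step back to the word that holds the NUL and
       test bytes 0, 1 and 2; if none is zero, byte 3 must be. */
    p -= 4;
    if (p[0] == '\0')
        return (size_t)(p - s);
    p++;
    if (p[0] == '\0')
        return (size_t)(p - s);
    p++;
    if (p[0] != '\0')
        p++;
    return (size_t)(p - s);
}
```

The last test needs no branch of its own, since it only decides whether to add 0 or 1 to the pointer; that is the part the negc/add (or subc) sequence covers in the listings above.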