Hi,

On Wed, 2014-03-26 at 08:58 +0100, Christian Bruel wrote:

> This patch adds a few instructions to the inlined builtin_strlen to
> unroll the remaining bytes for the word-at-a-time loop. This gives
> 2 distinct execution paths (no fall-thru into the byte-at-a-time loop),
> allowing block alignment assignment. This partially improves the
> problem reported by Oleg in [Bug target/0539] New: [SH] builtin
> string functions ignore loop and label alignment

Actually, my original concern was the (mis)alignment of the 4 byte inner
loop.  AFAIR it's better for the SH pipeline if the first insn of a loop
is 4 byte aligned.

> 
> whereas the test now expands (-O2 -m4) as
>         mov     r4,r0
>         tst     #3,r0
>         mov     r4,r2
>         bf/s    .L12
>         mov     r4,r3
>         mov     #0,r2
> .L4:
>         mov.l   @r4+,r1
>         cmp/str r2,r1
>         bf      .L4
>         add     #-4,r4
>         mov.b   @r4,r1
>         tst     r1,r1
>         bt      .L2
>         add     #1,r4
>         mov.b   @r4,r1
>         tst     r1,r1
>         bt      .L2
>         add     #1,r4
>         mov.b   @r4,r1
>         tst     r1,r1
>         mov     #-1,r1
>         negc    r1,r1
>         add     r1,r4
> .L2:
>         mov     r4,r0
>         rts
>         sub     r3,r0
>         .align 1
> .L12:
>         mov.b   @r4+,r1
>         tst     r1,r1
>         bf/s    .L12
>         mov     r2,r3
>         add     #1,r3
>         mov     r4,r0
>         rts
>         sub     r3,r0
> 
> 
> The best gain I got compared to the "compact" version is ~1% for a c++
> regular expression benchmark, but well, code looks best this way.

I haven't done any measurements but doesn't this introduce some
performance regressions here and there due to the increased code size?
Maybe the byte unrolling should not be done at -O2 but at -O3?

Moreover, post-inc addressing on the bytes could be used.  Ideally we'd
get something like this:

        mov     r4,r0
        tst     #3,r0
        bf/s    .L12
        mov     r4,r3
        mov     #0,r2
.L4:
        mov.l   @r4+,r1
        cmp/str r2,r1
        bf      .L4

        add     #-4,r4

        mov.b   @r4+,r1
        tst     r1,r1
        bt      .L2

        mov.b   @r4+,r1
        tst     r1,r1
        bt      .L2

        mov.b   @r4+,r1
        tst     r1,r1
        mov     #-1,r1
        subc    r1,r4
        sett
.L2:
        mov     r4,r0
        rts
        subc    r3,r0
        .align 1
.L12:
        mov.b   @r4+,r1
        tst     r1,r1
        bf     .L12

        mov     r4,r0
        rts
        subc    r3,r0
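
To make the intended control flow easier to check, here is a rough C model
of the sequence above.  It is only a sketch: the names strlen_sketch and
has_zero_byte are made up for illustration, the bit trick stands in for
cmp/str against a zero register, and the word load assumes that reading up
to 3 bytes past the terminator inside the same aligned word is harmless
(fine for the inlined SH code, not guaranteed by ISO C).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: stands in for "cmp/str r2,r1" with r2 == 0,
   i.e. true if any of the 4 bytes of w is zero.  */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Hypothetical model of the inlined strlen expansion.  */
size_t strlen_sketch(const char *s)
{
    const char *p = s;

    if (((uintptr_t)p & 3) != 0) {
        /* Unaligned start: byte-at-a-time loop with post-increment
           (the .L12 path).  p ends up one past the terminator.  */
        while (*p++ != '\0')
            ;
        return (size_t)(p - s) - 1;
    }

    /* Aligned start: word-at-a-time loop (the .L4 path).
       Assumption: the over-read within the last word is acceptable.  */
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);
        p += 4;
        if (has_zero_byte(w))
            break;
    }
    p -= 4;                       /* add #-4,r4 */

    /* Unrolled tail: three explicit byte checks with post-increment.  */
    if (*p++ == '\0') return (size_t)(p - s) - 1;
    if (*p++ == '\0') return (size_t)(p - s) - 1;
    if (*p++ == '\0') return (size_t)(p - s) - 1;

    /* None of the first three bytes was zero, so the fourth one is.  */
    return (size_t)(p - s);
}
```

The two paths never fall through into each other, which is what lets each
loop head get its own alignment.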


I'll have a look at the missed 'subc' cases.
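
For anyone who wants to double check the T bit arithmetic, a small C model
of subc and negc is below.  This is a sketch written from the documented
semantics (subc Rm,Rn: Rn - Rm - T -> Rn, borrow -> T; negc Rm,Rn:
0 - Rm - T -> Rn, borrow -> T); the names sh_result, sh_subc and sh_negc
are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical result type: the new register value plus the new T bit.  */
typedef struct {
    uint32_t value;
    unsigned t;
} sh_result;

/* Model of "subc Rm,Rn": Rn - Rm - T -> Rn, borrow -> T.  */
static sh_result sh_subc(uint32_t rn, uint32_t rm, unsigned t)
{
    sh_result r;
    uint32_t tmp = rn - rm;

    r.value = tmp - t;
    /* Borrow out of either of the two subtractions sets T.  */
    r.t = (rn < tmp) || (tmp < r.value);
    return r;
}

/* Model of "negc Rm,Rn": 0 - Rm - T -> Rn, borrow -> T.  */
static sh_result sh_negc(uint32_t rm, unsigned t)
{
    return sh_subc(0, rm, t);
}
```

With r1 = -1, negc followed by add gives r4 + (1 - T), and subc r1,r4 gives
the same r4 + 1 - T in one insn.  The final subc r3,r0 with T = 1 yields
r4 - r3 - 1, which accounts for the post-increment past the terminator.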

Cheers,
Oleg
