Hi James,

Many thanks for the review!

The 06/19/2018 22:23, James Greenhalgh wrote:
> On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> > Hi All,
> >
> > This changes the movmem code in AArch64 that does copy for data between 4
> > and 7 bytes to use the smallest possible mode capable of copying the
> > remaining bytes.
> >
> > for a 5 byte copy.
>
> So... Given that I wrote the code to do the overlapping memmove back in 2014
> https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00585.html I'm confused about
> why we want to achieve this transformation. I can see that we come out
> about even for a 5 or 6 byte copy, but your patch would also force it for a
> 7 byte copy, adding an extra load instruction. That said, the 5 and 6 cases
> probably weren't though through that closely.
>

Actually, the 7-byte case stays the same. This patch generates exactly what we had before:

	ldr	w1, [x0, 48]
	ldr	w0, [x0, 51]
	str	w1, [sp, 8]
	str	w0, [sp, 11]

This patch should never generate more loads or stores than before; in fact it generates the same number or fewer, just with less code. This follows from how the algorithm works and can be shown by induction.

My hypothesis is that the algorithm never issues more than 2 loads and 2 stores for the final n % 32 bytes. The first n - (n % 32) bytes are, as before, handled with TImode loads and stores, issuing as many as required to get to n < 31.

That leaves the cases of n < 31 bytes. The cases with n > 24 and n < 31 are now done using overlapping TImode operations, whereas before they would issue more operations. For example, n = 25 will now overlap where before it would have done 16 + 8 + 1. Similarly, n = 27 would previously have done 16 + 8 + 4 (where the 4 overlaps); now it does 16 + 16, where the second 16 overlaps. And so on.

The algorithm always does the first copy using the largest mode that doesn't over-read, and in the loop it then selects the smallest mode that is larger than or equal to the remainder (there is a small illustrative sketch of this rule further down). For your 7-byte example the first copy is SImode, and SImode is also the smallest mode covering the remaining 3 bytes, so it issues the same two SImode operations as before. I had described this in the comments, but if that wasn't clear I can expand on it.

> I'd like to see a testcase which shows the advantages for CSE you describe
> in your introduction to the patch; I'm not sure your testcase below gives
> that.

The CSE relies on a mid-end change also being accepted. That change is
https://gcc.gnu.org/ml/gcc-patches/2017-11/msg01088.html, which was reverted
due to it not handling corner cases correctly
(https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00239.html). I am performing
regression testing on the new version and will submit it soon.

If we take a more useful example of a program than the ones in the testcase and cover letter:

struct struct5 fun5()
{
  return foo5;
}

where foo5 is a simple 5-byte struct, then after SSA this becomes

struct struct5 fun5()
{
  D.XXXX = foo5
  return D.XXXX;
}

which means two copies are done: the first into D.XXXX and the second by the return. One is handled by the mid-end and one by this back-end code.
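As an aside, since the selection rule is the core of the patch, here is a rough standalone sketch of it in plain C. It works on byte widths instead of machine modes and leaves out the size-optimisation cap; the width table, helper names and offset bookkeeping are just my illustration, not the code in aarch64.c:

/* Purely illustrative sketch, not the patch itself: byte widths stand in
   for the QI/HI/SI/DI/TI modes and the size-optimisation logic is left
   out.  */
#include <stdio.h>

static const int widths[] = { 1, 2, 4, 8, 16 };
#define NWIDTHS ((int) (sizeof widths / sizeof widths[0]))

/* Largest width that does not read past the N remaining bytes.  Used for
   the first copy, which has no earlier copy to overlap with.  */
static int
largest_without_overread (int n)
{
  int best = widths[0];
  for (int i = 0; i < NWIDTHS; i++)
    if (widths[i] <= n)
      best = widths[i];
  return best;
}

/* Smallest width that still covers all N remaining bytes, falling back to
   the widest one when N is larger than anything available.  A width bigger
   than N is fine because the copy is moved back to overlap bytes that have
   already been copied.  */
static int
smallest_covering (int n)
{
  for (int i = 0; i < NWIDTHS; i++)
    if (widths[i] >= n)
      return widths[i];
  return widths[NWIDTHS - 1];
}

/* Print the copies issued for an N-byte block as width@offset pairs.  */
static void
plan_copy (int n)
{
  int offset = 0;
  printf ("%2d bytes:", n);
  while (n > 0)
    {
      int w = offset == 0 ? largest_without_overread (n)
                          : smallest_covering (n);
      /* If W is wider than what is left, shift the copy back so that it
         ends exactly at the end of the block.  */
      int at = n >= w ? offset : offset + n - w;
      printf ("  %d@%d", w, at);
      offset += w;
      n -= w;
    }
  printf ("\n");
}

int
main (void)
{
  plan_copy (5);
  plan_copy (7);
  plan_copy (25);
  plan_copy (27);
  return 0;
}

Running this for the sizes discussed above prints 4@0 1@4, 4@0 4@3, 16@0 16@9 and 16@0 16@11, i.e. the 4 + 1, 4 + 4 (overlapping), 16 + 16 and 16 + 16 decompositions described earlier. Back to the fun5 example: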
Before this patch we generated 16 insns at -O3:

fun5:
	adrp	x1, .LANCHOR0
	add	x1, x1, :lo12:.LANCHOR0
	mov	x0, 0
	sub	sp, sp, #16
	ldrb	w2, [x1, 32]
	ldr	w1, [x1, 33]
	add	sp, sp, 16
	bfi	x0, x2, 0, 8
	ubfx	x3, x1, 8, 8
	bfi	x0, x1, 8, 8
	ubfx	x2, x1, 16, 8
	lsr	w1, w1, 24
	bfi	x0, x3, 16, 8
	bfi	x0, x2, 24, 8
	bfi	x0, x1, 32, 8
	ret

With this patch alone it is still 16 insns, so no regression, but no improvement either:

fun5:
	adrp	x2, .LANCHOR0
	add	x2, x2, :lo12:.LANCHOR0
	mov	x0, 0
	sub	sp, sp, #16
	ldr	w1, [x2, 32]
	ldrb	w2, [x2, 36]
	add	sp, sp, 16
	ubfx	x4, x1, 8, 8
	bfi	x0, x1, 0, 8
	ubfx	x3, x1, 16, 8
	lsr	w1, w1, 24
	bfi	x0, x4, 8, 8
	bfi	x0, x3, 16, 8
	bfi	x0, x1, 24, 8
	bfi	x0, x2, 32, 8
	ret

With this patch and the mid-end patch it becomes 10 insns:

fun5:
	adrp	x1, .LANCHOR0
	add	x1, x1, :lo12:.LANCHOR0
	mov	x0, 0
	sub	sp, sp, #16
	ldr	w2, [x1, 32]
	ldrb	w1, [x1, 36]
	add	sp, sp, 16
	bfi	x0, x2, 0, 32
	bfi	x0, x1, 32, 8
	ret

With the mid-end patch alone and the existing implementation it would be 14 insns:

fun5:
	adrp	x1, .LANCHOR0
	add	x1, x1, :lo12:.LANCHOR0
	sub	sp, sp, #16
	mov	x0, 0
	ldr	w2, [x1, 32]
	ldr	w1, [x1, 33]
	str	w2, [sp, 8]
	str	w1, [sp, 9]
	ldr	w2, [sp, 8]
	ldrb	w1, [sp, 12]
	add	sp, sp, 16
	bfi	x0, x2, 0, 32
	bfi	x0, x1, 32, 8
	ret

The differences become more pronounced with larger copies, e.g. the 10-byte copy case saves 18 instructions. Every function handled by this code path becomes smaller. I can provide the comparison data if you want to see it.

> > Regtested on aarch64-none-elf and .. issues.
>
> Do I draw a zero on the line on my screen with a sharpie or what ;-)

Sorry, it seems I had forgotten to fill this in. My apologies. This should be 0 regressions.

> > Ok for trunk?
>
> Justification and performance impact would be appreciated. Right now I'm not
> convinced this is forward progress.
>
> > gcc/
> > 2018-06-19  Tamar Christina  <tamar.christ...@arm.com>
> >
> > 	* config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
> >
> > gcc/testsuite/
> > 2018-06-19  Tamar Christina  <tamar.christ...@arm.com>
> >
> > 	* gcc.target/aarch64/struct_cpy.c: New.
> >
> > --
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index bd0ac2f04d8f43fd54b58ff9581645493b8d0cd1..ed5409403741fa634d977ba15cd22741bb9d1064 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -16180,26 +16180,29 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
> >  bool
> >  aarch64_expand_movmem (rtx *operands)
> >  {
> > -  unsigned int n;
> > +  int n, mode_bits;
> >    rtx dst = operands[0];
> >    rtx src = operands[1];
> >    rtx base;
> > +  machine_mode cur_mode = BLKmode, next_mode;
> >    bool speed_p = !optimize_function_for_size_p (cfun);
> >
> >    /* When optimizing for size, give a better estimate of the length of a
> > -     memcpy call, but use the default otherwise.  */
> > -  unsigned int max_instructions = (speed_p ? 15 : AARCH64_CALL_RATIO) / 2;
> > +     memcpy call, but use the default otherwise.  Moves larger than 8 bytes
> > +     will always require an even number of instructions to do now.  And each
> > +     operation requires both a load+store, so devide the max number by 2.  */
> > +  int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;
>
> Why did we go from unsigned to signed int?

Because n is now signed. That made it easier to work with in the loop, and n will never be large enough not to fit in a signed integer, since this copy routine wouldn't be called in that case. So I wanted to avoid casts back and forth between signed and unsigned.
I don't believe we benefit from using unsigned types here.

> Thanks,
> James
> --