Hi James,

Many thanks for the review!

The 06/19/2018 22:23, James Greenhalgh wrote:
> On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> > Hi All,
> >
> > This changes the movmem code in AArch64 that does copy for data between 4
> > and 7 bytes to use the smallest possible mode capable of copying the
> > remaining bytes.
> >
> > for a 5 byte copy.
>
> So... Given that I wrote the code to do the overlapping memmove back in 2014
> https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00585.html I'm confused about
> why we want to achieve this transformation. I can see that we come out
> about even for a 5 or 6 byte copy, but your patch would also force it for a
> 7 byte copy, adding an extra load instruction. That said, the 5 and 6 cases
> probably weren't though through that closely.
>

Actually, the 7-byte case stays the same. This patch generates exactly what we had before:

	ldr	w1, [x0, 48]
	ldr	w0, [x0, 51]
	str	w1, [sp, 8]
	str	w0, [sp, 11]

This patch should never generate more loads or stores than before; in fact it generates the same number or fewer, just with less code. This follows from how the algorithm works and can be shown by induction.

My hypothesis is that the algorithm never issues more than 2 loads and 2 stores for the final n % 32 bytes. The first n - (n % 32) bytes are, as before, handled with TImode loads and stores, issuing as many as required to get to n < 31.

That leaves the cases of n < 31 bytes. The cases with n > 24 and n < 31 are now done using overlapping TImode operations, whereas before they would issue more operations. For example, n = 25 will now overlap where before it would have done 16 + 8 + 1. Similarly, n = 27 would previously have done 16 + 8 + 4 (where the 4 overlaps); now it does 16 + 16, where the second 16 overlaps. And so on.

The algorithm always does the first copy using the largest mode that doesn't over-read, and in the loop it then selects the smallest mode that is larger than or equal to the remainder (there is a small illustrative sketch of this rule further down). For your 7-byte example the first copy is SImode, and SImode is also the smallest mode covering the remaining 3 bytes, so it issues the same two SImode operations as before. I had described this in the comments, but if that wasn't clear I can expand on it.

> I'd like to see a testcase which shows the advantages for CSE you describe
> in your introduction to the patch; I'm not sure your testcase below gives
> that.

The CSE relies on a mid-end change also being accepted. That change is
https://gcc.gnu.org/ml/gcc-patches/2017-11/msg01088.html, which was reverted
due to it not handling corner cases correctly
(https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00239.html). I am performing
regression testing on the new version and will submit it soon.

If we take a more useful example of a program than the ones in the testcase and cover letter:

struct struct5 fun5()
{
  return foo5;
}

where foo5 is a simple 5-byte struct, then after SSA this becomes

struct struct5 fun5()
{
  D.XXXX = foo5
  return D.XXXX;
}

which means two copies are done: the first into D.XXXX and the second by the return. One is handled by the mid-end and one by this back-end code.
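As an aside, since the selection rule is the core of the patch, here is a rough standalone sketch of it in plain C. It works on byte widths instead of machine modes and leaves out the size-optimisation cap; the width table, helper names and offset bookkeeping are just my illustration, not the code in aarch64.c:

/* Purely illustrative sketch, not the patch itself: byte widths stand in
   for the QI/HI/SI/DI/TI modes and the size-optimisation logic is left
   out.  */
#include <stdio.h>

static const int widths[] = { 1, 2, 4, 8, 16 };
#define NWIDTHS ((int) (sizeof widths / sizeof widths[0]))

/* Largest width that does not read past the N remaining bytes.  Used for
   the first copy, which has no earlier copy to overlap with.  */
static int
largest_without_overread (int n)
{
  int best = widths[0];
  for (int i = 0; i < NWIDTHS; i++)
    if (widths[i] <= n)
      best = widths[i];
  return best;
}

/* Smallest width that still covers all N remaining bytes, falling back to
   the widest one when N is larger than anything available.  A width bigger
   than N is fine because the copy is moved back to overlap bytes that have
   already been copied.  */
static int
smallest_covering (int n)
{
  for (int i = 0; i < NWIDTHS; i++)
    if (widths[i] >= n)
      return widths[i];
  return widths[NWIDTHS - 1];
}

/* Print the copies issued for an N-byte block as width@offset pairs.  */
static void
plan_copy (int n)
{
  int offset = 0;
  printf ("%2d bytes:", n);
  while (n > 0)
    {
      int w = offset == 0 ? largest_without_overread (n)
                          : smallest_covering (n);
      /* If W is wider than what is left, shift the copy back so that it
         ends exactly at the end of the block.  */
      int at = n >= w ? offset : offset + n - w;
      printf ("  %d@%d", w, at);
      offset += w;
      n -= w;
    }
  printf ("\n");
}

int
main (void)
{
  plan_copy (5);
  plan_copy (7);
  plan_copy (25);
  plan_copy (27);
  return 0;
}

Running this for the sizes discussed above prints 4@0 1@4, 4@0 4@3, 16@0 16@9 and 16@0 16@11, i.e. the 4 + 1, 4 + 4 (overlapping), 16 + 16 and 16 + 16 decompositions described earlier. Back to the fun5 example: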
Before this patch we generated 16 insns at -O3:

fun5:
	adrp	x1, .LANCHOR0
	add	x1, x1, :lo12:.LANCHOR0
	mov	x0, 0
	sub	sp, sp, #16
	ldrb	w2, [x1, 32]
	ldr	w1, [x1, 33]
	add	sp, sp, 16
	bfi	x0, x2, 0, 8
	ubfx	x3, x1, 8, 8
	bfi	x0, x1, 8, 8
	ubfx	x2, x1, 16, 8
	lsr	w1, w1, 24
	bfi	x0, x3, 16, 8
	bfi	x0, x2, 24, 8
	bfi	x0, x1, 32, 8
	ret

With this patch alone it is still 16 insns, so no regression, but no improvement either:

fun5:
	adrp	x2, .LANCHOR0
	add	x2, x2, :lo12:.LANCHOR0
	mov	x0, 0
	sub	sp, sp, #16
	ldr	w1, [x2, 32]
	ldrb	w2, [x2, 36]
	add	sp, sp, 16
	ubfx	x4, x1, 8, 8
	bfi	x0, x1, 0, 8
	ubfx	x3, x1, 16, 8
	lsr	w1, w1, 24
	bfi	x0, x4, 8, 8
	bfi	x0, x3, 16, 8
	bfi	x0, x1, 24, 8
	bfi	x0, x2, 32, 8
	ret

With this patch and the mid-end patch it becomes 10 insns:

fun5:
	adrp	x1, .LANCHOR0
	add	x1, x1, :lo12:.LANCHOR0
	mov	x0, 0
	sub	sp, sp, #16
	ldr	w2, [x1, 32]
	ldrb	w1, [x1, 36]
	add	sp, sp, 16
	bfi	x0, x2, 0, 32
	bfi	x0, x1, 32, 8
	ret

With the mid-end patch alone and the existing implementation it would be 14 insns:

fun5:
	adrp	x1, .LANCHOR0
	add	x1, x1, :lo12:.LANCHOR0
	sub	sp, sp, #16
	mov	x0, 0
	ldr	w2, [x1, 32]
	ldr	w1, [x1, 33]
	str	w2, [sp, 8]
	str	w1, [sp, 9]
	ldr	w2, [sp, 8]
	ldrb	w1, [sp, 12]
	add	sp, sp, 16
	bfi	x0, x2, 0, 32
	bfi	x0, x1, 32, 8
	ret

The differences become more pronounced with larger copies, e.g. the 10-byte copy case saves 18 instructions. Every function handled by this code path becomes smaller. I can provide the comparison data if you want to see it.

> > Regtested on aarch64-none-elf and .. issues.
>
> Do I draw a zero on the line on my screen with a sharpie or what ;-)

Sorry, it seems I had forgotten to fill this in. My apologies. This should be 0 regressions.

> > Ok for trunk?
>
> Justification and performance impact would be appreciated. Right now I'm not
> convinced this is forward progress.
>
> > gcc/
> > 2018-06-19  Tamar Christina  <tamar.christ...@arm.com>
> >
> > 	* config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
> >
> > gcc/testsuite/
> > 2018-06-19  Tamar Christina  <tamar.christ...@arm.com>
> >
> > 	* gcc.target/aarch64/struct_cpy.c: New.
> >
> > --
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index bd0ac2f04d8f43fd54b58ff9581645493b8d0cd1..ed5409403741fa634d977ba15cd22741bb9d1064 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -16180,26 +16180,29 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
> >  bool
> >  aarch64_expand_movmem (rtx *operands)
> >  {
> > -  unsigned int n;
> > +  int n, mode_bits;
> >    rtx dst = operands[0];
> >    rtx src = operands[1];
> >    rtx base;
> > +  machine_mode cur_mode = BLKmode, next_mode;
> >    bool speed_p = !optimize_function_for_size_p (cfun);
> >
> >    /* When optimizing for size, give a better estimate of the length of a
> > -     memcpy call, but use the default otherwise.  */
> > -  unsigned int max_instructions = (speed_p ? 15 : AARCH64_CALL_RATIO) / 2;
> > +     memcpy call, but use the default otherwise.  Moves larger than 8 bytes
> > +     will always require an even number of instructions to do now.  And each
> > +     operation requires both a load+store, so devide the max number by 2.  */
> > +  int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;
>
> Why did we go from unsigned to signed int?

Because n is now signed. That made it easier to work with in the loop, and n will never be large enough not to fit in a signed integer, since this copy routine wouldn't be called in that case. So I wanted to avoid casts back and forth between signed and unsigned.
I don't believe we benefit from using unsigned types here.

> Thanks,
> James
> --