Sudakshina Das <sudi....@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: 30 October 2020 19:56
>> To: Sudakshina Das <sudi....@arm.com>
>> Cc: Wilco Dijkstra <wilco.dijks...@arm.com>; gcc-patches@gcc.gnu.org;
>> Kyrylo Tkachov <kyrylo.tkac...@arm.com>; Richard Earnshaw
>> <richard.earns...@arm.com>
>> Subject: Re: [PATCH] aarch64: Add backend support for expanding
>> __builtin_memset
>>
>> > +  base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
>> > +  dst = adjust_automodify_address (dst, VOIDmode, base, 0);
>> > +
>> > +  /* Prepare the val using a DUP v0.16B, val.  */
>> > +  if (CONST_INT_P (val))
>> > +    {
>> > +      val = force_reg (QImode, val);
>> > +    }
>> > +  src = gen_reg_rtx (V16QImode);
>> > +  emit_insn (gen_aarch64_simd_dupv16qi (src, val));
>>
>> I think we should use:
>>
>>   src = expand_vector_broadcast (V16QImode, val);
>>
>> here (without the CONST_INT_P check), so that for constants we just
>> move a constant directly into a register.
>>
>
> Sorry to bring this up again. When I tried expand_vector_broadcast, I
> see the following behaviour:
> for __builtin_memset(p, 1, 24), where the duplicated constant fits,
>         movi    v0.16b, 0x1
>         mov     x1, 72340172838076673
>         str     x1, [x0, 16]
>         str     q0, [x0]
> and an ICE for __builtin_memset(p, 1, 32), where I am guessing the
> duplicated constant does not fit:
> x.c:7:30: error: unrecognizable insn:
>     7 | { __builtin_memset(p, 1, 32);}
>       |                              ^
> (insn 8 7 0 2 (parallel [
>             (set (mem:V16QI (reg:DI 94) [0 MEM <char[1:32]> [(void *)p_2(D)]+0 S16 A8])
>                 (const_vector:V16QI [
>                         (const_int 1 [0x1]) repeated x16
>                     ]))
>             (set (mem:V16QI (plus:DI (reg:DI 94)
>                         (const_int 16 [0x10])) [0 MEM <char[1:32]> [(void *)p_2(D)]+16 S16 A8])
>                 (const_vector:V16QI [
>                         (const_int 1 [0x1]) repeated x16
>                     ]))
>         ]) "x.c":7:3 -1
>      (nil))
> during RTL pass: vregs

Ah, yeah, I guess we need to call force_reg on the result.
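Something like this, say (just a sketch, untested, using the same
V16QImode src as in the patch above):

  /* Broadcast VAL to all sixteen byte lanes.  For a constant VAL this
     can return a bare CONST_VECTOR, which the stores can't use
     directly, so copy it into a register; force_reg returns SRC
     unchanged if it is already a register.  */
  src = expand_vector_broadcast (V16QImode, val);
  src = force_reg (V16QImode, src);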
>> So yeah, I'm certainly not questioning the speed_p value of 256.
>> I'm sure you and Wilco have picked the best value for that.  But -Os
>> stuff can usually be justified on first principles and I wasn't sure
>> where the value of 128 came from.
>>
>
> I had another chat with Wilco about the 128-byte value for !speed_p.
> We estimate the average number of instructions up to 128 bytes to be
> ~3, which is similar to the cost of a memset call. But I did go back
> and think about the tuning argument of
> AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS a bit more, because you are right
> that based on that the average instruction count can double. I would
> propose using 256/128 based on speed_p but halving the value based on
> the tune parameter. Obviously the assumption here is that we are
> respecting the core's choice of avoiding stp of q registers (given
> that I do not see other uses of AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS
> being changed by -Os).

Yeah, but I think the lack of an -Os check in the existing code might
be a mistake.  The point is that STP Q is smaller than two separate
STR Qs, so using it is a size optimisation even if it's not a speed
optimisation.  And like I say, -Os isn't supposed to be striking a
balance between size and speed: it's supposed to be going for size
quite aggressively.

So TBH I have a slight preference for keeping the current value and
only checking the tuning flag for speed_p.  But I agree that halving
the value would be self-consistent, so if you or Wilco believe strongly
that halving is better, that'd be OK with me too.

> There might be a debate on how useful
> AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS is in the context of
> memset/memcpy, but that needs more analysis and I would say should be
> a separate patch.

Agreed.
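For concreteness, what I have in mind is something like this (just a
sketch, with made-up variable names, not code from the patch):

  /* Keep the current 256/128 limits.  STP Q is smaller than two
     STR Qs, so -Os keeps the full limit; only the speed path respects
     the core's preference to avoid Q-register pairs, here by
     halving.  */
  unsigned max_set_size = speed_p ? 256 : 128;
  if (speed_p
      && (aarch64_tune_params.extra_tuning_flags
          & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
    max_set_size /= 2;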
>> >> > +  if (n > 0 && n < copy_limit / 2)
>> >> > +    {
>> >> > +      next_mode = smallest_mode_for_size (n, MODE_INT);
>> >> > +      /* Last 1-byte causes the compiler to optimize to STRB when it
>> >> > +         should use STR Bx, [mem] since we already used SIMD registers.
>> >> > +         So force it to HImode.  */
>> >> > +      if (next_mode == QImode)
>> >> > +        next_mode = HImode;
>> >>
>> >> Is this always better?  E.g. for variable inputs and zero it seems
>> >> quite natural to store the original scalar GPR.
>> >>
>> >> If we do do this, I think we should assert before the loop that
>> >> n > 1.
>> >>
>> >> Also, it would be good to cover this case in the tests.
>> >
>> > To give a background on this:
>> > The case in point here is when we are copying the _last_ 1 byte.
>> > So for the following
>> > void foo (void *p) { __builtin_memset (p, 1, 3); }
>> > the compiler was generating
>> >         movi    v0.16b, 0x1
>> >         mov     w1, 1
>> >         strb    w1, [x0, 2]
>> >         str     h0, [x0]
>> >         ret
>> > This is because after my expansion, subsequent passes would see
>> > (insn 13 12 14 2 (set (reg:QI 99)
>> >         (subreg:QI (reg:V16QI 98) 0)) "x.c":3:3 -1
>> >      (nil))
>> > (insn 14 13 0 2 (set (mem:QI (plus:DI (reg:DI 93)
>> >                 (const_int 2 [0x2])) [0 MEM <char[1:3]> [(void *)p_2(D)]+2 S1 A8])
>> >         (reg:QI 99)) "x.c":3:3 -1
>> >      (nil))
>> > and "optimize" it away to strb with an extra mov. Ideally this is a
>> > separate patch to fix this somewhere between cse1 and fwprop1 and
>> > emit
>> >         movi    v0.16b, 0x1
>> >         str     h0, [x0]
>> >         str     b0, [x0, 2]
>> >         ret
>> > Forcing HImode was my temporary workaround for now, and with it we
>> > generate:
>> >         movi    v0.16b, 0x1
>> >         str     h0, [x0]
>> >         str     h0, [x0, 1]
>> >         ret
>> >
>> > I hope this clarifies things.

>> Yeah, this was the case I was expecting it was aimed at, and I can
>> see why we want to do it (at least for -Os).  My concern was more
>> about:
>>
>> > After we have used a MOVI/DUP (for variable or non-zero constant
>> > cases), which is needed for anything above 1 byte memset, it isn't
>> > really beneficial to use GPR regs.
>>
>> I guess this is all very uarch-dependent, but it seemed odd for, say:
>>
>>   memset (x, y, 9);
>>
>> to generate:
>>
>>         dup     v0.8b, w1
>>         str     q0, [x0]
>>         strh    h0, [x0, #7]
>>
>> with an overlapping store and an extra dependency on the input to the
>> final store, instead of:
>>
>>         dup     v0.8b, w1
>>         str     q0, [x0]
>>         strb    w1, [x0, #8]
>>
>> > For the zero case, the compiler in later passes seems to figure out
>> > using wzr despite this change, but uses strh instead of strb. For
>> > example, for zero-setting 65 bytes:
>> >         movi    v0.4s, 0
>> >         stp     q0, q0, [x0, 32]
>> >         stp     q0, q0, [x0]
>> >         strh    wzr, [x0, 63]
>> >         ret
>>
>> Here too it just feels more natural to use an strb at offset 64.
>> I know that's not a good justification though. :-)
>>
>> Also, I'm not sure how robust this workaround will be.  The RTL
>> optimisers might get "smarter" and see through the HImode version
>> too.
>>
>
> Hmm, you are right that maybe taking the overlapping store and the
> extra dependency on the input is not ideal. I am happy to keep the
> original code (no forcing to HImode) and file the following two cases
> in a separate bug report as improvements after the patch gets in.
>
> Extra mov with a non-zero constant:
>         movi    v0.16b, 0x1
>         mov     w1, 1
>         strb    w1, [x0, 2]
>         str     h0, [x0]
>
> Not using strb with a variable:
>         dup     v0.8b, w1
>         str     h0, [x0]
>         str     b0, [x0, 2]
> (So where I would expect it to prefer STRB, it does the opposite!)

Huh, yeah.  I guess it's currently not able to see through the subreg.

Thanks,
Richard