> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Saturday, October 23, 2021 11:40 AM
> To: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>
> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>; nd <n...@arm.com>; Marcus Shawcroft
> <marcus.shawcr...@arm.com>
> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants
> and operations
>
> Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> I'm still a bit sceptical about treating the high-part cost as lower.
> >> ISTM that the subreg cases are the ones that are truly “free” and any
> >> others should have a normal cost.  So if CSE handled the subreg case
> >> itself (to model how the rtx would actually be generated) then the
> >> aarch64 code would have to do less work.  I imagine that will be true
> >> for other targets as well.
> >
> > I guess the main problem is that CSE lacks context, because it's not
> > until after combine that the high part becomes truly "free", when pushed
> > into a high operation.
>
> Yeah.  And the aarch64 code is just being asked to cost the operation it's
> given, which could for example come from an existing
> aarch64_simd_mov_from_<mode>high.  I think we should try to ensure that
> an aarch64_simd_mov_from_<mode>high followed by some arithmetic on
> the result is more expensive than the fused operation (when fusing is
> possible).
>
> An analogy might be: if the cost code is given:
>
>   (add (reg X) (reg Y))
>
> then, at some later point, the (reg X) might be replaced with a
> multiplication, in which case we'd have a MADD operation and the addition
> is effectively free.  Something similar would happen if (reg X) became a
> shift by a small amount on newer cores, although I guess then you could
> argue either that the cost of the add disappears or that the cost of the
> shift disappears.
>
> But we shouldn't count ADD as free on the basis that it could be combined
> with a multiplication or shift in future.  We have to cost what we're
> given.  I think the same thing applies to the high part.
>
> Here we're trying to prevent cse1 from replacing a DUP (lane) with a MOVI
> by saying that the DUP is strictly cheaper than the MOVI.
> I don't think that's really true though, and the cost tables in the patch
> say that DUP is more expensive (rather than less expensive) than MOVI.
No we're not.  The front end has already pushed the constant into each
operation that needs it, which is the entire problem.  MOVI, as I mentioned
before, is the one case where this is a toss-up.  But there are far more
constants that cannot be created with a movi.  A simple example is:

#include <arm_neon.h>

int8x16_t square(int8x16_t full, int8x8_t small)
{
    int8x16_t cst = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,15};
    int8x8_t low = vget_high_s8 (cst);
    int8x8_t res1 = vmul_s8 (small, low);
    return vaddq_s8 (vmulq_s8 (full, cst), vcombine_s8 (res1, res1));
}

where in gimple we get:

  <bb 2> [local count: 1073741824]:
  _2 = __builtin_aarch64_get_highv16qi ({ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 0 });
  _4 = _2 * small_3(D);
  _6 = full_5(D) * { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 0 };
  _7 = __builtin_aarch64_combinev8qi (_4, _4);
  _8 = _6 + _7;
  return _8;

Regardless of what happens to __builtin_aarch64_get_highv16qi, nothing will
recreate the relationship with cst: whether __builtin_aarch64_get_highv16qi
is lowered or not, constant prop will still push the constants in.  This
results in us rematerializing the constant twice:

square:
        adrp    x0, .LC0
        ldr     d2, [x0, #:lo12:.LC0]
        adrp    x0, .LC1
        ldr     q3, [x0, #:lo12:.LC1]
        mul     v1.8b, v2.8b, v1.8b
        dup     d2, v1.d[0]
        ins     v2.d[1], v1.d[0]
        mla     v2.16b, v0.16b, v3.16b
        mov     v0.16b, v2.16b
        ret

.LC1:
        .byte   0
        .byte   1
        .byte   2
        .byte   3
        .byte   4
        .byte   5
        .byte   6
        .byte   7
        .byte   8
        .byte   9
        .byte   10
        .byte   11
        .byte   12
        .byte   13
        .byte   15
        .byte   0

Whether or not the high part is pushed into a high operation, this codegen
is still far more expensive.

>
> Also, if I've understood correctly, it looks like we'd be relying on the
> vget_high of a constant remaining unfolded until RTL cse1.
> I think it's likely in future that we'd try to fold vget_high at the
> gimple level instead, since that could expose more optimisations of a
> different kind.
> The gimple optimisers would then fold vget_high(constant) in a similar
> way to cse1 does now.
>
> So perhaps we should continue to allow the vget_high(constant) to be
> folded in cse1 and come up with some way of coping with the folded form.

cse1 doesn't fold it, because for CSE the cost of doing so is too high,
which is what this costing was attempting to fix.  CSE simply does not
touch it.  It leaves it as:

(insn 11 10 12 2 (set (reg:V16QI 95 [ _7 ])
        (vec_concat:V16QI (vec_select:V8QI (reg:V16QI 95 [ _7 ])
                (parallel:V16QI [
                        (const_int 0 [0])
                        (const_int 1 [0x1])
                        (const_int 2 [0x2])
                        (const_int 3 [0x3])
                        (const_int 4 [0x4])
                        (const_int 5 [0x5])
                        (const_int 6 [0x6])
                        (const_int 7 [0x7])
                    ]))
            (reg:V8QI 93 [ _4 ]))) "":6506:10 1908 {aarch64_simd_move_hi_quad_v16qi}
     (nil))

(insn 12 11 13 2 (set (reg:V16QI 102)
        (const_vector:V16QI [
                (const_int 0 [0])
                (const_int 1 [0x1])
                (const_int 2 [0x2])
                (const_int 3 [0x3])
                (const_int 4 [0x4])
                (const_int 5 [0x5])
                (const_int 6 [0x6])
                (const_int 7 [0x7])
                (const_int 8 [0x8])
                (const_int 9 [0x9])
                (const_int 10 [0xa])
                (const_int 11 [0xb])
                (const_int 12 [0xc])
                (const_int 13 [0xd])
                (const_int 15 [0xf])
                (const_int 0 [0])
            ])) "":1466:14 1166 {*aarch64_simd_movv16qi}
     (nil))

And I don't see any way to fix this without having gimple not push the
constants in, which would lead to worse regressions.

I can change the patch to cost the high as a dup, which fixes this codegen
at least and has you rematerialize the movi.  If that's not acceptable I
can drop the costing for the high part entirely; it's not the main thing
I am fixing.

Tamar

> Thanks,
> Richard