On Wed, Jul 12, 2023 at 12:23 PM Richard Sandiford <richard.sandif...@arm.com> wrote: > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > > On Mon, Jul 10, 2023 at 1:01 PM Uros Bizjak <ubiz...@gmail.com> wrote: > >> > >> On Mon, Jul 10, 2023 at 11:47 AM Richard Biener > >> <richard.guent...@gmail.com> wrote: > >> > > >> > On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak <ubiz...@gmail.com> wrote: > >> > > > >> > > On Mon, Jul 10, 2023 at 11:17 AM Richard Biener > >> > > <richard.guent...@gmail.com> wrote: > >> > > > > >> > > > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches > >> > > > <gcc-patches@gcc.gnu.org> wrote: > >> > > > > > >> > > > > As shown in the PR, simplify_gen_subreg call in > >> > > > > simplify_replace_fn_rtx: > >> > > > > > >> > > > > (gdb) list > >> > > > > 469 if (code == SUBREG) > >> > > > > 470 { > >> > > > > 471 op0 = simplify_replace_fn_rtx (SUBREG_REG (x), > >> > > > > old_rtx, fn, data); > >> > > > > 472 if (op0 == SUBREG_REG (x)) > >> > > > > 473 return x; > >> > > > > 474 op0 = simplify_gen_subreg (GET_MODE (x), op0, > >> > > > > 475 GET_MODE (SUBREG_REG > >> > > > > (x)), > >> > > > > 476 SUBREG_BYTE (x)); > >> > > > > 477 return op0 ? op0 : x; > >> > > > > 478 } > >> > > > > > >> > > > > simplifies with following arguments: > >> > > > > > >> > > > > (gdb) p debug_rtx (op0) > >> > > > > (const_vector:V4QI [ > >> > > > > (const_int -52 [0xffffffffffffffcc]) repeated x4 > >> > > > > ]) > >> > > > > (gdb) p debug_rtx (x) > >> > > > > (subreg:V16QI (reg:V4QI 98) 0) > >> > > > > > >> > > > > to: > >> > > > > > >> > > > > (gdb) p debug_rtx (op0) > >> > > > > (const_vector:V16QI [ > >> > > > > (const_int -52 [0xffffffffffffffcc]) repeated x16 > >> > > > > ]) > >> > > > > > >> > > > > This simplification is invalid, it is not possible to get > >> > > > > V16QImode vector > >> > > > > from V4QImode vector, even when all elements are duplicates. > >> > > >> > ^^^ > >> > > >> > I think this simplification is valid. A simplification to > >> > > >> > (const_vector:V16QI [ > >> > (const_int -52 [0xffffffffffffffcc]) repeated x4 > >> > (const_int 0 [0]) repeated x12 > >> > ]) > >> > > >> > would be valid as well. > >> > > >> > > > > The simplification happens in simplify_context::simplify_subreg: > >> > > > > > >> > > > > (gdb) list > >> > > > > 7558 if (VECTOR_MODE_P (outermode) > >> > > > > 7559 && GET_MODE_INNER (outermode) == GET_MODE_INNER > >> > > > > (innermode) > >> > > > > 7560 && vec_duplicate_p (op, &elt)) > >> > > > > 7561 return gen_vec_duplicate (outermode, elt); > >> > > > > > >> > > > > but the above simplification is valid only for non-paradoxical > >> > > > > registers, > >> > > > > where outermode <= innermode. We should not assume that elements > >> > > > > outside > >> > > > > the original register are valid, let alone all duplicates. > >> > > > > >> > > > Hmm, but looking at the audit trail the x86 backend expects them to > >> > > > be zero? > >> > > > Isn't that wrong as well? > >> > > > >> > > If you mean Comment #10, it is just an observation that > >> > > simplify_replace_rtx simplifies arguments from Comment #9 to: > >> > > > >> > > (gdb) p debug_rtx (src) > >> > > (const_vector:V8HI [ > >> > > (const_int 204 [0xcc]) repeated x4 > >> > > (const_int 0 [0]) repeated x4 > >> > > ]) > >> > > > >> > > instead of: > >> > > > >> > > (gdb) p debug_rtx (src) > >> > > (const_vector:V8HI [ > >> > > (const_int 204 [0xcc]) repeated x8 > >> > > ]) > >> > > > >> > > which is in line with the statement below. > >> > > > > >> > > > That is, I think putting any random value into the upper lanes when > >> > > > constant folding > >> > > > a paradoxical subreg sounds OK to me, no? > >> > > > >> > > The compiler is putting zero there as can be seen from the above new > >> > > RTX. > >> > > > >> > > > Of course we might choose to not do such constant propagation for > >> > > > efficiency reason - at least > >> > > > when the resulting CONST_* would require a larger constant pool entry > >> > > > or more costly > >> > > > construction. > >> > > > >> > > This is probably a follow-up improvement, where this patch tries to > >> > > fix a specific invalid simplification of simplify_replace_rtx that is > >> > > invalid universally. > >> > > >> > How so? What specifies the values of the paradoxical subreg for the > >> > bytes not covered by the subreg operand? > >> > >> I don't know why 0 is generated here (and if it is valid) for > >> paradoxical bytes, but 0xcc is not correct, since it sets REG_EQUAL to > >> the wrong constant and triggers unwanted propagation later on. > > > > Quoting what I wrote in the PR below. I think pragmatically the fix is > > good - we might miss some opportunistic folding this way but we for > > sure may not optimistically register an equality via REG_EQUAL without > > enforcing it (removing the producer and replacing it with the optimistic > > constant). > > > > So consider the patch approved if no other RTL maintainer chimes in > > within 48h. > > Sorry, can you hold off a bit longer? Wanted to have a look but the > deadline is about to expire.
No problem, I will wait for you. Thanks, Uros. > > I think at least a comment is needed, since like Richard says, > the transformation seems correct for paradoxical subregs, even if it > isn't wanted for other reasons. > > Thanks, > Richard > > > > > Thanks, > > Richard. > > > > > > I can see cprop1 adds the REG_EQUAL note: > > > > (insn 22 21 23 4 (set (reg:V8HI 100) > > (zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0) > > (parallel [ > > (const_int 0 [0]) > > (const_int 1 [0x1]) > > (const_int 2 [0x2]) > > (const_int 3 [0x3]) > > (const_int 4 [0x4]) > > (const_int 5 [0x5]) > > (const_int 6 [0x6]) > > (const_int 7 [0x7]) > > ])))) "t.c":12:42 7557 {sse4_1_zero_extendv8qiv8hi2} > > - (expr_list:REG_DEAD (reg:V4QI 98) > > - (nil))) > > + (expr_list:REG_EQUAL (const_vector:V8HI [ > > + (const_int 204 [0xcc]) repeated x8 > > + ]) > > + (expr_list:REG_DEAD (reg:V4QI 98) > > + (nil)))) > > > > but I don't see yet what the actual wrong transform based on this > > REG_EQUAL note is? > > > > It looks like we CSE the above with > > > > - 46: r122:V8QI=[`*.LC3'] > > - REG_EQUAL const_vector > > - 48: r125:V8HI=zero_extend(vec_select(r122:V8QI#0,parallel)) > > - REG_EQUAL const_vector > > - REG_DEAD r122:V8QI > > - 49: r126:V8HI=r124:V8HI*r125:V8HI > > - REG_DEAD r125:V8HI > > + 49: r126:V8HI=r124:V8HI*r100:V8HI > > > > but otherwise do nothing. So the issue is that we rely on the "undefined" > > vals to have a specific value (from the earlier REG_EQUAL note) but actual > > code generation doesn't ensure this (it doesn't need to). That said, > > the issue isn't the constant folding per-se but that we do not actually > > constant fold but register an equality that doesn't hold. > > > > > > > >> Uros.