Hi, The below (slightly) reduced test-case from gcc.c-torture/execute/simd-2.c produces an execution failure for avr.
typedef short __attribute__((vector_size (16))) vecint; vecint i = { 150, 100, 150, 200, 0, 0, 0, 0 }; vecint j = { 10, 13, 20, 30, 1, 1, 1, 1 }; vecint k; union { vecint v; short i[8]; } res; void verify (int a1, int a2, int a3, int a4, int b1, int b2, int b3, int b4) { if (a1 != b1 || a2 != b2 || a3 != b3 || a4 != b4) abort (); } int main () { k = i + j; res.v = k; verify (res.i[0], res.i[1], res.i[2], res.i[3], 160, 113, 170, 230); k = i * j; res.v = k; verify (res.i[0], res.i[1], res.i[2], res.i[3], 1500, 1300, 3000, 6000); k = i / j; res.v = k; verify (res.i[0], res.i[1], res.i[2], res.i[3], 15, 7, 7, 6); k = i & j; res.v = k; verify (res.i[0], res.i[1], res.i[2], res.i[3], 2, 4, 20, 8); k = i | j; res.v = k; verify (res.i[0], res.i[1], res.i[2], res.i[3], 158, 109, 150, 222); k = i ^ j; res.v = k; verify (res.i[0], res.i[1], res.i[2], res.i[3], 156, 105, 130, 214); k = -i; res.v = k; verify (res.i[0], res.i[1], res.i[2], res.i[3], -150, -100, -150, -200); exit (0); } To reproduce, enable LRA in avr.c by removing TARGET_LRA_P hook, build with make all-host && make install-host, and compile $ avr-gcc -mmcu=avr51 ~/code/personal/gcc/gcc/testsuite/gcc.c-torture/execute/simd-2.c -Os -S -o out.s -fdump-rtl-all-raw The pass before dse2 has the below instructions to store the lowest two bytes of i (i.e 150) at stack offsets 23 and 21. (insn 2403 2402 2404 2 (parallel [ (set (reg:QI 24 r24 [orig:102 _131 ] [102]) (mem/c:QI (const:HI (plus:HI (symbol_ref:HI ("i") [flags 0x2] <var_decl 0x7ff45a778090 i>) (const_int 1 [0x1]))) [1 i+1 S1 A8])) (clobber (reg:CC 36 cc)) ]) 89 {movqi_insn} (nil)) (insn 2404 2403 2405 2 (parallel [ (set (mem/c:QI (plus:HI (reg/f:HI 28 r28) (const_int 23 [0x17])) [4 %sfp+23 S1 A8]) (reg:QI 24 r24 [orig:102 _131 ] [102])) (clobber (reg:CC 36 cc)) ]) 89 {movqi_insn} (nil)) (insn 2405 2404 2406 2 (parallel [ (set (reg:QI 24 r24 [orig:99 _128 ] [99]) (mem/c:QI (symbol_ref:HI ("i") [flags 0x2] <var_decl 0x7ff45a778090 i>) [1 i+0 S1 A128])) (clobber (reg:CC 36 cc)) ]) 89 {movqi_insn} (nil)) (insn 2406 2405 2407 2 (parallel [ (set (mem/c:QI (plus:HI (reg/f:HI 28 r28) (const_int 21 [0x15])) [4 %sfp+21 S1 A8]) (reg:QI 24 r24 [orig:99 _128 ] [99])) (clobber (reg:CC 36 cc)) ]) 89 {movqi_insn} (nil)) dse2 deletes these insns, as it sees the below clobber before the values in the stack offsets are used. (insn 448 2083 2084 15 (clobber (mem/c:TI (plus:HI (reg/f:HI 28 r28) (const_int 9 [0x9])) [4 %sfp+9 S16 A8])) -1 (nil)) DSE figures everything from stack offset 9 to 9 + 16 = 25 (16 is GET_MODE_SIZE (TImode)) is clobbered, so it deletes the dead stores. The reason for the dead stores is overlapping stack offsets for two spilled pseudos. IRA has <snip> (insn 34 32 36 2 (set (reg:QI 102 [ _131 ]) (mem/c:QI (const:HI (plus:HI (symbol_ref:HI ("i") [flags 0x2] <var_decl 0x7ff45a778090 i>) (const_int 1 [0x1]))) [1 i+1 S1 A8])) 86 {movqi_insn_split} (nil)) (insn 36 34 38 2 (set (reg:QI 99 [ _128 ]) (mem/c:QI (symbol_ref:HI ("i") [flags 0x2] <var_decl 0x7ff45a778090 i>) [1 i+0 S1 A128])) 86 {movqi_insn_split} (nil)) <snip> (insn 448 430 432 15 (clobber (reg:TI 147 [ _176 ])) -1 (nil)) and LRA spills 102 and 99 to distinct stack slots, like so Slot 10 regnos (width = 0): 99 Slot 12 regnos (width = 0): 102 However, when spilling 147, Breakpoint 1, add_pseudo_to_slot (regno=147, slot_num=4) at /home/i41766/code/personal/gcc/gcc/lra-spills.cc:331 331 lra_reg_info[regno].biggest_mode); (gdb) p slots[slot_num] $1 = {regno = 51, hard_regno = -1, align = 8, size = {<poly_int_pod<1, long>> = {coeffs = {2}}, <No data fields>}, mem = 0x7fffea5eb1c8, live_ranges = 0x33b29c0} (gdb) p debug_rtx(slots[slot_num].mem) (mem/c:BLK (plus:HI (reg/f:HI 28 r28) (const_int 9 [0x9])) [0 A8]) it reuses slot 4, originally 2 bytes wide and with stack offset 9. Live ranges for r51 and r147 do not conflict, so this ok, and it resizes the stack slot to be as wide as TImode. However, it does not reset slots[slot_num].mem, and assign_mem_slot when called for r147 sets the same stack offset rtx as pseudo_slots[147].mem. This appears wrong to me, as the bigger slot may conflict with stack offsets assigned to other slots. The below patch resets slots[slot_num].mem if the resized slot is bigger than before. assign_mem_slot later gets a new stack offset from function.cc:assign_stack_local and sets it as the slot's MEM rtx. diff --git a/gcc/lra-spills.cc b/gcc/lra-spills.cc index a663a1931e3..82d5fbef66a 100644 --- a/gcc/lra-spills.cc +++ b/gcc/lra-spills.cc @@ -331,9 +331,14 @@ add_pseudo_to_slot (int regno, int slot_num) lra_reg_info[regno].biggest_mode); unsigned int align = spill_slot_alignment (mode); slots[slot_num].align = MAX (slots[slot_num].align, align); - slots[slot_num].size = upper_bound (slots[slot_num].size, + + poly_int64 origsize = slots[slot_num].size; + slots[slot_num].size = upper_bound (origsize, GET_MODE_SIZE (mode)); + if (maybe_gt(slots[slot_num].size, origsize)) + slots[slot_num].mem = NULL_RTX; + if (slots[slot_num].regno < 0) { /* It is the first pseudo in the slot. */ After the patch, the slot for r147 gets a different non-overlapping stack offset, and DSE no longer removes the storing of the lowest 2 bytes of i. (insn 448 430 1436 43 (clobber (mem/c:TI (plus:HI (reg/f:HI 28 r28) (const_int 55 [0x37])) [4 %sfp+55 S16 A8])) -1 (nil)) However, I'm not sure what should be done for the other pseudo(s) sharing the slot (r51 in this example). Their pseudo_slot.mem will still be the original stack offset rtx. It appears to work ok for this test case, but I'm not sure if that is right thing to do. Is my understanding correct, and is there a better way to fix this problem? Regards Senthil