https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069
--- Comment #19 from Xionghu Luo (luoxhu at gcc dot gnu.org) <yinyuefengyi at gmail dot com> ---
(In reply to Xionghu Luo (luo...@gcc.gnu.org) from comment #15)
> In combine: vec_select(vec_concat and the followed vec_select are combined
> to a single extract instruction, which seems reasonable for both LE and BE?
>
> R146:  0 1 2 3
> R141:  4 5 6 7
> R150:  2 6 3 7   // vec_select(vec_concat(r146:V4SI,r141:V4SI),[2 6 3 7])
> R151:  R150[3]   // vec_select(r150:V4SI,3)
>
> =>
>
> R151:  R141[3]   // vec_select(r141:V4SI,3)
>
>
>
> Trying 21 -> 24:
>    21: r150:V4SI=vec_select(vec_concat(r146:V4SI,r141:V4SI),parallel)
>       REG_DEAD r146:V4SI
>       REG_DEAD r141:V4SI
>    24: {r151:SI=vec_select(r150:V4SI,parallel);clobber scratch;}
> Failed to match this instruction:
> (parallel [
>         (set (reg:SI 151)
>             (vec_select:SI (reg:V4SI 141)
>                 (parallel [
>                         (const_int 3 [0x3])
>                     ])))
>         (clobber (scratch:SI))
>         (set (reg:V4SI 150)
>             (vec_select:V4SI (vec_concat:V8SI (reg:V4SI 146)
>                     (reg:V4SI 141))
>                 (parallel [
>                         (const_int 2 [0x2])
>                         (const_int 6 [0x6])
>                         (const_int 3 [0x3])
>                         (const_int 7 [0x7])
>                     ])))
>     ])
> Failed to match this instruction:
> (parallel [
>         (set (reg:SI 151)
>             (vec_select:SI (reg:V4SI 141)
>                 (parallel [
>                         (const_int 3 [0x3])
>                     ])))
>         (set (reg:V4SI 150)
>             (vec_select:V4SI (vec_concat:V8SI (reg:V4SI 146)
>                     (reg:V4SI 141))
>                 (parallel [
>                         (const_int 2 [0x2])
>                         (const_int 6 [0x6])
>                         (const_int 3 [0x3])
>                         (const_int 7 [0x7])
>                     ])))
>     ])
> Successfully matched this instruction:
> (set (reg:V4SI 150)
>     (vec_select:V4SI (vec_concat:V8SI (reg:V4SI 146)
>             (reg:V4SI 141))
>         (parallel [
>                 (const_int 2 [0x2])
>                 (const_int 6 [0x6])
>                 (const_int 3 [0x3])
>                 (const_int 7 [0x7])
>             ])))
> Successfully matched this instruction:
> (set (reg:SI 151)
>     (vec_select:SI (reg:V4SI 141)
>         (parallel [
>                 (const_int 3 [0x3])
>             ])))
> allowing combination of insns 21 and 24
> original costs 4 + 4 = 8
> replacement costs 4 + 4 = 8
> modifying insn i2    21:
> r150:V4SI=vec_select(vec_concat(r146:V4SI,r141:V4SI),parallel)
>       REG_DEAD r146:V4SI
> deferring rescan insn with uid = 21.
> modifying insn i3    24: {r151:SI=vec_select(r141:V4SI,parallel);clobber
> scratch;}
>       REG_DEAD r141:V4SI
> deferring rescan insn with uid = 24.
>
>
> I guess the previous unspec implementation bypassed the LE + LE swap check,
> so now in split2, we should generate vextuwlx instead of vextuwrx on little
> endian?

This nested vec_select+vec_select+vec_concat optimization was introduced by
Uros in simplify-rtx.c for PR32661. Unfortunately it only works for Power BE
platforms. Disabling that piece of code could work, since the nested
vec_selects would then no longer be combined...

For Power LE, firstly:

Trying 21 -> 24:

R146:  3 2 1 0
R141:  7 6 5 4
R150:  7 3 6 2   // vec_select(vec_concat(r146:V4SI,r141:V4SI),[2 6 3 7])
R151:  R150[3]   // vec_select(r150:V4SI,3)

=> currently:

R151:  R141[3]   // vec_select(r141:V4SI,3)

But it should be:

R151:  R146[3]   // vec_select(r146:V4SI,3)

Which means the current result:

R151:  R150[3]   R141[3]
R153:  R150[2]   R146[3]
R155:  R150[1]   R141[2]
R157:  R150[0]   R146[2]

should, after the first nested vec_select optimization, become:

R151:  R150[3]   R146[3]
R153:  R150[2]   R141[3]
R155:  R150[1]   R146[2]
R157:  R150[0]   R141[2]

A little-endian check plus a swap (of op00 and op01) could achieve this
result.

But secondly, there is another "nested vec_select" optimization which
produces R151=R165[3]:

Trying 21 -> 26:
...
R146  R165 R163  [7 3 6 2]
R151: R146[3]  => R165[3]   (this is wrong!)

R162, R163, R164 and R165 hold the input values R0, R1, R2 and R3. The
vsx_extract_v4si_di_p9 index should be "0" instead of "3".
The correct result should be:

R151:  R165[0]
R153:  R164[0]
R155:  R163[0]
R157:  R162[0]

(insn 44 2 4 2 (set (reg:V4SI 162)
        (reg:V4SI 66 2 [ R0 ])) "q.C":36:1 1157 {vsx_movv4si_64bit}
     (expr_list:REG_DEAD (reg:V4SI 66 2 [ R0 ])
        (nil)))
(note 4 44 45 2 NOTE_INSN_DELETED)
(insn 45 4 5 2 (set (reg:V4SI 163)
        (reg:V4SI 67 3 [ R1 ])) "q.C":36:1 1157 {vsx_movv4si_64bit}
     (expr_list:REG_DEAD (reg:V4SI 67 3 [ R1 ])
        (nil)))
(note 5 45 46 2 NOTE_INSN_DELETED)
(insn 46 5 6 2 (set (reg:V4SI 164)
        (reg:V4SI 68 4 [ R2 ])) "q.C":36:1 1157 {vsx_movv4si_64bit}
     (expr_list:REG_DEAD (reg:V4SI 68 4 [ R2 ])
        (nil)))
(note 6 46 47 2 NOTE_INSN_DELETED)
(insn 47 6 7 2 (set (reg:V4SI 165)
        (reg:V4SI 69 5 [ R3 ])) "q.C":36:1 1157 {vsx_movv4si_64bit}
     (expr_list:REG_DEAD (reg:V4SI 69 5 [ R3 ])
        (nil)))
...
(insn 33 32 34 2 (parallel [
            (set (reg:DI 7 7)
                (zero_extend:DI (vec_select:SI (reg:V4SI 162)
                        (parallel [
                                (const_int 3 [0x3])
                            ]))))
            (clobber (scratch:SI))
        ]) "q.C":28:10 1396 {*vsx_extract_v4si_di_p9}
     (expr_list:REG_DEAD (reg:V4SI 162)
        (nil)))
(insn 34 33 35 2 (parallel [
            (set (reg:DI 6 6)
                (zero_extend:DI (vec_select:SI (reg:V4SI 163)
                        (parallel [
                                (const_int 3 [0x3])
                            ]))))
            (clobber (scratch:SI))
        ]) "q.C":28:10 1396 {*vsx_extract_v4si_di_p9}
     (expr_list:REG_DEAD (reg:V4SI 163)
        (nil)))
(insn 35 34 36 2 (parallel [
            (set (reg:DI 5 5)
                (zero_extend:DI (vec_select:SI (reg:V4SI 164)
                        (parallel [
                                (const_int 3 [0x3])
                            ]))))
            (clobber (scratch:SI))
        ]) "q.C":28:10 1396 {*vsx_extract_v4si_di_p9}
     (expr_list:REG_DEAD (reg:V4SI 164)
        (nil)))
(insn 36 35 37 2 (parallel [
            (set (reg:DI 4 4)
                (zero_extend:DI (vec_select:SI (reg:V4SI 165)
                        (parallel [
                                (const_int 3 [0x3])
                            ]))))
            (clobber (scratch:SI))
        ]) "q.C":28:10 1396 {*vsx_extract_v4si_di_p9}
     (expr_list:REG_DEAD (reg:V4SI 165)
        (nil)))

But it is not easy to change the index again...

Is the analysis reasonable? @Segher.
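
To make the lane bookkeeping of the first fold (21 -> 24) easier to check, here is a minimal Python sketch of it. It is only a model of the index arithmetic in this comment, not GCC code: the function name `fold_nested_select` and the `swap_operands` flag are hypothetical, and the flag merely stands in for the "swap op00 and op01" idea above. It does not model the second fold (21 -> 26).

```python
# Hypothetical model of the index arithmetic discussed in this comment.
# Nothing here is a real simplify-rtx.c interface.

def fold_nested_select(outer_lane, inner_sel, operands, lanes=4,
                       swap_operands=False):
    """Fold vec_select(vec_select(vec_concat(op0, op1), inner_sel),
    [outer_lane]) into a single (operand, lane) pair.

    swap_operands is an illustrative flag modelling the proposed
    little-endian swap of the two vec_concat operands.
    """
    op0, op1 = operands
    if swap_operands:
        op0, op1 = op1, op0
    k = inner_sel[outer_lane]        # index into the 2*lanes-wide vec_concat
    return (op0, k) if k < lanes else (op1, k - lanes)


INNER_SEL = [2, 6, 3, 7]             # the parallel of insn 21
DESTS = {"R151": 3, "R153": 2, "R155": 1, "R157": 0}   # R15x = R150[lane]

# What the fold produces today, per the tables above.
current = {dest: fold_nested_select(lane, INNER_SEL, ("R146", "R141"))
           for dest, lane in DESTS.items()}

# What the comment argues it should produce on little endian.
proposed = {dest: fold_nested_select(lane, INNER_SEL, ("R146", "R141"),
                                     swap_operands=True)
            for dest, lane in DESTS.items()}

for dest in DESTS:
    print(dest, "current:", current[dest], "proposed (LE):", proposed[dest])
```

Under these assumptions the `current` mapping reproduces the first table (R151 from R141[3], R153 from R146[3], and so on), and `proposed` reproduces the second one.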