https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92280
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl.tools at gmail dot com,
                   |                            |law at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> Sergey, your testcase now fails again.  I think there are two changes
> occurring: first, we now vectorize the store to tmp[] from the first loop
> during basic-block vectorization as
>
>   _586 = {_148, _142, _145, _139, _54, _58, _56, _60};
>   _588 = {_211, _217, _214, _220, _292, _298, _295, _301};
>   MEM <vector(8) unsigned int> [(unsigned int *)&tmp] = _588;
>   MEM <vector(8) unsigned int> [(unsigned int *)&tmp + 32B] = _586;
>
> then we vectorize the second reduction loop after the fix for PR65930,
> which then allows us to elide 'tmp', still visible in GIMPLE as
>
>   vect__63.9_392 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp];
>   vect__64.12_388 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 16B];
>   vect__67.19_380 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 32B];
>   vect__68.22_376 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 48B];
>
> so the assembly has the unvectorized first loop and then those latter
> vectors built via two instances of
>
>         vmovd   %esi, %xmm3
>         vmovd   %esi, %xmm2
>         vmovd   %r11d, %xmm5
>         vmovd   %r15d, %xmm6
>         vpinsrd $1, %r13d, %xmm2, %xmm4
>         vpinsrd $1, %r14d, %xmm3, %xmm7
>         vpinsrd $1, %ebx, %xmm5, %xmm1
>         vpinsrd $1, %r9d, %xmm6, %xmm0
>         vpunpcklqdq     %xmm1, %xmm0, %xmm8
>         vpunpcklqdq     %xmm4, %xmm7, %xmm9
>         vinserti128     $0x1, %xmm9, %ymm8, %ymm10
>
> Note that for the combined fix of PR65930 I see a 7% performance
> improvement for 525.x264_r on Haswell.
>
> I think the original complaint in PR83008 was about vectorization of the
> first loop, which still does not happen, so the testcase needs adjustment?
>
> There are also still GIMPLE improvements possible in eliding 'tmp' before
> RTL expansion.

Which VN can already do - it's just that it doesn't have a global enough
view to decide profitability, so I chickened out of enabling that.
Enabling it generates _much_ better code though: 73 lines of assembly
compared to 264.  I guess it's as usual that RTL elimination of stack
accesses isn't very good.

That said, VN already computes the partial loads to { _148, _142, _145, _139 }
and would insert those CTORs in place of the loads, making the stores and
the AVX512 CTOR dead.  But that's obviously only profitable if the stores
and the CTOR do end up dead; otherwise we risk doing redundant vector
construction where cheap loads from memory would be possible.  The
alternative of expressing it via sub-vector extraction is similarly on the
boundary of profitability, plus we happily simplify that to a redundant
CTOR anyway.

We currently do not elide 'tmp' because it doesn't fit a single register,
so it becomes stack memory on RTL.  There CSE doesn't manage to simplify
this, and while combine manages to elide the loads, even postreload DSE
cannot elide the stack store for some reason, which looks odd to me.
Moving DSE2 up after combine helps a lot here.  Since combine can
eliminate loads, I would guess we definitely lack another DSE pass -
alternatively, moving DSE1 down after combine might be another option; I
guess it is where it is because unrolling can expose DSE/DCE
opportunities, though I don't see any CSE after unroll.
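For context, here is a minimal sketch of the kind of two-loop testcase
being discussed - the function name, the 4x4 shape and the
absolute-difference computation are illustrative assumptions, not the
actual testcase attached to PR83008 or this PR:

/* Illustrative sketch only: a first loop that stores element-wise
   results into a small temporary array, and a second loop that reduces
   over it.  The question in this PR is whether 'tmp' can be elided once
   both parts are vectorized.  */
unsigned int
sum_abs_diff (const unsigned char *a, const unsigned char *b)
{
  unsigned int tmp[4][4];
  unsigned int sum = 0;

  /* First loop: candidate for basic-block (SLP) vectorization of the
     stores to tmp[].  */
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      tmp[i][j] = a[i * 4 + j] > b[i * 4 + j]
                  ? a[i * 4 + j] - b[i * 4 + j]
                  : b[i * 4 + j] - a[i * 4 + j];

  /* Second loop: reduction over tmp[]; once this is vectorized the
     stores above (and ideally 'tmp' itself) can become dead.  */
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      sum += tmp[i][j];

  return sum;
}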
So shortest pass motion that helps this case:

Index: gcc/passes.def
===================================================================
--- gcc/passes.def      (revision 277608)
+++ gcc/passes.def      (working copy)
@@ -432,12 +432,12 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_web);
       NEXT_PASS (pass_rtl_cprop);
       NEXT_PASS (pass_cse2);
-      NEXT_PASS (pass_rtl_dse1);
       NEXT_PASS (pass_rtl_fwprop_addr);
       NEXT_PASS (pass_inc_dec);
       NEXT_PASS (pass_initialize_regs);
       NEXT_PASS (pass_ud_rtl_dce);
       NEXT_PASS (pass_combine);
+      NEXT_PASS (pass_rtl_dse1);
       NEXT_PASS (pass_if_after_combine);
       NEXT_PASS (pass_jump_after_combine);
       NEXT_PASS (pass_partition_blocks);
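As a rough way to check the effect of such a reordering (assuming a
compiler rebuilt with the change above), one could compare the RTL dumps,
e.g. something like

  gcc -O3 -march=haswell -S -fdump-rtl-combine -fdump-rtl-dse1 testcase.c

and verify in the dse1 dump that the stack stores to 'tmp' are eliminated
once combine has removed the corresponding loads; the exact flags and the
testcase name here are only illustrative.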