https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92280
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl.tools at gmail dot com,
                   |                            |law at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> Sergey, your testcase now fails again.  I think there are two changes
> occurring: first, we now vectorize the store to tmp[] from the first loop
> during basic-block vectorization as
>
>   _586 = {_148, _142, _145, _139, _54, _58, _56, _60};
>   _588 = {_211, _217, _214, _220, _292, _298, _295, _301};
>   MEM <vector(8) unsigned int> [(unsigned int *)&tmp] = _588;
>   MEM <vector(8) unsigned int> [(unsigned int *)&tmp + 32B] = _586;
>
> then we vectorize the second reduction loop after the fix for PR65930,
> which then allows us to elide 'tmp', still visible in GIMPLE as
>
>   vect__63.9_392 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp];
>   vect__64.12_388 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 16B];
>   vect__67.19_380 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 32B];
>   vect__68.22_376 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 48B];
>
> so the assembly has the unvectorized first loop and then those latter
> vectors built via two instances of
>
>         vmovd   %esi, %xmm3
>         vmovd   %esi, %xmm2
>         vmovd   %r11d, %xmm5
>         vmovd   %r15d, %xmm6
>         vpinsrd $1, %r13d, %xmm2, %xmm4
>         vpinsrd $1, %r14d, %xmm3, %xmm7
>         vpinsrd $1, %ebx, %xmm5, %xmm1
>         vpinsrd $1, %r9d, %xmm6, %xmm0
>         vpunpcklqdq     %xmm1, %xmm0, %xmm8
>         vpunpcklqdq     %xmm4, %xmm7, %xmm9
>         vinserti128     $0x1, %xmm9, %ymm8, %ymm10
>
> Note that for the combined fix of PR65930 I see a 7% performance
> improvement for 525.x264_r on Haswell.
>
> I think the original complaint in PR83008 was about vectorization of the
> first loop, which still does not happen, so the testcase needs adjustment?
>
> There are also still GIMPLE improvements possible in eliding 'tmp' before
> RTL expansion.

Which VN can already do - it's just that it doesn't have a global enough
view to decide profitability, so I chickened out of enabling that.
Enabling it generates _much_ better code though: 73 lines of assembly
compared to 264.  I guess it's as usual that RTL elimination of stack
accesses isn't very good.

That said, VN already computes the partial loads to { _148, _142, _145, _139 }
and would insert those CTORs in place of the loads, making the stores and
the AVX512 CTOR dead.  But that's obviously only profitable if the stores
and the CTOR do end up dead; otherwise we risk doing redundant vector
construction where cheap loads from memory would be possible.  The
alternative of expressing it via sub-vector extraction is similarly on the
boundary of profitability, plus we happily simplify that to a redundant
CTOR anyway.

We currently do not elide 'tmp' because it doesn't fit a single register,
so it becomes stack memory on RTL.  There CSE doesn't manage to simplify
this, and while combine manages to elide the loads, even postreload DSE
cannot elide the stack store for some reason, which looks odd to me.
Moving DSE2 up after combine helps a lot here.  Since combine can
eliminate loads, I would guess we definitely lack another DSE pass -
alternatively, moving DSE1 down after combine might be another option; I
guess it is where it is because unrolling can expose DSE/DCE
opportunities, though I don't see any CSE after unroll.
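For context, here is a minimal sketch of the kind of two-loop testcase
being discussed - the function name, the 4x4 shape and the
absolute-difference computation are illustrative assumptions, not the
actual testcase attached to PR83008 or this PR:

/* Illustrative sketch only: a first loop that stores element-wise
   results into a small temporary array, and a second loop that reduces
   over it.  The question in this PR is whether 'tmp' can be elided once
   both parts are vectorized.  */
unsigned int
sum_abs_diff (const unsigned char *a, const unsigned char *b)
{
  unsigned int tmp[4][4];
  unsigned int sum = 0;

  /* First loop: candidate for basic-block (SLP) vectorization of the
     stores to tmp[].  */
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      tmp[i][j] = a[i * 4 + j] > b[i * 4 + j]
                  ? a[i * 4 + j] - b[i * 4 + j]
                  : b[i * 4 + j] - a[i * 4 + j];

  /* Second loop: reduction over tmp[]; once this is vectorized the
     stores above (and ideally 'tmp' itself) can become dead.  */
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      sum += tmp[i][j];

  return sum;
}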
So shortest pass motion that helps this case:

Index: gcc/passes.def
===================================================================
--- gcc/passes.def      (revision 277608)
+++ gcc/passes.def      (working copy)
@@ -432,12 +432,12 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_web);
       NEXT_PASS (pass_rtl_cprop);
       NEXT_PASS (pass_cse2);
-      NEXT_PASS (pass_rtl_dse1);
       NEXT_PASS (pass_rtl_fwprop_addr);
       NEXT_PASS (pass_inc_dec);
       NEXT_PASS (pass_initialize_regs);
       NEXT_PASS (pass_ud_rtl_dce);
       NEXT_PASS (pass_combine);
+      NEXT_PASS (pass_rtl_dse1);
       NEXT_PASS (pass_if_after_combine);
       NEXT_PASS (pass_jump_after_combine);
       NEXT_PASS (pass_partition_blocks);
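As a rough way to check the effect of such a reordering (assuming a
compiler rebuilt with the change above), one could compare the RTL dumps,
e.g. something like

  gcc -O3 -march=haswell -S -fdump-rtl-combine -fdump-rtl-dse1 testcase.c

and verify in the dse1 dump that the stack stores to 'tmp' are eliminated
once combine has removed the corresponding loads; the exact flags and the
testcase name here are only illustrative.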