https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92280

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sergey.shalnov at intel dot com

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Sergey, your testcase now fails again.  I think there are two changes occurring.
First, we now vectorize the store to tmp[] from the first loop during
basic-block vectorization as

  _586 = {_148, _142, _145, _139, _54, _58, _56, _60};
  _588 = {_211, _217, _214, _220, _292, _298, _295, _301};
  MEM <vector(8) unsigned int> [(unsigned int *)&tmp] = _588;
  MEM <vector(8) unsigned int> [(unsigned int *)&tmp + 32B] = _586;

Then, after the fix for PR65930, we vectorize the second (reduction) loop,
which allows us to elide 'tmp', although it is still visible in GIMPLE as

  vect__63.9_392 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp];
  vect__64.12_388 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 16B];
  vect__67.19_380 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 32B];
  vect__68.22_376 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 48B];

so the assembly has an unvectorized first loop, and those latter vectors are
then built by two instances of

        vmovd   %esi, %xmm3
        vmovd   %esi, %xmm2
        vmovd   %r11d, %xmm5
        vmovd   %r15d, %xmm6
        vpinsrd $1, %r13d, %xmm2, %xmm4
        vpinsrd $1, %r14d, %xmm3, %xmm7
        vpinsrd $1, %ebx, %xmm5, %xmm1
        vpinsrd $1, %r9d, %xmm6, %xmm0
        vpunpcklqdq     %xmm1, %xmm0, %xmm8
        vpunpcklqdq     %xmm4, %xmm7, %xmm9
        vinserti128     $0x1, %xmm9, %ymm8, %ymm10
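
For readers less used to the dumps: a GIMPLE constructor such as `_586 = {...}`
corresponds roughly to building a vector from eight scalars.  A minimal
source-level sketch, assuming GCC's generic vector extension (the names `v8su`
and `store_row` are hypothetical and not taken from the testcase), could look
like this.  Whether it expands to the vmovd/vpinsrd/vpunpcklqdq/vinserti128
sequence depends on the target flags and the compiler version.

```c
#include <string.h>

/* Hypothetical 8 x unsigned int vector type (GCC vector extension).  */
typedef unsigned int v8su __attribute__ ((vector_size (32)));

/* Build an 8-lane vector from scalars and store it to tmp[0..7].
   This mirrors the shape of "_586 = {...}; MEM[&tmp] = _586;".  */
void
store_row (unsigned int *tmp,
           unsigned int s0, unsigned int s1, unsigned int s2, unsigned int s3,
           unsigned int s4, unsigned int s5, unsigned int s6, unsigned int s7)
{
  v8su v = { s0, s1, s2, s3, s4, s5, s6, s7 };
  memcpy (tmp, &v, sizeof v);   /* unaligned-safe vector store */
}
```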

Note that for the combined fix of PR65930 I see a 7% performance improvement
for 525.x264_r on Haswell.

I think the original complaint in PR83008 was about vectorization of the first
loop, which still does not happen, so the testcase needs adjustment?
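
For orientation, the two-loop shape under discussion can be sketched as below.
This is only an illustrative kernel under my own assumptions (the function
name `two_loop_sum` and the arithmetic are made up), not the PR83008 testcase:
a first loop that stores butterfly-style results to a local tmp[], followed by
a second loop that reduces over tmp[].

```c
/* Hypothetical kernel, for illustration only.  */
unsigned int
two_loop_sum (const unsigned int *a, const unsigned int *b)
{
  unsigned int tmp[16];
  unsigned int sum = 0;

  /* First loop: adds/subs per row, results stored to tmp[].  */
  for (int i = 0; i < 4; i++)
    {
      unsigned int s0 = a[4 * i + 0] + b[4 * i + 0];
      unsigned int s1 = a[4 * i + 1] + b[4 * i + 1];
      unsigned int d0 = a[4 * i + 2] - b[4 * i + 2];
      unsigned int d1 = a[4 * i + 3] - b[4 * i + 3];
      tmp[4 * i + 0] = s0 + s1;
      tmp[4 * i + 1] = s0 - s1;
      tmp[4 * i + 2] = d0 + d1;
      tmp[4 * i + 3] = d0 - d1;
    }

  /* Second loop: reduction over tmp[].  */
  for (int i = 0; i < 16; i++)
    sum += tmp[i];

  return sum;
}
```

Compiling something of this shape with -O3 -fdump-tree-slp-details
-fdump-tree-vect-details should show which of the two loops each pass picks up.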

There are also still GIMPLE improvements possible in eliding 'tmp' before
RTL expansion.
