https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92280
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What            |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sergey.shalnov at intel dot com

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Sergey, your testcase now fails again.  I think there are two changes
occurring.  First, we now vectorize the store to tmp[] from the first loop
during basic-block vectorization as

  _586 = {_148, _142, _145, _139, _54, _58, _56, _60};
  _588 = {_211, _217, _214, _220, _292, _298, _295, _301};
  MEM <vector(8) unsigned int> [(unsigned int *)&tmp] = _588;
  MEM <vector(8) unsigned int> [(unsigned int *)&tmp + 32B] = _586;

Then we vectorize the second reduction loop after the fix for PR65930,
which allows us to elide 'tmp', still visible in GIMPLE as

  vect__63.9_392 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp];
  vect__64.12_388 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 16B];
  vect__67.19_380 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 32B];
  vect__68.22_376 = MEM <vector(4) unsigned int> [(unsigned int *)&tmp + 48B];

So the assembly has an unvectorized first loop, and those latter vectors are
then built via two instances of

  vmovd   %esi, %xmm3
  vmovd   %esi, %xmm2
  vmovd   %r11d, %xmm5
  vmovd   %r15d, %xmm6
  vpinsrd $1, %r13d, %xmm2, %xmm4
  vpinsrd $1, %r14d, %xmm3, %xmm7
  vpinsrd $1, %ebx, %xmm5, %xmm1
  vpinsrd $1, %r9d, %xmm6, %xmm0
  vpunpcklqdq     %xmm1, %xmm0, %xmm8
  vpunpcklqdq     %xmm4, %xmm7, %xmm9
  vinserti128     $0x1, %xmm9, %ymm8, %ymm10

Note that for the combined fix of PR65930 I see a 7% performance improvement
for 525.x264_r on Haswell.

I think the original complaint in PR83008 was about vectorization of the
first loop, which still does not happen, so the testcase needs adjustment?
There are also still GIMPLE improvements possible in eliding 'tmp' before
RTL expansion.