https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115777
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
           Blocks  |        |53947
               CC  |        |rguenth at gcc dot gnu.org
           Target  |X86_64  |x86_64-*-*

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
It's the usual issue of a very high cost on (scalar) load and store and, in this case, a very low cost on vector extract - which we even over-cost by a factor of two because of the "duplicate" live stmt:

t.c:12:4: note: Costing subgraph:
t.c:12:4: note: node 0x51657c0 (max_nunits=2, refcnt=1) vector(2) unsigned int
t.c:12:4: note: op template: *_2 = _1;
t.c:12:4: note:     stmt 0 *_2 = _1;
t.c:12:4: note:     stmt 1 *_4 = _3;
t.c:12:4: note:     children 0x51658e0
t.c:12:4: note: node 0x51658e0 (max_nunits=1, refcnt=2) vector(2) unsigned int
t.c:12:4: note: op: VEC_PERM_EXPR
t.c:12:4: note:     [l] stmt 0 _1 = *_4;
t.c:12:4: note:     [l] stmt 1 _3 = *_2;
t.c:12:4: note:     lane permutation { 0[1] 0[0] }
t.c:12:4: note:     children 0x5165850
t.c:12:4: note: node 0x5165850 (max_nunits=2, refcnt=1) vector(2) unsigned int
t.c:12:4: note: op template: _1 = *_4;
t.c:12:4: note:     [l] stmt 0 _3 = *_2;
t.c:12:4: note:     [l] stmt 1 _1 = *_4;

Too bad that we end up using the lane extracted after the permute (which gives us higher latency). As with other BB vectorization opportunities, it is difficult to estimate the cost of tying multiple independent data dependencies into a single vector one and to weigh that against independent out-of-order execution.

To sum up, on the vectorizer side there is a bug costing the vector side too much for the lane extracts (which makes the target cost issue even more pronounced once fixed). There are two problems with the vector code:

.L7:
        subq    $4, %rax
.L3:
        vmovq   (%rax), %xmm0
        vmovd   %xmm0, %edx
        vpextrd $1, %xmm0, %ecx
        cmpl    %edx, %ecx
        jnb     .L6
        vpshufd $225, %xmm0, %xmm0
        vmovq   %xmm0, (%rax)
        cmpq    %rdi, %rax
        jne     .L7

On zen4 the moves from vector to GPR are expensive.
But the most apparent issue is the load-to-store forwarding conflict: we store 8 bytes to (%rax) and immediately load 8 bytes from 4(%rax) in the next iteration. That is something the BB vectorizer does not consider at all (in theory it could look at the data-ref evolution to anticipate this). So trying to attack this from the cost-modeling side is unlikely to cover this aspect of the issue.

Placing #pragma GCC novector before the inner loop is a workaround that works in GCC 14 and up.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations