https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115777
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
           Blocks  |        |53947
               CC  |        |rguenth at gcc dot gnu.org
           Target  |X86_64  |x86_64-*-*

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
It's the usual issue of a very high cost on (scalar) load and store and, in this case, a very low cost on vector extract - which we even over-cost by a factor of two because of the "duplicate" live stmt:

t.c:12:4: note: Costing subgraph:
t.c:12:4: note: node 0x51657c0 (max_nunits=2, refcnt=1) vector(2) unsigned int
t.c:12:4: note: op template: *_2 = _1;
t.c:12:4: note:     stmt 0 *_2 = _1;
t.c:12:4: note:     stmt 1 *_4 = _3;
t.c:12:4: note:     children 0x51658e0
t.c:12:4: note: node 0x51658e0 (max_nunits=1, refcnt=2) vector(2) unsigned int
t.c:12:4: note: op: VEC_PERM_EXPR
t.c:12:4: note:     [l] stmt 0 _1 = *_4;
t.c:12:4: note:     [l] stmt 1 _3 = *_2;
t.c:12:4: note:     lane permutation { 0[1] 0[0] }
t.c:12:4: note:     children 0x5165850
t.c:12:4: note: node 0x5165850 (max_nunits=2, refcnt=1) vector(2) unsigned int
t.c:12:4: note: op template: _1 = *_4;
t.c:12:4: note:     [l] stmt 0 _3 = *_2;
t.c:12:4: note:     [l] stmt 1 _1 = *_4;

Too bad that we end up using the lane extracted after the permute (which gives us higher latency). As with other BB vectorization opportunities, it is difficult to estimate the cost of tying multiple independent data dependencies into a single vector one and to weigh that against independent out-of-order execution.

To sum up, on the vectorizer side there is a bug costing the vector side too much for the lane extracts (which makes the target cost issue even more pronounced once fixed). There are two problems with the vector code:

.L7:
        subq    $4, %rax
.L3:
        vmovq   (%rax), %xmm0
        vmovd   %xmm0, %edx
        vpextrd $1, %xmm0, %ecx
        cmpl    %edx, %ecx
        jnb     .L6
        vpshufd $225, %xmm0, %xmm0
        vmovq   %xmm0, (%rax)
        cmpq    %rdi, %rax
        jne     .L7

On zen4 the moves from vector to GPR are expensive.
But the most apparent issue is the load-to-store forwarding conflict: we store 8 bytes to (%rax) and immediately load 8 bytes from 4(%rax) in the next iteration. That is something the BB vectorizer does not consider at all (in theory it could look at the data-ref evolution to anticipate this). So trying to attack this from the cost-modeling side is unlikely to cover this aspect of the issue.

Placing #pragma GCC novector before the inner loop is a workaround that works in GCC 14 and up.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations