https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244
Bug ID: 92244
Summary: extra sub inside vectorized loop instead of calculating end-pointer
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
We get a redundant instruction inside the vectorized loop here. It's not a
separate *counter*, though; it's a duplicate of the tail pointer.
It goes away if we find tail by scanning for the terminator
(while(*tail) ++tail;) instead of calculating it from head + len.
This only happens with vectorization, not with pure scalar code (bug 92243 is
about the fact that -O3 fails to use bswap as a GP-integer shuffle to
auto-vectorize without x86 SSSE3).
typedef char swapt;
void strrev_explicit(swapt *head, long len)
{
    swapt *tail = head + len - 1;
    for( ; head < tail; ++head, --tail) {
        swapt h = *head, t = *tail;
        *head = t;
        *tail = h;
    }
}
https://godbolt.org/z/wdGv4S
Compiling with g++ -O3 -march=sandybridge gives us a main loop of
...
        movq    %rcx, %rsi              # RSI = RCX before entering the loop
        addq    %rdi, %r8
.L4:
        vmovdqu (%rcx), %xmm3           # tail load from RCX
        addq    $16, %rax               # head
        subq    $16, %rcx               # tail
        subq    $16, %rsi               # 2nd tail?
        vmovdqu -16(%rax), %xmm0
        vpshufb %xmm2, %xmm3, %xmm1
        vmovups %xmm1, -16(%rax)
        vpshufb %xmm2, %xmm0, %xmm0
        vmovups %xmm0, 16(%rsi)         # tail store to RSI
        cmpq    %r8, %rax               # } while(head != end_head)
        jne     .L4
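RSI = RCX on entry to the loop, and both are decremented by 16 on every
iteration, so they stay equal throughout. The second sub is obviously
pointless: the tail store could address 16(%rcx) directly.
head uses the same register (RAX) for loads and stores.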
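To make the expected loop shape concrete, here is a rough portable-C model of
it. This is only a sketch, not GCC output: strrev_model, reverse_block and VEC
are hypothetical names, and reverse_block merely stands in for vpshufb with a
byte-reversing shuffle control. The point is that there are just two pointer
updates per iteration, with the tail load and the tail store going through
the same pointer, and an end pointer for head computed once before the loop.

```c
#include <string.h>

enum { VEC = 16 };  /* bytes per vector, matching the xmm loop in the report */

/* Hypothetical helper: byte-reverse one VEC-sized block in place.
   Stands in for vpshufb with a reversing shuffle control. */
static void reverse_block(unsigned char *b)
{
    for (int i = 0, j = VEC - 1; i < j; ++i, --j) {
        unsigned char tmp = b[i];
        b[i] = b[j];
        b[j] = tmp;
    }
}

/* Hypothetical model of the desired loop, not the compiler's actual output. */
void strrev_model(char *head, long len)
{
    if (len < 2)
        return;

    char *tail = head + len;                          /* one past the last byte */
    char *end_head = head + (len / (2 * VEC)) * VEC;  /* loop bound, computed once */

    while (head != end_head) {
        unsigned char h[VEC], t[VEC];

        tail -= VEC;              /* the only tail update */
        memcpy(t, tail, VEC);     /* tail load */
        memcpy(h, head, VEC);     /* head load */
        reverse_block(t);
        reverse_block(h);
        memcpy(head, t, VEC);     /* head store */
        memcpy(tail, h, VEC);     /* tail store, same pointer as the tail load */
        head += VEC;              /* the only head update */
    }

    /* Scalar cleanup for the middle bytes that are left (fewer than 2*VEC). */
    for (--tail; head < tail; ++head, --tail) {
        char c = *head;
        *head = *tail;
        *tail = c;
    }
}
```
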
After the main loop we get a bloated, fully-unrolled scalar cleanup, instead
of using the shuffle control for 8-byte vectors -> movhps. Or scalar bswap.
Ideally we'd do something clever at the overlap like one load + shuffle +
store, but we might have to load the next vector before storing the current
one to make this work at the overlap. That would presumably require more
special-casing of this kind of meet-in-the-middle loop.
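As an illustration of the scalar-bswap idea, something like the following
could handle the leftover middle region 8 bytes at a time from each end. This
is only a sketch under my own assumptions: strrev_bswap64 is a hypothetical
name, and it relies on the GCC/clang builtin __builtin_bswap64. It does not
attempt the "one load + shuffle + store" trick at the overlap; whatever is
left once fewer than 16 bytes remain falls back to a byte loop.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: reverse len bytes in place, swapping 8-byte chunks
   from both ends with a scalar byte swap (GCC/clang builtin). */
void strrev_bswap64(char *head, long len)
{
    if (len < 2)
        return;

    char *tail = head + len;           /* one past the last byte */

    /* While two non-overlapping 8-byte chunks remain, swap them. */
    while (tail - head >= 16) {
        uint64_t h, t;

        memcpy(&h, head, 8);           /* unaligned-safe loads */
        memcpy(&t, tail - 8, 8);
        h = __builtin_bswap64(h);      /* reverse the bytes within each chunk */
        t = __builtin_bswap64(t);
        memcpy(head, &t, 8);           /* reversed tail chunk goes to the head */
        memcpy(tail - 8, &h, 8);       /* reversed head chunk goes to the tail */
        head += 8;
        tail -= 8;
    }

    /* Byte-at-a-time cleanup for the at most 15 bytes left in the middle. */
    for (--tail; head < tail; ++head, --tail) {
        char c = *head;
        *head = *tail;
        *tail = c;
    }
}
```

The memcpy calls are there so the loads and stores are valid at any alignment;
compilers normally turn them into plain 8-byte moves.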
----
The implicit-length version doesn't have this extra sub in the main loop.
void strrev_implicit(swapt *head)
{
    swapt *tail = head;
    while(*tail) ++tail;   // find the 0 terminator, like head+strlen
    --tail;                // tail points to the last real char
    for( ; head < tail; ++head, --tail) {
        swapt h = *head, t = *tail;
        *head = t;
        *tail = h;
    }
}
.L22:
        vmovdqu (%rcx), %xmm3
        addq    $16, %rdx               # head
        subq    $16, %rcx               # tail
        vmovdqu -16(%rdx), %xmm0
        vpshufb %xmm2, %xmm3, %xmm1
        vmovups %xmm1, -16(%rdx)
        vpshufb %xmm2, %xmm0, %xmm0
        vmovups %xmm0, 16(%rcx)
        cmpq    %rsi, %rdx              # } while(head != end_head)
        jne     .L22