https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94037
Bug ID: 94037
Summary: Runtime varies 2x just by order of two int assignments
Product: gcc
Version: 9.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ncm at cantrip dot org
Target Milestone: ---
(This report re-uses some code from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165 but identifies an entirely
different problem.)
Given:
```
bool swap_if(bool c, int& a, int& b) {
int v[2] = { a, b };
#ifdef FAST /* 4.6s */
b = v[1-c], a = v[c];
#else /* SLOW, 9.8s */
a = v[c], b = v[!c];
#endif
return c;
}
int* partition(int* begin, int* end) {
int pivot = end[-1];
int* left = begin;
for (int* right = begin; right < end - 1; ++right) {
left += swap_if(*right <= pivot, *left, *right);
}
int tmp = *left; *left = end[-1], end[-1] = tmp;
return left;
}
void quicksort(int* begin, int* end) {
while (end - begin > 1) {
int* mid = partition(begin, end);
quicksort(begin, mid);
begin = mid + 1;
} }
static const int size = 100'000'000;
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>
int main(int, char**) {
int fd = ::open("1e8ints", O_RDONLY);
int perms = PROT_READ|PROT_WRITE;
int flags = MAP_PRIVATE|MAP_POPULATE|MAP_NORESERVE;
auto* a = (int*) ::mmap(nullptr, size * sizeof(int), perms, flags, fd, 0);
quicksort(a, a + size);
return a[0] == a[size - 1];
}
```
after
```
$ dd if=/dev/urandom of=1e8ints bs=1000000 count=400
```
The run time of the the program above, built "-O3 -march=skylake"
vs. "-DFAST -O3 -march=skylake", varies by 2x on Skylake, similarly
on Haswell. Both cases are almost equally fast on Clang, matching
G++'s fast version. The difference between "!c" and "1-c" in the
array index exacerbates the disparity.
Godbolt `<https://godbolt.org/z/w-buUF>` says, slow:
```
movl (%rax), %edx
movl (%rbx), %esi
movl %esi, 8(%rsp)
movl %edx, 12(%rsp)
cmpl %edx, %ecx
setge %sil
movzbl %sil, %esi
movl 8(%rsp,%rsi,4), %esi
movl %esi, (%rbx)
setl %sil
movzbl %sil, %esi
movl 8(%rsp,%rsi,4), %esi
movl %esi, (%rax)
```
and 2x as fast:
```
movl (%rax), %ecx
cmpl %ecx, %r8d
setge %dl
movzbl %dl, %edx
movl (%rbx), %esi
movl %esi, 8(%rsp)
movl %ecx, 12(%rsp)
movl %r9d, %esi
subl %edx, %esi
movslq %esi, %rsi
movl 8(%rsp,%rsi,4), %esi
movl %esi, (%rax)
movslq %edx, %rdx
movl 8(%rsp,%rdx,4), %edx
movl %edx, (%rbx)
cmpl %ecx, %r8d
```
Clang 9.0.0, -DFAST, for reference:
```
movl (%rcx), %r11d
xorl %edx, %edx
xorl %esi, %esi
cmpl %r8d, %r11d
setle %dl
setg %sil
movl (%rbx), %eax
movl %eax, (%rsp)
movl %r11d, 4(%rsp)
movl (%rsp,%rsi,4), %eax
movl %eax, (%rcx)
movl (%rsp,%rdx,4), %eax
movl %eax, (%rbx)
```