https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91981
            Bug ID: 91981
           Summary: Speed degradation because of inlining a register
                    clobbering function
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: antoshkka at gmail dot com
  Target Milestone: ---

Consider the following example, a simplified version of
boost::container::small_vector:
#define MAKE_INLINING_BAD 1

struct vector {
    int* data_;
    int* capacity_;
    int* size_;

    void push_back(int v) {
        if (capacity_ > size_) {
            *size_ = v;
            ++size_;
        } else {
            reallocate_and_push(v);
        }
    }

    void reallocate_and_push(int v)
#if MAKE_INLINING_BAD
    {
        // Just some code that clobbers many registers.
        // You may skip reading it.
        const auto old_cap = capacity_ - data_;
        const auto old_size = size_ - data_;
        const auto new_cap = old_cap * 2 + 1;
        auto new_data_1 = new int[new_cap];
        auto new_data = new_data_1;
        for (int* old_data = data_; old_data != size_; ++old_data, ++new_data) {
            *new_data = *old_data;
        }
        delete[] data_;
        data_ = new_data_1;
        size_ = new_data_1 + old_size;
        capacity_ = new_data_1 + new_cap;
        *size_ = v;
        ++size_;
    }
#else
    ;
#endif
};
void bad_inlining(vector& v) {
    v.push_back(42);
}
With `#define MAKE_INLINING_BAD 0` the generated code is quite good:
bad_inlining(vector&):
        mov     rax, QWORD PTR [rdi+16]
        cmp     QWORD PTR [rdi+8], rax
        jbe     .L2
        mov     DWORD PTR [rax], 42
        add     rax, 4
        mov     QWORD PTR [rdi+16], rax
        ret
.L2:
        mov     esi, 42
        jmp     vector::reallocate_and_push(int)
However, with `#define MAKE_INLINING_BAD 1` the compiler decides to inline
`reallocate_and_push`, which clobbers many callee-saved registers. So the
compiler saves those registers on the stack before even doing the cmp+jbe:
bad_inlining(vector&):
        push    r13              ; don't need those for the `(capacity_ > size_)` case
        push    r12              ; likewise
        push    rbp              ; likewise
        push    rbx              ; likewise
        mov     rbx, rdi         ; likewise
        sub     rsp, 8           ; likewise
        mov     rdx, QWORD PTR [rdi+8]
        mov     rax, QWORD PTR [rdi+16]
        cmp     rdx, rax
        jbe     .L2
        mov     DWORD PTR [rax], 42
        add     rax, 4
        mov     QWORD PTR [rdi+16], rax
        add     rsp, 8           ; don't need those for the `(capacity_ > size_)` case
        pop     rbx              ; likewise
        pop     rbp              ; likewise
        pop     r12              ; likewise
        pop     r13              ; likewise
        ret
.L2:
        ; vector::reallocate_and_push(int) implementation goes here
This greatly degrades the performance of the hot branch (more than a 3x
slowdown in real code).

A possible fix would be to shrink-wrap the prologue/epilogue: sink all the
push/pop operations into the cold block next to the inlined
`reallocate_and_push`:
bad_inlining(vector&):
        mov     rax, QWORD PTR [rdi+16]
        cmp     QWORD PTR [rdi+8], rax
        jbe     .L2
        mov     DWORD PTR [rax], 42
        add     rax, 4
        mov     QWORD PTR [rdi+16], rax
        ret
.L2:
        push    r13
        push    r12
        push    rbp
        push    rbx
        mov     rbx, rdi
        sub     rsp, 8
        ; vector::reallocate_and_push(int) implementation goes here
        add     rsp, 8
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        ret
Godbolt playground: https://godbolt.org/z/oDutOd