| Issue |
171606
|
| Summary |
Missed optimization: Clang keeps large fixed local array alive across recursive calls, wasting stack space
|
| Labels |
clang
|
| Assignees |
|
| Reporter |
SkyWaveSun
|
### Summary
Clang keeps a large fixed-size local array in the stack frame of a recursive function alive across recursive calls, even though the array is only used after both recursive calls return and its address does not escape into them.
This leads to stack usage of roughly `depth * sizeof(array)` instead of something close to `sizeof(array) + O(depth)` for a classic mergesort implementation, and seems to be a missed optimization in Clang/LLVM's lifetime analysis / stack allocation.
### Environment
* Clang version: 21.1.0
* Platform: x86_64 (as used by Compiler Explorer / godbolt.org)
* Command line used on Compiler Explorer:
```sh
clang++ -std=c++23 -O3 -S mgsort.cpp -o mgsort.s
```
(On Compiler Explorer this corresponds to choosing Clang and passing `-std=c++23 -O3`.)
* Reproducer link (Compiler Explorer): [https://godbolt.org/z/9bG6jGqjj](https://godbolt.org/z/9bG6jGqjj)
### Test case
```c++
#include <algorithm>
#include <cstring>
constexpr int N = 100000 + 1;
int a[N];
void mgsort(int l, int r) {
if (l == r) {
return ;
}
int mid = (l + r) >> 1;
mgsort(l, mid);
mgsort(mid + 1, r);
int b[N];
std::merge(a + l, a + mid + 1, a + mid + 1, a + r + 1, b + l);
std::memcpy(a + l, b + l, sizeof(int) * (r - l + 1));
}
```
This is intended to be standard C++ (no VLA).
### Actual behavior
For non‑leaf calls, Clang allocates the full `int b[N]` array in the prologue of `mgsort` before making any recursive calls and keeps this stack space reserved until the function returns.
On x86_64 with `-O3`, the generated assembly looks like (simplified):
```asm
mgsort(int, int):
mov eax, esi
sub eax, edi
je .LBB0_14 # base case: l == r
push rbp
push r15
push r14
push r13
push r12
push rbx
sub rsp, 400024 # reserve ~400k of stack for b[N]
... compute mid, save l / r ...
call mgsort # recursive call on left half
...
call mgsort # recursive call on right half
... later, compute the address of b + l on the stack ...
lea r14, [rsp + 4*rax]
add r14, 16 # r14 = b + l (b is based at rsp+16)
... merge + memcpy ...
add rsp, 400024
pop rbx
pop r12
pop r13
pop r14
pop r15
pop rbp
.LBB0_14:
ret
```
So every recursive frame reserves a full `b[N]` array on its stack frame, and this space is live across the recursive calls. With `N = 100000 + 1`, `sizeof(b)` is about 400 kB; the recursion depth is `O(log N)` (around 17 here), so the total stack consumed by this single variable is roughly `17 * 400kB ≈ 6–7 MB`.
However, `b` is only accessed after both recursive calls have returned, and there is no way for a recursive call to access its caller's `b`.
### Expected behavior
Since the first use of `b` is strictly after both recursive calls, and no pointer to `b` is passed into those calls (nor stored anywhere that
they could access), it should be legal for the compiler to shrink the allocation lifetime of `b` so that its storage does not overlap the recursive calls.
Conceptually, the compiler could:
* move the actual stack allocation for `b` from the function prologue to just before the `std::merge` / `memcpy` code, and
* free it immediately after `memcpy` returns.
With such an optimization, at run time there would only be a single active instance of `b` on the stack at any time, reducing stack usage from approximately `depth * sizeof(b)` to about `sizeof(b) + O(depth)`.
This appears to be a missed optimization in Clang/LLVM's handling of large fixed-size local arrays in recursive functions.
### Additional notes
* The issue reproduces for me on Compiler Explorer (see the link above) with `-std=c++23 -O3`. It also appears with other high optimization levels (e.g. `-O3` vs `-Ofast`) where stack allocation is similar.
* A related GCC bug report with the same test case is: [https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123082](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123082)
In that report, GCC developers classify this as a `middle-end` `missed-optimization` and suggest a workaround using a GNU C++ VLA extension. The intention here is similar: to see whether LLVM/Clang could perform a comparable lifetime-shrinking optimization for the standard C++ fixed-array case, without relying on extensions.
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs