https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123082

            Bug ID: 123082
           Summary: missed optimization: large local array in recursive
                    function needlessly kept alive across recursive calls
           Product: gcc
           Version: 15.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: skywave2023 at outlook dot com
  Target Milestone: ---

Steps to reproduce
------------------

1. Compile the following test case with GCC at a high optimization level, e.g.:

   g++ -std=gnu++23 -Ofast -S mgsort.cc -o mgsort.s

   Tested with: g++ (GCC) 15.2.0 on x86_64-pc-linux-gnu.

2. Inspect the generated assembly for mgsort(int, int).

Test case (Compiler Explorer: https://godbolt.org/z/WYarKTTjW)
---------

#include <algorithm>
#include <cstring>

constexpr int N = 100000 + 1;

int a[N];

void mgsort(int l, int r) {
    if (l == r) {
        return;
    }
    int mid = (l + r) >> 1;
    mgsort(l, mid);
    mgsort(mid + 1, r);

    int b[N];
    std::merge(a + l, a + mid + 1, a + mid + 1, a + r + 1, b + l);
    std::memcpy(a + l, b + l, sizeof(int) * (r - l + 1));
}

(Any similar testcase where a large local array is only used after both
recursive calls should be equivalent.)

Actual results
--------------

For non-leaf calls, GCC allocates the full `int b[N]` array in the prologue of
`mgsort` before making any recursive calls, and keeps this stack space reserved
until the function returns.

On x86_64, the prologue looks like (simplified):

    cmp     edi, esi
    je      .Lbase_case
    push    r15
    push    r14
    push    r13
    push    r12
    push    rbp
    push    rbx
    sub     rsp, 400040          # reserve ~400k for local array b[N]
    call    mgsort               # recursive call (left half)
    call    mgsort               # recursive call (right half)
    ...
    lea     r13, [rsp+16+rdx]    # use rsp-based region as b + l
    ...                          # merge + memcpy
    add     rsp, 400040
    pop     rbx
    pop     rbp
    pop     r12
    pop     r13
    pop     r14
    pop     r15
    ret

So each recursive frame keeps its own full `b[N]` array on the stack even
though `b` is only accessed *after* both recursive calls return.

With N = 100000 + 1, sizeof(b) is about 400 kB.  The recursion depth of this
merge-sort is O(log N) (about 17 for this N), so just this single local array
accounts for roughly depth * sizeof(b) ≈ 6–7 MB of stack usage, even though
there is never any need to have more than one `b` “live” at a time.

Expected results
----------------

Since the address of `b` never escapes and is only passed to non-recursive
library functions (`std::merge` and `memcpy`), and no pointer to `b` is passed
into the recursive calls of `mgsort`, there is no observable way for a
recursive call to access its caller's `b`.

This suggests that GCC could legally shrink the *actual lifetime of the
storage* for `b` so that it does not overlap the recursive calls, for example
by:

  * moving the stack allocation for `b` (the `sub rsp, ...`) from the function
prologue to just before the merge+memcpy code, and
  * freeing it (the `add rsp, ...`) immediately after `memcpy` returns, before
the function returns to its caller.

With such an optimization, at run time there would be at most one active
instance of `b` on the stack at any given moment: when a deeper recursive call
allocates its own `b`, the callers would not yet have allocated theirs (because
they only allocate `b` *after* their recursive calls have finished).

This would reduce the stack usage due to `b` from approximately `depth *
sizeof(b)` bytes to roughly `1 * sizeof(b) + O(depth)` bytes, without changing
the observable behavior of the program, as far as I can tell.

In other words, this looks like a missed opportunity to use lifetime analysis
to avoid keeping large local arrays alive across recursive calls.

Additional information
----------------------

* The same behavior is observed with both -O3 and -Ofast.
* The option -fstack-reuse=all is already the default and does not address this
issue, because it only controls reuse of stack slots for local variables and
temporaries *within the same frame*, not across different recursive
activations.  So there seems to be no command-line option that enables this
optimization.  (See the documentation of -fstack-reuse.)
* From the source-level code, the first use of `b` is strictly after both
recursive calls, and no pointer to `b` is stored anywhere that could be used by
a recursive call.  So it seems to be safe to reorder the actual allocation of
storage for `b` as described above, under the “as-if” rule.

If this reasoning is correct, I would expect GCC to eventually be able to
shrink the lifetime of such large local arrays in recursive functions, or at
least consider this as an optimization opportunity.

Reply via email to