On 30. 5. 2026 16:54, Branko Čibej wrote:
On 29. 5. 2026 12:53, Branko Čibej wrote:
On 29. 5. 2026 12:16, Branko Čibej wrote:
On 29. 5. 2026 08:50, Branko Čibej wrote:
On 29. 5. 2026 00:17, Jun Omae wrote:
Hi,

On 2026/05/29 4:37, Branko Čibej wrote:
This intermittent failure shows up sometimes on both Linux and macOS:

FAIL:  ra-test: Unknown test failure (-11); see tests.log.

That's a crash, 11 is SIGSEGV on both OSes. Sometimes the same happens in 
repos-test, too, and it doesn't matter if its local or DAV or svnserve. I 
suspect that we have a wonderful little bug in our code. I've been trying, on 
and off, to trace this down either with a debugger or with clang's address 
sanitiser or with APR's pool debugging enabled, but it's an elusive little 
bxtrd and I've not yet been able to track it down. Not even so far as to figure 
out whether it's hiding in FSFS or somewhere else.

It's been a few years and I think I remember seeing it on the 1.14 branch, too. 
On the other hand, I don't recall any bug reports that could be linked to this 
observation. We're seeing this failure in the CI tests, too, at least with 
autotools. I haven't seen this with CMake, but failures in those workflows are 
far more often related to vcpkg or other environmental stuff, so it's a bit 
hard to find.

If anyone has any idea where to look without diving into a line-by-line review 
of the code, please share your thoughts. This is starting to get on my nerves, 
just a little bit.

The issue occurs in serf_default_destroy_and_data() via 
apr_pool_cleanup_for_exec() from the child process after apr_proc_create() to 
open a tunnel with multi-threaded. I assume that libsvn_ra_serf and/or serf 
have something wrong. Also, I guess that the same issue might occur when a hook 
is invoked with SVNMasterURI enabled on Apache mpm event or worker.

It happens without DAV, too, and with prefork mpm – the only one that even works on macOS, so I always use that for testing.


This is what I caught on macOS. It was davautocheck, but Serf is not involved:

(lldb) target create "/Volumes/svn-test/trunk/subversion/tests/libsvn_ra/.libs/ra-test" 
--core "/cores/core.9605"
Core file '/cores/core.9605' (arm64) was loaded.
(lldb) bt
* thread #2, stop reason = ESR_EC_DABORT_EL0 (fault address: 0x0)
   * frame #0: 0x0000000104a83524 
ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0, 
tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7
     frame #1: 0x0000000104d6b5f0 
libsvn_ra_svn-1.0.dylib`close_tunnel_cleanup(baton=0x000000072d08e258) at 
client.c:597:5
     frame #2: 0x00000001051fd1b4 libapr-1.0.dylib`apr_pool_destroy + 88
     frame #3: 0x00000001051fd1a0 libapr-1.0.dylib`apr_pool_destroy + 68
     frame #4: 0x0000000104b353a0 
libsvn_ra-1.0.dylib`svn_ra_open5(session_p=0x000000016b5b2ec0, 
corrected_url_p=0x0000000000000000, redirect_url_p=0x0000000000000000, 
repos_URL="svn+test://localhost/test-repo-tunnel", uuid=0x0000000000000000, 
callbacks=0x000000072d03c0d8, callback_baton=0x0000000000000000, 
config=0x0000000000000000, pool=0x000000072d040028) at ra_loader.c:395:7
     frame #5: 0x0000000104a7992c 
ra-test`tunnel_callback_test(opts=0x000000016b3826b8, pool=0x000000072d03c028) 
at ra-test.c:469:3
     frame #6: 0x0000000104aba8a0 
libsvn_test-1.0.dylib`test_thread(thread=0x000000072d01c150, 
data=0x000000016b382570) at svn_test_main.c:577:15
     frame #7: 0x0000000187549c58 libsystem_pthread.dylib`_pthread_start + 136


Note that this test runs against svnserve with just 'check' too, no DAV involved at all. So it's either caused by something in libsvn_ra_svn, or more likely, in ra-test itself. Those tunnels are notoriously tricky, I remember we had trouble with them before in JavaHL, too.

Finally got some interesting info from the core file – it takes some work to get core dumps on the Mac, setting ulimit -c isn't enough.

frame #0: 0x0000000104a83524 
ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0, 
tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7
    326                 apr_proc_wait(b->proc, &child_exit_code, 
&child_exit_why, APR_WAIT);
    327         
    328               SVN_TEST_ASSERT_NO_RETURN(child_exit_status == 
APR_CHILD_DONE);
-> 329             SVN_TEST_ASSERT_NO_RETURN(child_exit_code == 0);
    330               SVN_TEST_ASSERT_NO_RETURN(child_exit_why == 
APR_PROC_EXIT);
    331             }
    332         }
(lldb) p child_exit_status
(apr_status_t) 70005
(lldb) p child_exit_code
(int) 10
(lldb) p child_exit_why
(apr_exit_why_e) APR_PROC_SIGNAL | APR_PROC_SIGNAL_CORE


That means svnserve crashed with a SIGBUS. I'll keep investigating.

The SIGBUS is real, but it's not in svnserve – it's in the forked ra-test and happens before the exec() of svnserve. I had to build apr and apr-util with debugging enabled to find this, and here's the relevant part of the stack. For context: On Unix, apr_proc_create() forks the parent then runs cleanups on the global pool before exec()'ing the real child.


(lldb) down
frame #3: 0x0000000101114b10 
libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087ac8e028) at 
apr_pools.c:2702:9
    2699            run_child_cleanups(&p->cleanups);
    2700        
    2701            for (p = p->child; p; p = p->sibling)
-> 2702              cleanup_pool_for_exec(p);
    2703        }
    2704        
    2705        APR_DECLARE(void) apr_pool_cleanup_for_exec(void)
(lldb) p *p
(apr_pool_t) {
   parent = 0x000000087ac44028
   child = 0x000000087a844028
   sibling = 0x000000087ac94028
   ref = 0x000000087ac44030
   cleanups = NULL
   free_cleanups = 0x000000087ac8e7e8
   allocator = 0x0000000101500b40
   subprocesses = NULL
   abort_fn = 0x0000000101018d94 (libsvn_subr-1.0.dylib`abort_on_pool_failure 
at pool.c:53)
   user_data = NULL
   tag = 0x0000000000000000
   active = 0x000000087a800000
   self = 0x000000087ac8e000
   self_first_avail = 0x000000087ac8e0a0 "file"
   pre_cleanups = NULL
}
(lldb) down
frame #2: 0x0000000101114b10 
libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087a844028) at 
apr_pools.c:2702:9
    2699            run_child_cleanups(&p->cleanups);
    2700        
    2701            for (p = p->child; p; p = p->sibling)
-> 2702              cleanup_pool_for_exec(p);
    2703        }
    2704        
    2705        APR_DECLARE(void) apr_pool_cleanup_for_exec(void)
(lldb) p *p
(apr_pool_t) {
   parent = 0x0000000000000a35
   child = 0x0000000000000011
   sibling = 0x0000000000000002
   ref = 0x0000000060232b75
   cleanups = 0x0000000000000001
   free_cleanups = 0x0000000000000003
   allocator = 0x0000000000000011
   subprocesses = 0x0000000000000059
   abort_fn = 0x0000000000000005
   user_data = 0x00000000403dbe48
   tag = 0x0000000000000001 ""
   active = 0x0000000000000002
   self = 0x000000000000006a
   self_first_avail = 0x0000000000000001 ""
   pre_cleanups = 0x0000000000000006
}

Note he pool structure from frame #2: it's a clear case of memory corruption. Most of the pointers there are invalid, some are also misaligned. Calling run_child_cleanups(&p->cleanups) triggers an immediate SIGBUS because cleanups is misaligned. The fun part now will be trying to find out if this is a bug in APR pools – unlikely – or a memory corruption bug in ra-test or libsvn_ra – also unlikely but it has to be one of these.

What finally pointed me in the right direction is that for every crash of ra-test, /two/ core files were created. One from the parent which aborted because the child's exit code was non-zero, and one from the child. Well, at least it won't be boring.


After spending way too much time reviewing APR code and concluding that APR's pools aren't to blame; I created a small test program that exercises just the process creation part of the whole – see attached. Predictably, it works fine. I also spent time running 'ra-test --parallel' in a Linux VM (UTM aarch64, so there's no CPU emulation, only Apple's virtualisation layer) and I couldn't reproduce the problem. I could consistently reproduce it on the mac in similar circumstances.

I'm now left with only one possible avenue of exploration: that the child cleanup functions that are run after fork() but before exec() in the child process are somehow corrupting the pool structure, perhaps by using syscalls that are not safe on macOS but are safe, for example, on Linux. Ideally I'd try this test on FreeBSD and/or OpenBSD to see if it is somehow related to the BSD vs. Linux divide (macOS is Darwin is BSD), but a) I don't have Subversion build environments on those VMs and b) I'd not be surprised if Darwin and the other BSDs have drifted far enough apart that this isn't a valid test. If for no other reason than their kernels are completely different.

Any help or additional insight would be much appreciated at this point. Not, not from you, you upstart LLM who're wrong 70% of the time. :(

-- Brane
APR_CONFIG = apr-1-config

macos_fork_exec: macos_fork_exec.c
        $$($(APR_CONFIG) --cc) $$($(APR_CONFIG) --cppflags --includes --cflags) 
\
        $< -o $@ $$($(APR_CONFIG) --ldflags --link-ld --libs)

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#include <apr.h>
#include <apr_pools.h>
#include <apr_strings.h>
#include <apr_thread_mutex.h>
#include <apr_thread_proc.h>
#include <apr_time.h>


__attribute__((noreturn))
static void print_args(int argc, const char *argv[])
{
    while (argc > 1)
    {
        /* printf("%s\n", argv[1]); */
        --argc;
        ++argv;
    }
    exit(0);
}

static void checked(const char *message, apr_status_t status)
{
    if (status != APR_SUCCESS)
    {
        fprintf(stderr, "%s status %ld", message, (long)status);
        abort();
    }
}

static apr_status_t pool_abort(apr_status_t status)
{
    fprintf(stderr, "allocation %ld", (long)status);
    abort();
    return APR_EGENERAL;
}


typedef struct thread_data
{
    apr_pool_t *thread_pool;
    const char *thread_name;
    const char *program_name;
} thread_data_t;


static void *thread_func(apr_thread_t *thread, void *data)
{
    const thread_data_t *td = data;
    apr_pool_t *iterpool;
    int i;

    printf("%s started\n", td->thread_name);
    checked("create iteration pool",
            apr_pool_create(&iterpool, td->thread_pool));
    apr_sleep(1000);

    for (int i = 0; i < 100; ++i)
    {
        apr_procattr_t *proc_attrs;
        const char *proc_args[3];
        apr_proc_t child;
        apr_status_t child_status = APR_CHILD_NOTDONE;
        apr_exit_why_e exitwhy;
        int exitcode;

        apr_pool_clear(iterpool);

        checked("create proccess attrs",
                apr_procattr_create(&proc_attrs, iterpool));
        checked("configure process streams",
                apr_procattr_io_set(proc_attrs,
                                    APR_NO_PIPE, APR_NO_PIPE, APR_NO_PIPE));

        proc_args[0] = td->program_name;
        proc_args[1] = apr_psprintf(iterpool, "child %s [%d]", td->thread_name, 
i);
        proc_args[2] = NULL;
        checked("create child process",
                apr_proc_create(&child, td->program_name,
                                proc_args, NULL, proc_attrs,
                                iterpool));

        while (child_status != APR_CHILD_DONE)
            child_status = apr_proc_wait(&child, &exitcode, &exitwhy, APR_WAIT);

        if (exitcode != 0)
        {
            if (child_status == APR_PROC_EXIT)
                fprintf(stderr, "error: child %s [%d] exited with unexpected 
code %d\n",
                        td->thread_name, i, exitcode);
            else
            {
                fprintf(stderr, "error: child %s [%d] exited on signal %d\n",
                        td->thread_name, i, exitcode);
                abort();
            }
        }
    }

    printf("%s ended\n", td->thread_name);
    return NULL;
}


int main(int argc, const char *argv[])
{
    apr_thread_t* thread[100];
    const apr_size_t num_threads = sizeof(thread)/sizeof(thread[0]);

    apr_allocator_t *allocator;
    apr_pool_t *pool;
    int i;

    apr_initialize();
    atexit(apr_terminate);

    if (argc > 1)
        print_args(argc, argv);

    checked("create global pool",
            apr_pool_create(&pool, NULL));
    allocator = apr_pool_allocator_get(pool);
    if (apr_allocator_mutex_get(allocator) == NULL)
    {
        apr_thread_mutex_t *mutex;
        checked("create allocator mutex",
                apr_thread_mutex_create(&mutex, APR_THREAD_MUTEX_DEFAULT,
                                        apr_allocator_owner_get(allocator)));
        apr_allocator_mutex_set(allocator, mutex);
    }
    apr_pool_abort_set(pool_abort, pool);

    for (i = 0; i < num_threads; ++i)
    {
        apr_allocator_t *thread_allocator;
        apr_pool_t *thread_pool;
        apr_threadattr_t *thread_attrs;
        thread_data_t *td;

        checked("create thread allocator",
                apr_allocator_create(&thread_allocator));
        checked("create thread pool",
                apr_pool_create_ex(&thread_pool, pool, NULL, thread_allocator));

        checked("create thread attributes",
                apr_threadattr_create(&thread_attrs, thread_pool));
        checked("set thread detached state",
                apr_threadattr_detach_set(thread_attrs, 0));

        td = apr_palloc(thread_pool, sizeof(*td));
        td->thread_pool = thread_pool;
        td->thread_name = apr_psprintf(thread_pool, "thread %d", i);
        td->program_name = argv[0];

        checked("create new thread",
                apr_thread_create(&thread[i], thread_attrs, thread_func,
                                  td, thread_pool));
    }

    for (i = 0; i < num_threads; ++i)
    {
        apr_status_t thread_status;
        checked("wait for thread",
                apr_thread_join(&thread_status, thread[i]));
        if (thread_status != APR_SUCCESS)
            fprintf(stderr, "thread %d status %ld\n", i, (long)thread_status);
    }

    return 0;
}

Reply via email to