On 29. 5. 2026 12:53, Branko Čibej wrote:
On 29. 5. 2026 12:16, Branko Čibej wrote:
On 29. 5. 2026 08:50, Branko Čibej wrote:
On 29. 5. 2026 00:17, Jun Omae wrote:
Hi,
On 2026/05/29 4:37, Branko Čibej wrote:
This intermittent failure shows up sometimes on both Linux and macOS:
FAIL: ra-test: Unknown test failure (-11); see tests.log.
That's a crash, 11 is SIGSEGV on both OSes. Sometimes the same happens in
repos-test, too, and it doesn't matter if its local or DAV or svnserve. I
suspect that we have a wonderful little bug in our code. I've been trying, on
and off, to trace this down either with a debugger or with clang's address
sanitiser or with APR's pool debugging enabled, but it's an elusive little
bxtrd and I've not yet been able to track it down. Not even so far as to figure
out whether it's hiding in FSFS or somewhere else.
It's been a few years and I think I remember seeing it on the 1.14 branch, too.
On the other hand, I don't recall any bug reports that could be linked to this
observation. We're seeing this failure in the CI tests, too, at least with
autotools. I haven't seen this with CMake, but failures in those workflows are
far more often related to vcpkg or other environmental stuff, so it's a bit
hard to find.
If anyone has any idea where to look without diving into a line-by-line review
of the code, please share your thoughts. This is starting to get on my nerves,
just a little bit.
The issue occurs in serf_default_destroy_and_data() via
apr_pool_cleanup_for_exec() from the child process after apr_proc_create() to
open a tunnel with multi-threaded. I assume that libsvn_ra_serf and/or serf
have something wrong. Also, I guess that the same issue might occur when a hook
is invoked with SVNMasterURI enabled on Apache mpm event or worker.
It happens without DAV, too, and with prefork mpm – the only one
that even works on macOS, so I always use that for testing.
This is what I caught on macOS. It was davautocheck, but Serf is not
involved:
(lldb) target create "/Volumes/svn-test/trunk/subversion/tests/libsvn_ra/.libs/ra-test"
--core "/cores/core.9605"
Core file '/cores/core.9605' (arm64) was loaded.
(lldb) bt
* thread #2, stop reason = ESR_EC_DABORT_EL0 (fault address: 0x0)
* frame #0: 0x0000000104a83524
ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0,
tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7
frame #1: 0x0000000104d6b5f0
libsvn_ra_svn-1.0.dylib`close_tunnel_cleanup(baton=0x000000072d08e258) at
client.c:597:5
frame #2: 0x00000001051fd1b4 libapr-1.0.dylib`apr_pool_destroy + 88
frame #3: 0x00000001051fd1a0 libapr-1.0.dylib`apr_pool_destroy + 68
frame #4: 0x0000000104b353a0
libsvn_ra-1.0.dylib`svn_ra_open5(session_p=0x000000016b5b2ec0,
corrected_url_p=0x0000000000000000, redirect_url_p=0x0000000000000000,
repos_URL="svn+test://localhost/test-repo-tunnel", uuid=0x0000000000000000,
callbacks=0x000000072d03c0d8, callback_baton=0x0000000000000000,
config=0x0000000000000000, pool=0x000000072d040028) at ra_loader.c:395:7
frame #5: 0x0000000104a7992c
ra-test`tunnel_callback_test(opts=0x000000016b3826b8, pool=0x000000072d03c028)
at ra-test.c:469:3
frame #6: 0x0000000104aba8a0
libsvn_test-1.0.dylib`test_thread(thread=0x000000072d01c150,
data=0x000000016b382570) at svn_test_main.c:577:15
frame #7: 0x0000000187549c58 libsystem_pthread.dylib`_pthread_start + 136
Note that this test runs against svnserve with just 'check' too, no
DAV involved at all. So it's either caused by something in
libsvn_ra_svn, or more likely, in ra-test itself. Those tunnels are
notoriously tricky, I remember we had trouble with them before in
JavaHL, too.
Finally got some interesting info from the core file – it takes some
work to get core dumps on the Mac, setting ulimit -c isn't enough.
frame #0: 0x0000000104a83524
ra-test`close_tunnel(tunnel_context=0x000000072d08e6b0,
tunnel_baton=0x000000072d03c0a0) at ra-test.c:329:7
326 apr_proc_wait(b->proc, &child_exit_code,
&child_exit_why, APR_WAIT);
327
328 SVN_TEST_ASSERT_NO_RETURN(child_exit_status ==
APR_CHILD_DONE);
-> 329 SVN_TEST_ASSERT_NO_RETURN(child_exit_code == 0);
330 SVN_TEST_ASSERT_NO_RETURN(child_exit_why ==
APR_PROC_EXIT);
331 }
332 }
(lldb) p child_exit_status
(apr_status_t) 70005
(lldb) p child_exit_code
(int) 10
(lldb) p child_exit_why
(apr_exit_why_e) APR_PROC_SIGNAL | APR_PROC_SIGNAL_CORE
That means svnserve crashed with a SIGBUS. I'll keep investigating.
The SIGBUS is real, but it's not in svnserve – it's in the forked
ra-test and happens before the exec() of svnserve. I had to build apr
and apr-util with debugging enabled to find this, and here's the
relevant part of the stack. For context: On Unix, apr_proc_create()
forks the parent then runs cleanups on the global pool before exec()'ing
the real child.
(lldb) down
frame #3: 0x0000000101114b10
libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087ac8e028) at
apr_pools.c:2702:9
2699 run_child_cleanups(&p->cleanups);
2700
2701 for (p = p->child; p; p = p->sibling)
-> 2702 cleanup_pool_for_exec(p);
2703 }
2704
2705 APR_DECLARE(void) apr_pool_cleanup_for_exec(void)
(lldb) p *p
(apr_pool_t) {
parent = 0x000000087ac44028
child = 0x000000087a844028
sibling = 0x000000087ac94028
ref = 0x000000087ac44030
cleanups = NULL
free_cleanups = 0x000000087ac8e7e8
allocator = 0x0000000101500b40
subprocesses = NULL
abort_fn = 0x0000000101018d94 (libsvn_subr-1.0.dylib`abort_on_pool_failure at
pool.c:53)
user_data = NULL
tag = 0x0000000000000000
active = 0x000000087a800000
self = 0x000000087ac8e000
self_first_avail = 0x000000087ac8e0a0 "file"
pre_cleanups = NULL
}
(lldb) down
frame #2: 0x0000000101114b10
libapr-1.0.dylib`cleanup_pool_for_exec(p=0x000000087a844028) at
apr_pools.c:2702:9
2699 run_child_cleanups(&p->cleanups);
2700
2701 for (p = p->child; p; p = p->sibling)
-> 2702 cleanup_pool_for_exec(p);
2703 }
2704
2705 APR_DECLARE(void) apr_pool_cleanup_for_exec(void)
(lldb) p *p
(apr_pool_t) {
parent = 0x0000000000000a35
child = 0x0000000000000011
sibling = 0x0000000000000002
ref = 0x0000000060232b75
cleanups = 0x0000000000000001
free_cleanups = 0x0000000000000003
allocator = 0x0000000000000011
subprocesses = 0x0000000000000059
abort_fn = 0x0000000000000005
user_data = 0x00000000403dbe48
tag = 0x0000000000000001 ""
active = 0x0000000000000002
self = 0x000000000000006a
self_first_avail = 0x0000000000000001 ""
pre_cleanups = 0x0000000000000006
}
Note he pool structure from frame #2: it's a clear case of memory
corruption. Most of the pointers there are invalid, some are also
misaligned. Calling run_child_cleanups(&p->cleanups) triggers an
immediate SIGBUS because cleanups is misaligned. The fun part now will
be trying to find out if this is a bug in APR pools – unlikely – or a
memory corruption bug in ra-test or libsvn_ra – also unlikely but it has
to be one of these.
What finally pointed me in the right direction is that for every crash
of ra-test, /two/ core files were created. One from the parent which
aborted because the child's exit code was non-zero, and one from the
child. Well, at least it won't be boring.
-- Brane