Den ons 18 dec. 2024 kl 16:43 skrev Guido Jäkel <g.jae...@dnb.de>: > > > Did I understand you correctly that you were seeing the issue before the > upgrade as well, but more frequent now? > Yes, we saw it also before during the last 12m using the "previous" > version (lxc container image) of our subversion service. But this was to > seldom to really care about. Now, with the "current" version -- refreshed > this days -- it happens within hours. To build software here, there's also > an about one year more current build chain or system libs like the glibc. > This could cause a mutex issue to become more problematic because of > different timings. > > We never observe the issue in our test or approval stage, but there's no > "real world load" on the subversion service. > > The container is also executed together with a whole bunch on one of our > blade server with 112 cores, but the Subversion container is limited to two > cores. > > > Reading up on it, I see that this is exactly the same failure you got. > I'm leaning towards an APR issue (possibly related to the exact > kernel/libpthread you are using), I never got around to float it on the APR > list since it is still quite a bit of "Subversion code" involved. My next > step would be to try to write a minimum reproduction application with a > bunch of threads all randomy trying to get and release the same mutex. > > If you provide me the source code I'm probably able to compile an run it > here, too. >
I have attached my test program which tries to exercise the mutex libraries. The program contains a simple linked list with 100 threads adding 100 items to the end of the list. In the end, the list should contain 100 items for each thread with a counter incrementing by 1 (per thread), but with threads intermixed randomly. There is of course a race condition updating the pointer to the last node of the linked list where two threads mustn't update it simultaneously, so it is protected by a mutex. The Makefile must be adjusted to point to the correct locations of the APR library. After compiling, the program can be executed with no argument (which defaults to using a mutex) or with a single "n" to run without mutex. Sample run: [[[ $ ./main Creating mutexes and threads... Done Waiting for threads to finish... Finished Checking for errors... No errors! $ ./main n Creating mutexes and threads... Done Waiting for threads to finish... Finished Checking for errors... Thread 32 missing 6 found 7 Thread 48 missing 2 found 3 Thread 30 missing 14 found 15 ]]] To me, this proves that my basic program is correct and that with a mutex, the shared memory is protected and without the mutex we get errors. I have tested this under both Ubuntu and Guix with the same results. The Guix environment is the same environment where I have been able to reproduce the assert() in Subversion which started the discussion. So it seems that the mutex code is not the culprit. But I still don't understand why Subversion, running on Ubuntu with threading enabled can pass the test suite while the same code running under Guix is failing. I would expect that my test program also succeeds under Gentoo, but it would still be interesting to see. ... I realise we accidentally had some discussion offlist, so for the benefit of the list I will repeat one question I've already asked: - Is it possible to run the Subversion test suite (make check -jN PARALLEL=N, where N is the number of parallell threads during testing) and does it randomly fail? If you check the tests.log file after a failure, do you see the same assertion as in svnserve? Cheers, Daniel
#include <apr_pools.h>
#include <apr_portable.h>
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
/* Struct for a linked list, store the thread identifier and loop count */
typedef struct ll
{
int thread;
int count;
struct ll *next;
} ll_t;
/* An APR mutex */
apr_thread_mutex_t *ll_mutex = NULL;
/* How many threads to run */
#define THREADS 100
/* Array of threads */
apr_thread_t *threads[THREADS];
/* How many times the loop will run within each thread */
#define COUNTS 100
/* The current HEAD of the linked list. Each thread will try to update this */
ll_t *head = 0;
/* If we should use mutexes to protect the update of *head */
int useMutex = -1;
/* Macro taking care of APR error handling for functions returning apr_status_t */
#define err(statcode) if (statcode != 0) { \
char errmsg[80]; \
apr_strerror(statcode, errmsg, sizeof(errmsg)); \
printf("Error %d on line %d: %s\n", statcode, __LINE__, errmsg); \
exit(statcode); \
}
/* The actual work performed by each thread */
void *threadfunc(apr_thread_t *thread, void *data)
{
/* New item to be inserted into the linked list */
ll_t *curr;
/* The thread id is the loop counter when we create the thread. Store
the value in a local variable since it might change */
int threadid = *(int*)data;
for (int i=0; i<COUNTS; i++) {
/* Create a new linked list item and populate with data.
We set ->count to the loop counter, meaning there should be one item
for each number in [0 .. COUNTS-1]! */
curr = malloc(sizeof(ll_t));
curr->next = 0;
curr->thread = threadid;
curr->count = i;
/* Append the item to the end of the linked list. Aquire the mutext unless
command line arguments told us to NOT use the mutex. */
if (useMutex)
err(apr_thread_mutex_lock(ll_mutex));
head->next = curr;
head = curr;
/* Sleep some to increase the chance that some other thread tries to update head */
usleep(10);
apr_thread_yield();
/* Finally release the mutex and let some other threads work */
if (useMutex)
err(apr_thread_mutex_unlock(ll_mutex));
usleep(10);
apr_thread_yield();
}
/* We're done - exit the thread and return */
apr_thread_exit(thread, APR_SUCCESS);
return NULL;
}
int main(int argc, char *argv[])
{
/* Pointer to the start of the linked list, used only in the main thread */
ll_t *start = 0;
/* This is an array of the expected value of the next .count for the given thread */
int next[THREADS];
/* The command line argument n will make us not use a mutex */
if (argc > 1 && strcmp(argv[1], "n") == 0)
useMutex = 0;
/* Initialize APR and create a memory pool */
err(apr_initialize());
atexit(apr_terminate);
apr_pool_t *pool;
err(apr_pool_create(&pool, NULL));
/* Allocate some memory for the initial linked list entry */
start = malloc(sizeof(ll_t));
start->thread = -1;
start->count = -1;
head = start;
/* Create the threads, having threadfunc do the work */
apr_threadattr_t *threadattr;
err(apr_threadattr_create(&threadattr, pool));
printf("Creating mutex and threads...");
err(apr_thread_mutex_create(&ll_mutex, APR_THREAD_MUTEX_DEFAULT, pool));
for (int i = 0; i < THREADS; ++i) {
/* We borrow the next array to pass the thread number to threadfunc *data
to make sure each thread has its own memory for thread id */
next[i] = i;
apr_thread_create(&threads[i], threadattr, threadfunc, &next[i], pool);
}
printf(" Done\n");
/* Wait for threads to finish */
printf("Waiting for threads to finish...");
for (int i = 0; i < THREADS; ++i) {
apr_status_t retval;
apr_thread_join(&retval, threads[i]);
/* Reset the next array to only zero - this is the expected .count for the first item */
next[i] = 0;
}
printf(" Finished\n");
/* Walk through the linked list
We expect a list of items with random mixed ->thread number, but for each
->thread, the ->count should increment by 1 each time that thead numer appears.
This could fail if two threads tried to update head at the same time, in which case
some ->count would be missing */
printf("Checking for errors...");
int errors = 0;
head = start->next;
while (head) {
/* Uncomment the following line to print the content of the linked list in memory order */
/* printf("Current %p Thread %d Count %d Next %p\n", head, head->thread, head->count, head->next); */
assert(head->thread >= 0 && head->thread < THREADS);
if (head->count != next[head->thread]) {
printf("\nThread %d missing %d found %d", head->thread, next[head->thread], head->count);
++errors;
}
/* We saw ->count for ->thread, note it down in the array until we see the same thread again */
next[head->thread] = head->count+1;
head = head->next;
}
/* Clean up and report result */
apr_pool_destroy(pool);
if (errors == 0) {
printf(" No errors!\n");
} else {
printf("\n");
}
return errors;
}
Makefile
Description: Binary data
