When a highly multi-threaded program such as nbdcopy encounters an error, there is a race condition in the library which can cause an assertion failure and thus a core dump:
(1) An error occurs on one of the threads. nbdcopy calls exit(3). (2) In lib/errors.c, the destructor calls pthread_key_delete. (3) Another thread which is still running also encounters an error, and inside libnbd the library calls set_error(). (4) The call to set_error() calls pthread_getspecific which returns NULL (since the key has already been destroyed in step (2)), and this causes us to call pthread_setspecific which returns EINVAL because glibc is able to detect invalid use of a pthread_key_t after it has been destroyed. In any case, the error message is lost, and any subsequent call to nbd_get_error() will return NULL. (5) We enter the %DEAD state, where there is an assertion that nbd_get_error() is not NULL. This assertion is supposed to be for checking that the code called set_error() before entering the %DEAD state. Although we did call set_error(), the error message was lost at step (4) and nbd_get_error() will always return NULL. This assertion failure causes a crash. There aren't any good ways to fix this. I chose to leak the pthread_key_t on the exit path. --- lib/errors.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/lib/errors.c b/lib/errors.c index 8b77650ef3..6fbfaacd34 100644 --- a/lib/errors.c +++ b/lib/errors.c @@ -69,7 +69,11 @@ errors_key_destroy (void) free (last_error->error); free (last_error); } - pthread_key_delete (errors_key); + + /* We could do this, but that causes a race condition described here: + * https://listman.redhat.com/archives/libguestfs/2023-March/031002.html + */ + //pthread_key_delete (errors_key); } /* This is called when a thread exits, to free the thread-local data -- 2.39.2 _______________________________________________ Libguestfs mailing list Libguestfs@redhat.com https://listman.redhat.com/mailman/listinfo/libguestfs