On 8 October 2018 at 10:12, Peter Maydell <peter.mayd...@linaro.org> wrote:
> I looked back at the backtrace/etc that I posted earlier in this
> thread, and it looked to me like maybe a memory corruption issue.
> So I tried running the test under valgrind on Linux, and:
...which goes away if I do a complete build from clean, so presumably
is the result of a stale .o file?

The OSX version I'm running doesn't support valgrind, but the
C compiler does have the clang sanitizers. Here's a log from a
build with -fsanitize=address -fsanitize=undefined of commit
df51a005192ee40b:

$ ./tests/test-bdrv-drain
/bdrv-drain/nested: ==60415==WARNING: ASan is ignoring requested __asan_handle_no_return: stack top: 0x7ffee500e000; bottom 0x00010fa0d000; size: 0x7ffdd5601000 (140728183296000)
False positive error reports may follow
For details see https://github.com/google/sanitizers/issues/189
OK
/bdrv-drain/multiparent: OK
/bdrv-drain/driver-cb/drain_all: OK
/bdrv-drain/driver-cb/drain: OK
/bdrv-drain/driver-cb/drain_subtree: OK
/bdrv-drain/driver-cb/co/drain_all: OK
/bdrv-drain/driver-cb/co/drain: OK
/bdrv-drain/driver-cb/co/drain_subtree: OK
/bdrv-drain/quiesce/drain_all: OK
/bdrv-drain/quiesce/drain: OK
/bdrv-drain/quiesce/drain_subtree: OK
/bdrv-drain/quiesce/co/drain_all: OK
/bdrv-drain/quiesce/co/drain: OK
/bdrv-drain/quiesce/co/drain_subtree: OK
/bdrv-drain/graph-change/drain_subtree: OK
/bdrv-drain/graph-change/drain_all: OK
/bdrv-drain/iothread/drain_all: =================================================================
==60415==ERROR: AddressSanitizer: heap-use-after-free on address 0x60d000010060 at pc 0x00010b329270 bp 0x7000036c9d10 sp 0x7000036c9d08
READ of size 8 at 0x60d000010060 thread T3
    #0 0x10b32926f in notifier_list_notify notify.c:39
    #1 0x10b2b8622 in qemu_thread_atexit_run qemu-thread-posix.c:473
    #2 0x7fff5a0e1162 in _pthread_tsd_cleanup (libsystem_pthread.dylib:x86_64+0x5162)
    #3 0x7fff5a0e0ee8 in _pthread_exit (libsystem_pthread.dylib:x86_64+0x4ee8)
    #4 0x7fff5a0df66b in _pthread_body (libsystem_pthread.dylib:x86_64+0x366b)
    #5 0x7fff5a0df50c in _pthread_start (libsystem_pthread.dylib:x86_64+0x350c)
    #6 0x7fff5a0debf8 in thread_start (libsystem_pthread.dylib:x86_64+0x2bf8)

0x60d000010060 is located 48 bytes inside of 144-byte region [0x60d000010030,0x60d0000100c0)
freed by thread T3 here:
    #0 0x10bcc51bd in wrap_free (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x551bd)
    #1 0x7fff5a0e1162 in _pthread_tsd_cleanup (libsystem_pthread.dylib:x86_64+0x5162)
    #2 0x7fff5a0e0ee8 in _pthread_exit (libsystem_pthread.dylib:x86_64+0x4ee8)
    #3 0x7fff5a0df66b in _pthread_body (libsystem_pthread.dylib:x86_64+0x366b)
    #4 0x7fff5a0df50c in _pthread_start (libsystem_pthread.dylib:x86_64+0x350c)
    #5 0x7fff5a0debf8 in thread_start (libsystem_pthread.dylib:x86_64+0x2bf8)

previously allocated by thread T3 here:
    #0 0x10bcc5003 in wrap_malloc (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x55003)
    #1 0x7fff59dc9969 in tlv_allocate_and_initialize_for_key (libdyld.dylib:x86_64+0x3969)
    #2 0x7fff59dca0eb in tlv_get_addr (libdyld.dylib:x86_64+0x40eb)
    #3 0x10b3558d6 in rcu_register_thread rcu.c:301
    #4 0x10b131cb7 in iothread_run iothread.c:42
    #5 0x10b2b8eff in qemu_thread_start qemu-thread-posix.c:504
    #6 0x7fff5a0df660 in _pthread_body (libsystem_pthread.dylib:x86_64+0x3660)
    #7 0x7fff5a0df50c in _pthread_start (libsystem_pthread.dylib:x86_64+0x350c)
    #8 0x7fff5a0debf8 in thread_start (libsystem_pthread.dylib:x86_64+0x2bf8)

Thread T3 created by T0 here:
    #0 0x10bcbd00d in wrap_pthread_create (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x4d00d)
    #1 0x10b2b8bb5 in qemu_thread_create qemu-thread-posix.c:534
    #2 0x10b131720 in iothread_new iothread.c:75
    #3 0x10ac04edc in test_iothread_common test-bdrv-drain.c:668
    #4 0x10abff44e in test_iothread_drain_all test-bdrv-drain.c:768
    #5 0x10ba45b2b in g_test_run_suite_internal (libglib-2.0.0.dylib:x86_64+0x4fb2b)
    #6 0x10ba45cec in g_test_run_suite_internal (libglib-2.0.0.dylib:x86_64+0x4fcec)
    #7 0x10ba45cec in g_test_run_suite_internal (libglib-2.0.0.dylib:x86_64+0x4fcec)
    #8 0x10ba450fb in g_test_run_suite (libglib-2.0.0.dylib:x86_64+0x4f0fb)
    #9 0x10ba4504e in g_test_run (libglib-2.0.0.dylib:x86_64+0x4f04e)
    #10 0x10abf4515 in main test-bdrv-drain.c:1606
    #11 0x7fff59dc7014 in start (libdyld.dylib:x86_64+0x1014)

SUMMARY: AddressSanitizer: heap-use-after-free notify.c:39 in notifier_list_notify
Shadow bytes around the buggy address:
  0x1c1a00001fb0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c1a00001fc0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c1a00001fd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c1a00001fe0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c1a00001ff0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x1c1a00002000: fa fa fa fa fa fa fd fd fd fd fd fd[fd]fd fd fd
  0x1c1a00002010: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
  0x1c1a00002020: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c1a00002030: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c1a00002040: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c1a00002050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==60415==ABORTING
Illegal instruction: 4

Looking at the backtraces I'm wondering if this is the result of an
implicit reliance on the order in which per-thread destructors are
called (which is left unspecified by POSIX) -- the destructor function
qemu_thread_atexit_run() is called after some other destructor, but
accesses its memory. Specifically, the memory it's trying to read
looks like the __thread local variable pollfds_cleanup_notifier in
util/aio-posix.c. So I think what is happening is:

 * util/aio-posix.c calls qemu_thread_atexit_add(), passing it a
   pointer to a thread-local variable pollfds_cleanup_notifier
 * qemu_thread_atexit_add() works by arranging to run the notifiers
   when its 'exit_key' variable's destructor is called
 * the destructor for pollfds_cleanup_notifier runs before that
   for exit_key, and so the qemu_thread_atexit_run() function ends
   up touching freed memory

(There's a minimal standalone demonstration of this ordering hazard
at the end of this mail.)

I'm pretty confident this analysis of the problem is correct:
unfortunately I have no idea what the right way to fix it is...

thanks
-- PMM
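
P.S. Here's the standalone demonstration I mentioned. This is toy
code written purely for illustration, not QEMU's actual
implementation, and all the names in it are made up: it just has the
same shape as the bug above. Two pthread TSD destructors, where one
frees a block that the other still wants to read, and POSIX leaves
the order they run in unspecified. tls_block_destructor() plays the
role of the OSX runtime freeing the dynamically allocated __thread
block that holds pollfds_cleanup_notifier, and atexit_run() plays the
role of qemu_thread_atexit_run() walking the notifier list:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for QEMU's Notifier */
typedef struct Notifier Notifier;
struct Notifier {
    void (*notify)(Notifier *n);
    Notifier *next;
};

static pthread_key_t tls_block_key;  /* stands in for the __thread block */
static pthread_key_t exit_key;       /* stands in for QEMU's exit_key */

static void tls_block_destructor(void *p)
{
    /* Like _pthread_tsd_cleanup() releasing the thread's __thread
     * storage: the block containing the notifier goes away here. */
    free(p);
}

static void atexit_run(void *head)
{
    /* Like qemu_thread_atexit_run(): walk the notifier list. If
     * tls_block_destructor() already ran, 'n' points at freed memory. */
    for (Notifier *n = head; n; n = n->next) {
        n->notify(n);
    }
}

static void say_hello(Notifier *n)
{
    printf("notifier %p ran\n", (void *)n);
}

static void *thread_fn(void *arg)
{
    (void)arg;
    /* The notifier lives inside the block owned by tls_block_key,
     * just as pollfds_cleanup_notifier lives inside the thread's
     * dynamically allocated __thread area. */
    Notifier *n = malloc(sizeof(*n));
    n->notify = say_hello;
    n->next = NULL;
    pthread_setspecific(tls_block_key, n);
    /* Like qemu_thread_atexit_add(): register it on the exit list */
    pthread_setspecific(exit_key, n);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_key_create(&tls_block_key, tls_block_destructor);
    pthread_key_create(&exit_key, atexit_run);
    pthread_create(&t, NULL, thread_fn, NULL);
    pthread_join(t, NULL);
    return 0;
}

Build it with something like "cc -g -fsanitize=address demo.c
-lpthread": whether you see a heap-use-after-free report depends
entirely on which order the implementation happens to pick for the
two destructors, which is exactly the implicit assumption I think
we're making in the real code.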