@Peter Xu @Fabiano Rosas
Kindly ping on this.

On 2023/6/27 9:11, chenyuhui (A) wrote:
> 
> On 2023/6/26 21:16, chenyuhui (A) wrote:
>>
>> On 2023/6/21 22:22, Fabiano Rosas wrote:
>>> Jianguo Zhang via <qemu-devel@nongnu.org> writes:
>>>
>>>> From: Yuhui Chen <chenyuh...@huawei.com>
>>>>
>>>> There is a coredump while trying to destroy a mutex when
>>>> p->running is false but p->mutex is not unlocked.
>>>> Make sure all mutexes have been released before destroying them.
>>>>
>>>> Signed-off-by: Yuhui Chen <chenyuh...@huawei.com>
>>>> ---
>>>>  migration/multifd.c | 6 ++----
>>>>  1 file changed, 2 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>>> index b7ad7002e0..7dcdb2d3a0 100644
>>>> --- a/migration/multifd.c
>>>> +++ b/migration/multifd.c
>>>> @@ -523,9 +523,7 @@ void multifd_save_cleanup(void)
>>>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>>>          MultiFDSendParams *p = &multifd_send_state->params[i];
>>>>  
>>>> -        if (p->running) {
>>>
>>> The need for this flag is dubious IMO. Commit 10351fbad1
>>> ("migration/multifd: Join all multifd threads in order to avoid leaks")
>>> already moved the other join outside of it. If we figure out another way
>>> to deal with the sem_sync lockup we could probably remove this
>>> altogether.
>>
>>
>> I've seen commit 10351fbad1, and it seems to have the same
>> problem in multifd_save_cleanup.
>>
>> So maybe my patch only needs to modify multifd_save_cleanup.
>>
>> __________________________________________________________________
>>
>>
>> On 2023/6/21 21:24, Peter Xu wrote:
>>> On Wed, Jun 21, 2023 at 04:18:26PM +0800, Jianguo Zhang via wrote:
>>>> From: Yuhui Chen<chenyuh...@huawei.com>
>>>>
>>>> There is a coredump while trying to destroy a mutex when
>>>> p->running is false but p->mutex is not unlocked.
>>>> Make sure all mutexes have been released before destroying them.
>>>
>>> It'll be nice to add a backtrace of the coredump here, and also copy
>>> maintainer (Juan Quintela, copied now).
>>>
>>
>> The following is coredump, and my code is base on
>> https://github.com/qemu/qemu.git tag v6.2.0.
>>
> (gdb) bt
> #0  0x0000ffffabe3b2b8 in  () at /usr/lib64/libc.so.6
> #1  0x0000ffffabdf6d7c in raise () at /usr/lib64/libc.so.6
> #2  0x0000ffffabde4d2c in abort () at /usr/lib64/libc.so.6
> #3  0x0000aaaac67fcc10 in error_exit (err=<optimized out>, msg=msg@entry=0xaaaac6dc52b8 <__func__.33> "qemu_mutex_destroy") at ../util/qemu-thread-posix.c:38
> #4  0x0000aaaac67fce38 in qemu_mutex_destroy (mutex=mutex@entry=0xaaaafa1a4250) at ../util/qemu-thread-posix.c:71
> #5  0x0000aaaac6055688 in multifd_save_cleanup () at ../migration/multifd.c:555
> #6  0x0000aaaac6050198 in migrate_fd_cleanup (s=s@entry=0xaaaaf7518800) at ../migration/migration.c:1808
> #7  0x0000aaaac6050384 in migrate_fd_cleanup_bh (opaque=0xaaaaf7518800) at ../migration/migration.c:1850
> #8  0x0000aaaac680d790 in aio_bh_call (bh=0xffffa0004c40) at ../util/async.c:141
> #9  aio_bh_poll (ctx=ctx@entry=0xaaaaf73285a0) at ../util/async.c:169
> #10 0x0000aaaac67f9e18 in aio_dispatch (ctx=0xaaaaf73285a0) at ../util/aio-posix.c:381
> #11 0x0000aaaac680d414 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:311
> #12 0x0000ffffac44cf88 in g_main_context_dispatch () at /usr/lib64/libglib-2.0.so.0
> #13 0x0000aaaac6819214 in glib_pollfds_poll () at ../util/main-loop.c:232
> #14 os_host_main_loop_wait (timeout=735000000) at ../util/main-loop.c:255
> #15 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:531
> #16 0x0000aaaac65005cc in qemu_main_loop () at ../softmmu/runstate.c:726
> #17 0x0000aaaac5fe2030 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:50
> (gdb) q
> 
>> How reproducible:
>> 1. Add sleep calls so that p->running is false while p->mutex is
>>  still locked (apply the following patch).
>> 2. Do migration with --parallel-connections.
>> From: Yuhui Chen <chenyuh...@huawei.com>
>> Date: Mon, 26 Jun 2023 14:24:35 +0800
>> Subject: [DEBUG][PATCH] Add sleep time to produce p->running is false but
>> p->mutex is not unlocked.
>>
>> ---
>>  migration/multifd.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index 7c9deb1921..09a7b0748a 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -538,6 +538,7 @@ void multifd_save_cleanup(void)
>>      for (i = 0; i < migrate_multifd_channels(); i++) {
>>          MultiFDSendParams *p = &multifd_send_state->params[i];
>>
>> +        sleep(2);
>>          if (p->running) {
>>              qemu_thread_join(&p->thread);
>>          }
>> @@ -719,6 +720,7 @@ out:
>>
>>      qemu_mutex_lock(&p->mutex);
>>      p->running = false;
>> +    sleep(20);
>>      qemu_mutex_unlock(&p->mutex);
>>
>>      rcu_unregister_thread();
