> >> >> Hi, > >> >> > >> >> I noticed that bdrv_drain_all is performed in load_vmstate before > >> bdrv_snapshot_goto, >> >> >> and bdrv_drain_all is performed in qmp_transaction before >> >> internal_snapshot_prepare, >> >> >> so is it also neccesary to perform bdrv_drain_all in savevm and delvm? >> >> > >> >> >Definitely yes for savevm. do_savevm() calls it indirectly via >> >> >vm_stop(), so that part looks okay. >> >> > >> >> Yes, you are right. >> >> >> >> >delvm doesn't affect the currently running VM, and therefore doesn't >> >> >interfere with guest requests that are in flight. So I think that a >> >> >bdrv_drain_all() isn't needed there. >> >> > >> >> I'm worried about that there are still pending IOs while deleting >> >> snapshot, >> >> then is it possible that there is concurrency problem between the >> >> process of deleting snapshot >> >> and the coroutine of io read/write(bdrv_co_do_rw) invoked by the >> >> pending IOs? >> >> This coroutine is also in main thread. >> >> Am I missing something? >> > >> >What kind of concurrency problem are you thinking of? >> > >> I have encountered two problem, >> 1) double-free of Qcow2DiscardRegion in qcow2_process_discards >> please see the discussing mail: [PATCH] qcow2: fix double-free of >> Qcow2DiscardRegion in qcow2_process_discards >> 2) qcow2 image is truncated to very huge size, but the size shown by >> qemu-img info is normal >> please see the discussing mail: >> [question] is it possible that big-endian l1 table offset referenced by >> other I/O while updating l1 table offset in qcow2_update_snapshot_refcount? > >Did you verify that the invalid value actually makes sense if >byteswapped? For example, that there is no reserved bit set then? > Yes, exactly, I have verified that those l2 table offset are invalid value if byte-swapped.
>> I suspect that both of the two problems are concurrency problem mentioned >> above, >> please see the discussing mail. >> >> >> >I do see that there might be a chance of concurrency, but that doesn't >> >automatically mean the requests are conflicting. >> > >> >Would you feel better with taking s->lock in qcow2_snapshot_delete()? >> Both deleting snapshot and the coroutine of pending io >> read/write(bdrv_co_do_rw) >> are performed in main thread, could BDRVQcowState.lock work? > >Yes. s->lock is not a mutex for threads, but a coroutine based one. > Yes, you are right. >The problem, however, is that qcow2_snapshot_delete() isn't execute in a >coroutine, so we can't take s->lock here. We would either need to move >it into a coroutine or add a bdrv_drain_all() indeed. > I'm inclined to add bdrv_drain_all(), just keeping consistent with the other snapshot-related operations, like savevm, loadvm, internal_snapshot_prepare, etc. Thanks, Zhang Haoyu >This also means that we probably need to review all other cases where >non-coroutine callbacks from BlockDriver might interfere with running >requests. The original assumption that they are safe as long as they are >not running in a coroutine seems to be wrong. Agreed. > >Kevin