On Jun  4 14:07, Fiona Ebner wrote:
> The bdrv_refresh_limits() function and driver implementations are
> called with the graph lock held. The implementation for the 'compress'
> filter calls bdrv_get_info(), which is a generated coroutine wrapper
> and thus polls. This can lead to a deadlock when issuing a
> blockdev-snapshot QMP command, when bdrv_refresh_limits() is called in
> bdrv_append() while the graph lock is held exclusively. This deadlock
> was introduced with commit 5661a00d40 ("block: Call transaction
> callbacks with lock held").
> 
> As a solution, this reverts commit 3d47eb0a2a ("block:
> Convert bdrv_get_info() to co_wrapper_mixed"). To do this, it is
> necessary to have callers of bdrv_get_info() take the graph lock
> themselves. None of the driver implementations rely on being run in
> coroutine context and none of the callers rely on the function being
> a coroutine.
> 
> All callers except one either already hold the graph lock or can claim
> the graph lock via bdrv_graph_rdlock_main_loop(). As part of this,
> bdrv_get_default_bitmap_granularity() is annotated with GRAPH_RDLOCK
> and its callers adapted where necessary.
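
I assume the adapted callers follow the usual main loop pattern, i.e.
something like (sketch):

    BlockDriverInfo bdi;
    int ret;

    bdrv_graph_rdlock_main_loop();
    ret = bdrv_get_info(bs, &bdi);
    bdrv_graph_rdunlock_main_loop();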
> 
> The single exception is the caller nvme_ns_init_format(), which can
> run as a callback in an IO thread, but can also be reached via the QOM
> realize handler nvme_ns_realize(). For this caller, a
> bdrv_get_info_unlocked() coroutine wrapper is introduced that must be
> called with the graph unlocked.
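
So, if I read the patch right, the declarations end up along these
lines (sketch; the exact annotations may differ):

    /* Callers must hold the graph lock. */
    int GRAPH_RDLOCK bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi);

    /*
     * Generated wrapper that takes the rdlock itself; must be called
     * with the graph unlocked.
     */
    int co_wrapper_mixed_bdrv_rdlock
    bdrv_get_info_unlocked(BlockDriverState *bs, BlockDriverInfo *bdi);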
> 

> diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
> index 6df2e8e7c5..ee3eabb1aa 100644
> --- a/hw/nvme/ns.c
> +++ b/hw/nvme/ns.c
> @@ -50,7 +50,7 @@ void nvme_ns_init_format(NvmeNamespace *ns)
>  
>      npdg = ns->blkconf.discard_granularity / ns->lbasz;
>  
> -    ret = bdrv_get_info(blk_bs(ns->blkconf.blk), &bdi);
> +    ret = bdrv_get_info_unlocked(blk_bs(ns->blkconf.blk), &bdi);
>      if (ret >= 0 && bdi.cluster_size > ns->blkconf.discard_granularity) {
>          npdg = bdi.cluster_size / ns->lbasz;
>      }

Acked-by: Klaus Jensen <k.jen...@samsung.com>

FWIW, if there is a better way to get the cluster size, I'd very much
like to change what we do in hw/nvme. We need it to infer/compute the
"preferred deallocation granularity" that we report to the host.
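
For context, the value we derive from the cluster size ends up in the
0's based NPDG/NPDA fields of the Identify Namespace data structure,
roughly:

    /* NPDG/NPDA are 0's based, in units of logical blocks */
    id_ns->npda = id_ns->npdg = npdg - 1;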

But maybe this just isn't the correct way to do it, and we shouldn't
try at all and simply not report a preferred deallocation granularity.
It does seem brittle, no?
