On 3/4/25 19:48, Andrey Zhadchenko wrote:


On 3/4/25 12:32, Pavel Tikhomirov wrote:


On 3/4/25 18:51, Andrey Zhadchenko wrote:


On 3/3/25 10:37, Pavel Tikhomirov wrote:
This adds merge_backward "start", "complete" and "cancel" commands. By
that we are able to split single merge_backward into two stages: start
asyncronous merging and completion. That can be usefull for restarting
qemu process while allowing backward merging to run asyncronously in
kernel.

The "start" command runs merging preparations in workqueue work. After
it finishes, the "complete" command can be called to finish the process
and actually replace the top qcow2 with it's lower. The "cancel" command
forces the work to stop and flushes it. In case we are in completion
waiting state already and there is no work running, the "cancel" command
also reverts merging preparations.

Locking:

Data in tgt->backward_merge is protected by tgt->ctl_mutex. The "start"
and "complete" commands are fully under this lock, and the "cancel"
operation takes the lock explicitly and releases it for work flushing.
The work also takes the lock but only when updating tgt->backward_merge
data. For checks, if the work was caneled in the middle, we read the
state without locking as we don't modify the state there, also we would
re-check the state again before exiting the work function under lock.

Now on target suspend we "cancel" currently running backward merge,
previously we were just hanging untill backward merge have been
finished for possibly a long time, cancelling seems cleaner. Though we
don't really expect hypervisor suspending the target in the middle of
backward merge that it by itself started.

https://virtuozzo.atlassian.net/browse/VSTOR-100466
Signed-off-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>

--
v2: Cancel from BACKWARD_MERGE_START state should not try to stop the
work via BACKWARD_MERGE_STOP state, else we will deadlock in this state.
---
  drivers/md/dm-qcow2-cmd.c    | 142 ++++++++++++++++++++++++++++++ +----
  drivers/md/dm-qcow2-target.c |   6 ++
  drivers/md/dm-qcow2.h        |  19 +++++
  3 files changed, 153 insertions(+), 14 deletions(-)

diff --git a/drivers/md/dm-qcow2-cmd.c b/drivers/md/dm-qcow2-cmd.c
index 7b4b0ee68ad9f..04a992f3ebba6 100644
--- a/drivers/md/dm-qcow2-cmd.c
+++ b/drivers/md/dm-qcow2-cmd.c
@@ -52,6 +52,8 @@ static void service_qio_endio(struct qcow2_target *tgt, struct qio *qio,
      wake_up(&tgt->service_wq);
  }
+static bool qcow2_backward_merge_should_stop(struct qcow2_target *tgt);
+
  static int qcow2_service_iter(struct qcow2_target *tgt, struct qcow2 *qcow2,
            loff_t end, loff_t step, unsigned int bi_op, u8 qio_flags)
  {
@@ -63,7 +65,7 @@ static int qcow2_service_iter(struct qcow2_target *tgt, struct qcow2 *qcow2,
      WRITE_ONCE(service_status, BLK_STS_OK);
      for (pos = 0; pos < end; pos += step) {
-        if (fatal_signal_pending(current)) {
+        if (qcow2_backward_merge_should_stop(tgt)) {
              ret = -EINTR;
              break;
          }
@@ -161,10 +163,11 @@ static void set_backward_merge_in_process(struct qcow2_target *tgt,
      qcow2_submit_embedded_qios(tgt, &list);
  }
-static int qcow2_merge_backward(struct qcow2_target *tgt)
+static int qcow2_merge_backward_start(struct qcow2_target *tgt)
  {
      struct qcow2 *qcow2 = tgt->top, *lower = qcow2->lower;
-    int ret, ret2;
+
+    lockdep_assert_held(&tgt->ctl_mutex);
      if (!lower)
          return -ENOENT;
@@ -174,6 +177,35 @@ static int qcow2_merge_backward(struct qcow2_target *tgt)
          return -EOPNOTSUPP;
      if (lower->hdr.size < qcow2->hdr.size)
          return -EBADSLT;
+
+    if (tgt->backward_merge.state != BACKWARD_MERGE_STOPPED)
+        return -EBUSY;
+    tgt->backward_merge.state = BACKWARD_MERGE_START;
+    tgt->backward_merge.error = 0;
+
+    schedule_work(&tgt->backward_merge.work);

Does this imply we potentially occupy one of the workers of the global pool for the indefinite amount of time? What if we run as much as nworkers (probably ncpus) merges simultaneously?

System_wq has 1024*NCPU execution contexts:

 > ``@max_active`` determines the maximum number of execution contexts per CPU

 > The maximum limit for ``@max_active`` is 2048 and the default value used
when 0 is specified is 1024.

If we try to run ~1024 works per cpu at the same time we might have a problem, and will need to either swithch to our own work-queue or create explicit kernel thread for each merge.


As flushing system-wide workqueues is now deprecated we are also fine with long running work in system_wq and not in system_long_wq, but we can move it to system_long_wq just to be on the safe side.

  * system_wq is the one used by schedule[_delayed]_work[_on]().
  * Multi-CPU multi-threaded.  There are users which expect relatively
  * short queue flush time.  Don't queue works which can run for too
  * long.

  * system_long_wq is similar to system_wq but may host long running
  * works.  Queue flushing might take relatively long.

What do you think?

In close approximation we can just run on dm-qcow2 workqueue, as each image allocates it's own wq, tgt->wq. And we only run 2 works there, general and flush. However Alexander's dm-ploop rework dropped workqueues in favor of threads, so I do not know if and when we intend to merge it into dm-qcow2.

With given limits 1024 per cpu we probably could ignore constraints on global workqueue. Could you please also give me some links where I can read this? Couldn't find anything myself

Strange, both citations are grepable in both mainstream and vz9 trees:

linux$ git grep "system_wq is the one used by schedule"
include/linux/workqueue.h: * system_wq is the one used by schedule[_delayed]_work[_on]().

https://github.com/torvalds/linux/blob/48a5eed9ad584315c30ed35204510536235ce402/include/linux/workqueue.h#L430

linux$ git grep "determines the maximum number of execution contexts"
Documentation/core-api/workqueue.rst:``@max_active`` determines the maximum number of execution contexts per

https://github.com/torvalds/linux/blob/48a5eed9ad584315c30ed35204510536235ce402/Documentation/core-api/workqueue.rst?plain=1#L242

In VZ9:

kernel-vz9$ git grep "system_wq is the one used by schedule"
include/linux/workqueue.h: * system_wq is the one used by schedule[_delayed]_work[_on]().

kernel-vz9$ git grep "determines the maximum number of execution contexts"
Documentation/core-api/workqueue.rst:``@max_active`` determines the maximum number of execution contexts




+    return 0;
+}
+ALLOW_ERROR_INJECTION(qcow2_merge_backward_start, ERRNO);
+
+void qcow2_merge_backward_work(struct work_struct *work)
+{
+    struct qcow2_target *tgt = container_of(work, struct qcow2_target,
+                        backward_merge.work);
+    struct qcow2 *qcow2, *lower;
+    int ret, ret2;
+
+    mutex_lock(&tgt->ctl_mutex);
+    if (tgt->backward_merge.state != BACKWARD_MERGE_START) {
+        mutex_unlock(&tgt->ctl_mutex);
+        return;
+    }
+    tgt->backward_merge.state = BACKWARD_MERGE_RUN;
+    mutex_unlock(&tgt->ctl_mutex);
+
+    qcow2 = tgt->top;
+    lower = qcow2->lower;
+
      /*
       * Break all COW clus at L1 level. Otherwise, later
       * there would be problems with unusing them:
@@ -183,13 +215,13 @@ static int qcow2_merge_backward(struct qcow2_target *tgt)
      ret = qcow2_break_l1cow(tgt);
      if (ret) {
          QC_ERR(tgt->ti, "Can't break L1 COW");
-        return ret;
+        goto out_err;
      }
      ret = qcow2_set_image_file_features(lower, true);
      if (ret) {
          QC_ERR(tgt->ti, "Can't set dirty bit");
-        return ret;
+        goto out_err;
      }
      set_backward_merge_in_process(tgt, qcow2, true);
@@ -200,22 +232,85 @@ static int qcow2_merge_backward(struct qcow2_target *tgt)
          ret2 = qcow2_set_image_file_features(lower, false);
          if (ret2 < 0)
              QC_ERR(tgt->ti, "Can't unuse lower (%d)", ret2);
-        return ret;
      }
+
+out_err:
+    mutex_lock(&tgt->ctl_mutex);
+    if (ret) {
+        /* Error */
+        tgt->backward_merge.state = BACKWARD_MERGE_STOPPED;
+        tgt->backward_merge.error = ret;
+    } else if (tgt->backward_merge.state == BACKWARD_MERGE_STOP) {
+        /* Merge is canceled */
+        set_backward_merge_in_process(tgt, qcow2, false);
+        tgt->backward_merge.state = BACKWARD_MERGE_STOPPED;
+        tgt->backward_merge.error = -EINTR;
+    } else {
+        /* Finish merge */
+        tgt->backward_merge.state = BACKWARD_MERGE_WAIT_COMPLETION;
+    }
+    mutex_unlock(&tgt->ctl_mutex);
+}
+
+static int qcow2_merge_backward_complete(struct qcow2_target *tgt)
+{
+    struct qcow2 *qcow2 = tgt->top, *lower = qcow2->lower;
+    int ret;
+
+    lockdep_assert_held(&tgt->ctl_mutex);
+
+    if (tgt->backward_merge.state != BACKWARD_MERGE_WAIT_COMPLETION)
+        return -EBUSY;
+
      tgt->top = lower;
      smp_wmb(); /* Pairs with qcow2_ref_inc() */
      qcow2_inflight_ref_switch(tgt); /* Pending qios */
      qcow2_flush_deferred_activity(tgt, qcow2); /* Delayed md pages */
      qcow2->lower = NULL;
-    ret2 = qcow2_set_image_file_features(qcow2, false);
-    if (ret2 < 0)
-        QC_ERR(tgt->ti, "Can't unuse merged img (%d)", ret2);
+    ret = qcow2_set_image_file_features(qcow2, false);
+    if (ret < 0)
+        QC_ERR(tgt->ti, "Can't unuse merged img (%d)", ret);
      qcow2_destroy(qcow2);
+    tgt->backward_merge.state = BACKWARD_MERGE_STOPPED;
+
      return 0;
  }
-ALLOW_ERROR_INJECTION(qcow2_merge_backward, ERRNO);
+ALLOW_ERROR_INJECTION(qcow2_merge_backward_complete, ERRNO);
+
+void qcow2_merge_backward_cancel(struct qcow2_target *tgt)
+{
+    bool flush = false;
+
+    mutex_lock(&tgt->ctl_mutex);
+    if (tgt->backward_merge.state == BACKWARD_MERGE_STOPPED) {
+        mutex_unlock(&tgt->ctl_mutex);
+        return;
+    }
+
+    if (tgt->backward_merge.state == BACKWARD_MERGE_START) {
+        tgt->backward_merge.state = BACKWARD_MERGE_STOPPED;
+        flush = true;
+    } else if (tgt->backward_merge.state == BACKWARD_MERGE_RUN) {
+        tgt->backward_merge.state = BACKWARD_MERGE_STOP;
+        flush = true;
+    } else if (tgt->backward_merge.state == BACKWARD_MERGE_STOP) {
+        flush = true;
+    } else if (tgt->backward_merge.state == BACKWARD_MERGE_WAIT_COMPLETION) {
+        set_backward_merge_in_process(tgt, tgt->top, false);
+        tgt->backward_merge.state = BACKWARD_MERGE_STOPPED;
+    }
+    mutex_unlock(&tgt->ctl_mutex);
+
+    if (flush)
+        flush_work(&tgt->backward_merge.work);
+}
+
+static bool qcow2_backward_merge_should_stop(struct qcow2_target *tgt)
+{
+    return READ_ONCE(tgt->backward_merge.state) == BACKWARD_MERGE_STOP;
+}
  static struct qcow2 *qcow2_get_img(struct qcow2_target *tgt, u32 img_id, u8 *ref_index)
  {
@@ -374,11 +469,19 @@ int qcow2_message(struct dm_target *ti, unsigned int argc, char **argv,
          }
          ret = qcow2_get_event(tgt, result, maxlen);
          goto out;
+    } else if (!strcmp(argv[0], "merge_backward")) {
+        if (argc != 2) {
+            ret = -EINVAL;
+            goto out;
+        }
+        if (!strcmp(argv[1], "cancel")) {
+            qcow2_merge_backward_cancel(tgt);
+            ret = 0;
+            goto out;
+        }
      }
-    ret = mutex_lock_killable(&tgt->ctl_mutex);
-    if (ret)
-        goto out;
+    mutex_lock(&tgt->ctl_mutex);
      if (!strcmp(argv[0], "get_errors")) {
          ret = qcow2_get_errors(tgt, result, maxlen);
@@ -388,7 +491,18 @@ int qcow2_message(struct dm_target *ti, unsigned int argc, char **argv,
      } else if (!strcmp(argv[0], "merge_forward")) {
          ret = qcow2_merge_forward(tgt);
      } else if (!strcmp(argv[0], "merge_backward")) {
-        ret = qcow2_merge_backward(tgt);
+        if (argc != 2) {
+            ret = -EINVAL;
+            mutex_unlock(&tgt->ctl_mutex);
+            goto out;
+        }
+        if (!strcmp(argv[1], "start")) {
+            ret = qcow2_merge_backward_start(tgt);
+        } else if (!strcmp(argv[1], "complete")) {
+            ret = qcow2_merge_backward_complete(tgt);
+        } else {
+            ret = -ENOTTY;
+        }
      } else {
          ret = -ENOTTY;
      }
diff --git a/drivers/md/dm-qcow2-target.c b/drivers/md/dm-qcow2- target.c
index 540c03cb3c44f..6e2e583ba0b8b 100644
--- a/drivers/md/dm-qcow2-target.c
+++ b/drivers/md/dm-qcow2-target.c
@@ -25,6 +25,8 @@ static void qcow2_set_service_operations(struct dm_target *ti, bool allowed)
      mutex_lock(&tgt->ctl_mutex);
      tgt->service_operations_allowed = allowed;
      mutex_unlock(&tgt->ctl_mutex);
+    if (!allowed)
+        qcow2_merge_backward_cancel(tgt);
  }
  static void qcow2_set_wants_suspend(struct dm_target *ti, bool wants)
  {
@@ -251,6 +253,7 @@ static void qcow2_tgt_destroy(struct qcow2_target *tgt)
          /* Now kill the queue */
          destroy_workqueue(tgt->wq);
      }
+    qcow2_merge_backward_cancel(tgt);
      mempool_destroy(tgt->qio_pool);
      mempool_destroy(tgt->qrq_pool);
@@ -494,6 +497,9 @@ static struct qcow2_target *alloc_qcow2_target(struct dm_target *ti)
      timer_setup(&tgt->enospc_timer, qcow2_enospc_timer, 0);
      ti->private = tgt;
      tgt->ti = ti;
+
+    INIT_WORK(&tgt->backward_merge.work, qcow2_merge_backward_work);
+
      qcow2_set_service_operations(ti, false);
      return tgt;
diff --git a/drivers/md/dm-qcow2.h b/drivers/md/dm-qcow2.h
index a89fe3db2196d..bebfdc50ed6d4 100644
--- a/drivers/md/dm-qcow2.h
+++ b/drivers/md/dm-qcow2.h
@@ -149,6 +149,20 @@ struct md_page {
      struct list_head wpc_readers_wait_list;
  };
+enum qcow2_backward_merge_state {
+    BACKWARD_MERGE_STOPPED = 0,
+    BACKWARD_MERGE_START,
+    BACKWARD_MERGE_RUN,
+    BACKWARD_MERGE_WAIT_COMPLETION,
+    BACKWARD_MERGE_STOP,
+};
+
+struct qcow2_backward_merge {
+    struct work_struct work;
+    enum qcow2_backward_merge_state state;
+    int error;
+};
+
  struct qcow2_target {
      struct dm_target *ti;
  #define QCOW2_QRQ_POOL_SIZE 512 /* Twice nr_requests from blk_mq_init_sched() */
@@ -180,6 +194,8 @@ struct qcow2_target {
      struct work_struct event_work;
      spinlock_t event_lock;
      struct mutex ctl_mutex;
+
+    struct qcow2_backward_merge backward_merge;
  };
  enum {
@@ -375,6 +391,9 @@ int qcow2_inflight_ref_switch(struct qcow2_target *tgt);   void qcow2_flush_deferred_activity(struct qcow2_target *tgt, struct qcow2 *qcow2);
  int qcow2_truncate_safe(struct file *file, loff_t new_len);
+void qcow2_merge_backward_work(struct work_struct *work);
+void qcow2_merge_backward_cancel(struct qcow2_target *tgt);
+
  static inline struct qcow2_target *to_qcow2_target(struct dm_target *ti)
  {
      return ti->private;




--
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.

_______________________________________________
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel

Reply via email to