by using async futures to load chunks and stream::buffer_unordered to buffer up to 16 of them, depending on write/load speed.
With this, we don't need to increase the number of threads in the runtime to trigger parallel reads and network traffic to us. This way it's only limited by CPU if decoding and/or decrypting is the bottleneck.

I measured restoring a VM backup with about 30 GiB of data on fast storage over a local network link (from a PBS VM to the host). I let it do multiple runs, but the variance was not that big, so here's some representative log output with various MAX_BUFFERED_FUTURES values:

no MAX_BUFFERED_FUTURES: duration=43.18s, speed=758.82MB/s
 4: duration=38.61s, speed=848.77MB/s
 8: duration=33.79s, speed=969.85MB/s
16: duration=31.45s, speed=1042.06MB/s

Note that increasing the number has diminishing returns after 10-12 on my system, but I guess it depends on the exact configuration. (For more than 16 I did not see any improvement, but this is probably just my setup.) I saw an increase in CPU usage (from ~75% to ~100% of one core), which is very likely due to the additional chunks being decoded.

In general I'd like to limit the buffering somehow, but I don't think there is a good automatic metric we can use, and giving the admin a knob whose actual ramifications are hard to explain is also not good, so I settled for a value that showed improvement but does not seem too high. In any case, if the target and/or source storage is too slow, there will be back/forward pressure, and this change should only matter for storage systems where IO depth plays a role.

This patch is loosely based on the patch from Adam Kalisz [0], but removes the need to increase the blocking threads and uses the underlying async implementation (which was always used under the hood anyway) for reading remote chunks.

0: https://lore.proxmox.com/pve-devel/mailman.719.1751052794.395.pve-de...@lists.proxmox.com/

Signed-off-by: Dominik Csapak <d.csa...@proxmox.com>
Based-on-patch-by: Adam Kalisz <adam.kal...@notnullmakers.com>
---
@Adam could you please test this patch too, to see if you still get the improvements you saw with your version?

Also, I sent this as an RFC to discuss how we decide how many chunks we want to buffer / how many threads we want to allocate. This is a non-trivial topic, and as I wrote above we don't have a real metric to decide upfront, but giving the admin knobs that are complicated is also not the best solution. My instinct would be to simply increase to 16 (as I have done here) and maybe expose this number in /etc/vzdump.conf or /etc/pve/datacenter.cfg.

Also, I tried to make the writes multi-threaded too, but my QEMU knowledge is not very deep for this kind of thing, and I wanted to get this version out there soon. (Increasing the write threads can still be done afterwards if this change is not enough on its own.)

I developed this patch on top of the 'stable-bookworm' branch, but it should apply cleanly on master as well.
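For anyone who wants to play with the pattern in isolation, below is a minimal, self-contained sketch (not part of the patch) of the stream::iter + buffer_unordered approach used in the diff. fetch_chunk, the sleep-based latency, CHUNK_SIZE and the counts are made up for illustration, and it assumes tokio, futures and anyhow as dependencies:

// toy demo of pre-fetching with buffer_unordered; names are invented,
// the real code uses AsyncReadChunk and the write callbacks instead
use std::time::Duration;

use anyhow::Error;
use futures::StreamExt;

const MAX_BUFFERED_FUTURES: usize = 16;
const CHUNK_SIZE: usize = 4096;

// stand-in for a remote chunk read: pretend to fetch a chunk over the
// network and return its payload
async fn fetch_chunk(pos: usize) -> Result<Vec<u8>, Error> {
    // simulate network latency that varies a bit per chunk
    tokio::time::sleep(Duration::from_millis(5 + (pos % 10) as u64)).await;
    Ok(vec![0u8; CHUNK_SIZE])
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    let chunk_count = 64;

    // one future per chunk; each carries its position so the consumer can
    // compute the write offset even though completions arrive out of order
    let read_queue = (0..chunk_count).map(|pos| async move {
        let data = fetch_chunk(pos).await?;
        Ok::<_, Error>((pos, data))
    });

    // poll up to MAX_BUFFERED_FUTURES fetches concurrently; results are
    // yielded as soon as they finish, not in submission order
    let mut stream = futures::stream::iter(read_queue).buffer_unordered(MAX_BUFFERED_FUTURES);

    let mut bytes = 0;
    while let Some(res) = stream.next().await {
        let (pos, data) = res?;
        // a real consumer would write `data` at `offset` here
        let offset = pos * CHUNK_SIZE;
        println!("got chunk {} ({} bytes) for offset {}", pos, data.len(), offset);
        bytes += data.len();
    }
    println!("fetched {} bytes total", bytes);
    Ok(())
}

The fetches complete out of order, which is why each future returns its position/offset and only the single consumer loop touches the counters and does the writes.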
 src/restore.rs | 57 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/src/restore.rs b/src/restore.rs
index 5a5a398..741b3e1 100644
--- a/src/restore.rs
+++ b/src/restore.rs
@@ -2,6 +2,7 @@ use std::convert::TryInto;
 use std::sync::{Arc, Mutex};
 
 use anyhow::{bail, format_err, Error};
+use futures::StreamExt;
 use once_cell::sync::OnceCell;
 use tokio::runtime::Runtime;
 
@@ -13,7 +14,7 @@ use pbs_datastore::cached_chunk_reader::CachedChunkReader;
 use pbs_datastore::data_blob::DataChunkBuilder;
 use pbs_datastore::fixed_index::FixedIndexReader;
 use pbs_datastore::index::IndexFile;
-use pbs_datastore::read_chunk::ReadChunk;
+use pbs_datastore::read_chunk::AsyncReadChunk;
 use pbs_datastore::BackupManifest;
 use pbs_key_config::load_and_decrypt_key;
 use pbs_tools::crypt_config::CryptConfig;
@@ -29,6 +30,9 @@ struct ImageAccessInfo {
     archive_size: u64,
 }
 
+// use at max 16 buffered futures to make loading of chunks more concurrent
+const MAX_BUFFERED_FUTURES: usize = 16;
+
 pub(crate) struct RestoreTask {
     setup: BackupSetup,
     runtime: Arc<Runtime>,
@@ -165,24 +169,47 @@ impl RestoreTask {
 
         let start_time = std::time::Instant::now();
 
-        for pos in 0..index.index_count() {
+        let read_queue = (0..index.index_count()).map(|pos| {
             let digest = index.index_digest(pos).unwrap();
             let offset = (pos * index.chunk_size) as u64;
-            if digest == &zero_chunk_digest {
-                let res = write_zero_callback(offset, index.chunk_size as u64);
-                if res < 0 {
-                    bail!("write_zero_callback failed ({})", res);
+            let chunk_reader = chunk_reader.clone();
+            async move {
+                let chunk = if digest == &zero_chunk_digest {
+                    None
+                } else {
+                    let raw_data = AsyncReadChunk::read_chunk(&chunk_reader, digest).await?;
+                    Some(raw_data)
+                };
+
+                Ok::<_, Error>((chunk, pos, offset))
+            }
+        });
+
+        // this buffers futures and pre-fetches some chunks for us
+        let mut stream = futures::stream::iter(read_queue).buffer_unordered(MAX_BUFFERED_FUTURES);
+
+        while let Some(res) = stream.next().await {
+            let res = res?;
+            let pos = match res {
+                (None, pos, offset) => {
+                    let res = write_zero_callback(offset, index.chunk_size as u64);
+                    if res < 0 {
+                        bail!("write_zero_callback failed ({})", res);
+                    }
+                    bytes += index.chunk_size;
+                    zeroes += index.chunk_size;
+                    pos
                 }
-                bytes += index.chunk_size;
-                zeroes += index.chunk_size;
-            } else {
-                let raw_data = ReadChunk::read_chunk(&chunk_reader, digest)?;
-                let res = write_data_callback(offset, &raw_data);
-                if res < 0 {
-                    bail!("write_data_callback failed ({})", res);
+                (Some(raw_data), pos, offset) => {
+                    let res = write_data_callback(offset, &raw_data);
+                    if res < 0 {
+                        bail!("write_data_callback failed ({})", res);
+                    }
+                    bytes += raw_data.len();
+                    pos
                 }
-                bytes += raw_data.len();
-            }
+            };
+
             if verbose {
                 let next_per = ((pos + 1) * 100) / index.index_count();
                 if per != next_per {
-- 
2.39.5

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel