On 7/10/25 14:48, Dominik Csapak wrote:
[snip]
Just for the record, I also benchmarked a slower system here:
6x 16 TiB spinners in RAID-10 with NVMe special devices, over a 2.5G link:

current approach: ~61 MiB/s restore speed
with my patch: ~160 MiB/s restore speed, with not much increase in CPU time
(both were under 30% of a single core)
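For context, the speedup comes from fetching multiple chunks concurrently instead of strictly one after the other. A minimal sketch of that idea with the futures crate (not the actual patch; fetch_chunk, write_chunk and PARALLEL_FETCHES are placeholder names, anyhow::Error is just a stand-in error type) could look like this:

use anyhow::Error;
use futures::stream::{self, StreamExt, TryStreamExt};

// hypothetical concurrency knob, e.g. 16 as in the 16-way parallel test below
const PARALLEL_FETCHES: usize = 16;

// placeholder for the per-chunk download/decode step
async fn fetch_chunk(digest: [u8; 32]) -> Result<Vec<u8>, Error> {
    unimplemented!()
}

// placeholder for writing the decoded data to the restore target
async fn write_chunk(data: Vec<u8>) -> Result<(), Error> {
    unimplemented!()
}

// fetch up to PARALLEL_FETCHES chunks at a time; buffered() keeps the
// original order, so the writes still happen sequentially
async fn restore_chunks(digests: Vec<[u8; 32]>) -> Result<(), Error> {
    stream::iter(digests)
        .map(fetch_chunk)
        .buffered(PARALLEL_FETCHES)
        .try_for_each(write_chunk)
        .await
}

The extra futures created per chunk are what the perf stat numbers below try to quantify.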
I also ran perf stat on both restores to compare how much overhead the additional futures/async/await bring:
first restore:
          62,871.24 msec task-clock                       #    0.115 CPUs utilized
            878,151      context-switches                 #   13.967 K/sec
             28,205      cpu-migrations                   #  448.615 /sec
            519,396      page-faults                      #    8.261 K/sec
    277,239,999,474      cpu_core/cycles/                 #    4.410 G/sec                 (89.20%)
    190,782,860,504      cpu_atom/cycles/                 #    3.035 G/sec                 (10.80%)
    482,534,267,606      cpu_core/instructions/           #    7.675 G/sec                 (89.20%)
    188,659,352,613      cpu_atom/instructions/           #    3.001 G/sec                 (10.80%)
     46,913,925,346      cpu_core/branches/               #  746.191 M/sec                 (89.20%)
     19,251,496,445      cpu_atom/branches/               #  306.205 M/sec                 (10.80%)
        904,032,529      cpu_core/branch-misses/          #   14.379 M/sec                 (89.20%)
        621,228,739      cpu_atom/branch-misses/          #    9.881 M/sec                 (10.80%)
  1,633,142,624,469      cpu_core/slots/                  #   25.976 G/sec                 (89.20%)
    489,311,603,992      cpu_core/topdown-retiring/       #     29.7% Retiring             (89.20%)
     97,617,585,755      cpu_core/topdown-bad-spec/       #      5.9% Bad Speculation      (89.20%)
    317,074,236,582      cpu_core/topdown-fe-bound/       #     19.2% Frontend Bound       (89.20%)
    745,485,954,022      cpu_core/topdown-be-bound/       #     45.2% Backend Bound        (89.20%)
     57,463,995,650      cpu_core/topdown-heavy-ops/      #      3.5% Heavy Operations     #  26.2% Light Operations   (89.20%)
     88,333,173,745      cpu_core/topdown-br-mispredict/  #      5.4% Branch Mispredict    #   0.6% Machine Clears     (89.20%)
    217,424,427,912      cpu_core/topdown-fetch-lat/      #     13.2% Fetch Latency        #   6.0% Fetch Bandwidth    (89.20%)
    354,250,103,398      cpu_core/topdown-mem-bound/      #     21.5% Memory Bound         #  23.7% Core Bound         (89.20%)

      548.195368256 seconds time elapsed

       44.493218000 seconds user
       21.315124000 seconds sys
second restore:
          67,908.11 msec task-clock                       #    0.297 CPUs utilized
            856,402      context-switches                 #   12.611 K/sec
             46,539      cpu-migrations                   #  685.323 /sec
            942,002      page-faults                      #   13.872 K/sec
    300,757,558,837      cpu_core/cycles/                 #    4.429 G/sec                 (75.93%)
    234,595,451,063      cpu_atom/cycles/                 #    3.455 G/sec                 (24.07%)
    511,747,593,432      cpu_core/instructions/           #    7.536 G/sec                 (75.93%)
    289,348,171,298      cpu_atom/instructions/           #    4.261 G/sec                 (24.07%)
     49,993,266,992      cpu_core/branches/               #  736.190 M/sec                 (75.93%)
     29,624,743,216      cpu_atom/branches/               #  436.248 M/sec                 (24.07%)
        911,770,988      cpu_core/branch-misses/          #   13.427 M/sec                 (75.93%)
        811,321,806      cpu_atom/branch-misses/          #   11.947 M/sec                 (24.07%)
  1,788,660,631,633      cpu_core/slots/                  #   26.339 G/sec                 (75.93%)
    569,029,214,725      cpu_core/topdown-retiring/       #     31.4% Retiring             (75.93%)
    125,815,987,213      cpu_core/topdown-bad-spec/       #      6.9% Bad Speculation      (75.93%)
    234,249,755,030      cpu_core/topdown-fe-bound/       #     12.9% Frontend Bound       (75.93%)
    885,539,445,254      cpu_core/topdown-be-bound/       #     48.8% Backend Bound        (75.93%)
     86,825,030,719      cpu_core/topdown-heavy-ops/      #      4.8% Heavy Operations     #  26.6% Light Operations   (75.93%)
    116,566,866,551      cpu_core/topdown-br-mispredict/  #      6.4% Branch Mispredict    #   0.5% Machine Clears     (75.93%)
    135,276,276,904      cpu_core/topdown-fetch-lat/      #      7.5% Fetch Latency        #   5.5% Fetch Bandwidth    (75.93%)
    409,898,741,185      cpu_core/topdown-mem-bound/      #     22.6% Memory Bound         #  26.2% Core Bound         (75.93%)

      228.528573197 seconds time elapsed

       48.379229000 seconds user
       21.779166000 seconds sys
So the overhead for the additional futures was ~8% in cycles and ~6% in instructions, which does not seem too bad.
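(For reference, assuming the comparison is over the cpu_core counters above: 300,757,558,837 / 277,239,999,474 ≈ 1.085 for cycles, i.e. ~8% more, and 511,747,593,432 / 482,534,267,606 ≈ 1.061 for instructions, i.e. ~6% more.)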
addendum:
the tests above sadly ran into a network limit of ~600 Mbit/s (I'm still
trying to figure out where the bottleneck in the network is...)
I tested again from a different machine that has a 10G link to the PBS mentioned
above. This time I restored to QEMU's 'null-co' block driver, since the target
storage was too slow...
Anyway, the results are:
current code: ~75 MiB/s restore speed
16-way parallel: ~528 MiB/s (7x!)
CPU usage went up from <50% of one core to ~350% (as in my initial tests with
a different setup)
perf stat output below:
current:
         183,534.85 msec task-clock                #    0.409 CPUs utilized
            117,267      context-switches          #  638.936 /sec
                700      cpu-migrations            #    3.814 /sec
            462,432      page-faults               #    2.520 K/sec
    468,609,612,840      cycles                    #    2.553 GHz
  1,286,188,699,253      instructions              #    2.74  insn per cycle
     41,342,312,275      branches                  #  225.256 M/sec
        846,432,249      branch-misses             #    2.05% of all branches

      448.965517535 seconds time elapsed

      152.007611000 seconds user
       32.189942000 seconds sys
16-way parallel:
         228,583.26 msec task-clock                #    3.545 CPUs utilized
            114,575      context-switches          #  501.240 /sec
              6,028      cpu-migrations            #   26.371 /sec
          1,561,179      page-faults               #    6.830 K/sec
    510,861,534,387      cycles                    #    2.235 GHz
  1,296,819,542,686      instructions              #    2.54  insn per cycle
     43,202,234,699      branches                  #  189.000 M/sec
        828,196,795      branch-misses             #    1.92% of all branches

       64.482868654 seconds time elapsed

      184.172759000 seconds user
       44.560342000 seconds sys
So still about ~8% more cycles and roughly the same number of instructions, but in much less time.