On 7/10/25 14:48, Dominik Csapak wrote:
[snip]
Just for the record, I also benchmarked a slower system here:
6x 16 TiB spinners in RAID-10 with NVMe special devices, over a 2.5G link:

current approach: ~61 MiB/s restore speed
with my patch: ~160 MiB/s restore speed, with not much increase in CPU time
(both were under 30% of a single core)
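For context, the speedup comes from fetching multiple chunks concurrently instead of strictly one after the other. A minimal sketch of that idea with the futures crate (not the actual patch; fetch_chunk, write_chunk and PARALLEL_FETCHES are placeholder names, anyhow::Error is just a stand-in error type) could look like this:

use anyhow::Error;
use futures::stream::{self, StreamExt, TryStreamExt};

// hypothetical concurrency knob, e.g. 16 as in the 16-way parallel test below
const PARALLEL_FETCHES: usize = 16;

// placeholder for the per-chunk download/decode step
async fn fetch_chunk(digest: [u8; 32]) -> Result<Vec<u8>, Error> {
    unimplemented!()
}

// placeholder for writing the decoded data to the restore target
async fn write_chunk(data: Vec<u8>) -> Result<(), Error> {
    unimplemented!()
}

// fetch up to PARALLEL_FETCHES chunks at a time; buffered() keeps the
// original order, so the writes still happen sequentially
async fn restore_chunks(digests: Vec<[u8; 32]>) -> Result<(), Error> {
    stream::iter(digests)
        .map(fetch_chunk)
        .buffered(PARALLEL_FETCHES)
        .try_for_each(write_chunk)
        .await
}

The extra futures created per chunk are what the perf stat numbers below try to quantify.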
I also ran perf stat on both restores to compare how much overhead the additional futures/async/await bring:
first restore:
          62,871.24 msec task-clock                       #    0.115 CPUs utilized
            878,151      context-switches                 #   13.967 K/sec
             28,205      cpu-migrations                   #  448.615 /sec
            519,396      page-faults                      #    8.261 K/sec
    277,239,999,474      cpu_core/cycles/                 #    4.410 G/sec                 (89.20%)
    190,782,860,504      cpu_atom/cycles/                 #    3.035 G/sec                 (10.80%)
    482,534,267,606      cpu_core/instructions/           #    7.675 G/sec                 (89.20%)
    188,659,352,613      cpu_atom/instructions/           #    3.001 G/sec                 (10.80%)
     46,913,925,346      cpu_core/branches/               #  746.191 M/sec                 (89.20%)
     19,251,496,445      cpu_atom/branches/               #  306.205 M/sec                 (10.80%)
        904,032,529      cpu_core/branch-misses/          #   14.379 M/sec                 (89.20%)
        621,228,739      cpu_atom/branch-misses/          #    9.881 M/sec                 (10.80%)
  1,633,142,624,469      cpu_core/slots/                  #   25.976 G/sec                 (89.20%)
    489,311,603,992      cpu_core/topdown-retiring/       #     29.7% Retiring             (89.20%)
     97,617,585,755      cpu_core/topdown-bad-spec/       #      5.9% Bad Speculation      (89.20%)
    317,074,236,582      cpu_core/topdown-fe-bound/       #     19.2% Frontend Bound       (89.20%)
    745,485,954,022      cpu_core/topdown-be-bound/       #     45.2% Backend Bound        (89.20%)
     57,463,995,650      cpu_core/topdown-heavy-ops/      #      3.5% Heavy Operations     #  26.2% Light Operations   (89.20%)
     88,333,173,745      cpu_core/topdown-br-mispredict/  #      5.4% Branch Mispredict    #   0.6% Machine Clears     (89.20%)
    217,424,427,912      cpu_core/topdown-fetch-lat/      #     13.2% Fetch Latency        #   6.0% Fetch Bandwidth    (89.20%)
    354,250,103,398      cpu_core/topdown-mem-bound/      #     21.5% Memory Bound         #  23.7% Core Bound         (89.20%)

      548.195368256 seconds time elapsed

       44.493218000 seconds user
       21.315124000 seconds sys
second restore:
          67,908.11 msec task-clock                       #    0.297 CPUs utilized
            856,402      context-switches                 #   12.611 K/sec
             46,539      cpu-migrations                   #  685.323 /sec
            942,002      page-faults                      #   13.872 K/sec
    300,757,558,837      cpu_core/cycles/                 #    4.429 G/sec                 (75.93%)
    234,595,451,063      cpu_atom/cycles/                 #    3.455 G/sec                 (24.07%)
    511,747,593,432      cpu_core/instructions/           #    7.536 G/sec                 (75.93%)
    289,348,171,298      cpu_atom/instructions/           #    4.261 G/sec                 (24.07%)
     49,993,266,992      cpu_core/branches/               #  736.190 M/sec                 (75.93%)
     29,624,743,216      cpu_atom/branches/               #  436.248 M/sec                 (24.07%)
        911,770,988      cpu_core/branch-misses/          #   13.427 M/sec                 (75.93%)
        811,321,806      cpu_atom/branch-misses/          #   11.947 M/sec                 (24.07%)
  1,788,660,631,633      cpu_core/slots/                  #   26.339 G/sec                 (75.93%)
    569,029,214,725      cpu_core/topdown-retiring/       #     31.4% Retiring             (75.93%)
    125,815,987,213      cpu_core/topdown-bad-spec/       #      6.9% Bad Speculation      (75.93%)
    234,249,755,030      cpu_core/topdown-fe-bound/       #     12.9% Frontend Bound       (75.93%)
    885,539,445,254      cpu_core/topdown-be-bound/       #     48.8% Backend Bound        (75.93%)
     86,825,030,719      cpu_core/topdown-heavy-ops/      #      4.8% Heavy Operations     #  26.6% Light Operations   (75.93%)
    116,566,866,551      cpu_core/topdown-br-mispredict/  #      6.4% Branch Mispredict    #   0.5% Machine Clears     (75.93%)
    135,276,276,904      cpu_core/topdown-fetch-lat/      #      7.5% Fetch Latency        #   5.5% Fetch Bandwidth    (75.93%)
    409,898,741,185      cpu_core/topdown-mem-bound/      #     22.6% Memory Bound         #  26.2% Core Bound         (75.93%)

      228.528573197 seconds time elapsed

       48.379229000 seconds user
       21.779166000 seconds sys
So the overhead for the additional futures was ~8% in cycles and ~6% in instructions, which does not seem too bad.
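(For reference, assuming the comparison is over the cpu_core counters above: 300,757,558,837 / 277,239,999,474 ≈ 1.085 for cycles, i.e. ~8% more, and 511,747,593,432 / 482,534,267,606 ≈ 1.061 for instructions, i.e. ~6% more.)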
addendum:
the tests above sadly ran into a network limit of ~600 Mbit/s (I'm still
trying to figure out where the bottleneck in the network is...)
I tested again from a different machine that has a 10G link to the PBS mentioned
above. This time I restored to QEMU's 'null-co' block driver, since the target
storage was too slow...
Anyway, the results are:
current code: ~75 MiB/s restore speed
16-way parallel: ~528 MiB/s (7x!)
CPU usage went up from <50% of one core to ~350% (as in my initial tests with
a different setup)
perf stat output below:
current:
         183,534.85 msec task-clock                #    0.409 CPUs utilized
            117,267      context-switches          #  638.936 /sec
                700      cpu-migrations            #    3.814 /sec
            462,432      page-faults               #    2.520 K/sec
    468,609,612,840      cycles                    #    2.553 GHz
  1,286,188,699,253      instructions              #    2.74  insn per cycle
     41,342,312,275      branches                  #  225.256 M/sec
        846,432,249      branch-misses             #    2.05% of all branches

      448.965517535 seconds time elapsed

      152.007611000 seconds user
       32.189942000 seconds sys
16-way parallel:
         228,583.26 msec task-clock                #    3.545 CPUs utilized
            114,575      context-switches          #  501.240 /sec
              6,028      cpu-migrations            #   26.371 /sec
          1,561,179      page-faults               #    6.830 K/sec
    510,861,534,387      cycles                    #    2.235 GHz
  1,296,819,542,686      instructions              #    2.54  insn per cycle
     43,202,234,699      branches                  #  189.000 M/sec
        828,196,795      branch-misses             #    1.92% of all branches

       64.482868654 seconds time elapsed

      184.172759000 seconds user
       44.560342000 seconds sys
So still about ~8% more cycles and roughly the same number of instructions, but in much less time.