I also ran some benchmarks with the same script.
I created a VM with two virtual disks, (both on an LVM Thin storage)
installed PVE on one disk and set up an ext4 partition on the other.
I stopped pvestatd and pve-cluster,
```
systemctl stop pvestatd
systemctl stop pve-cluster
```
moved the pmxcfs database file to its own disk
```
mv /var/lib/pve-cluster/config.db /tmp/
mount /dev/sdb1 /var/lib/pve-cluster
mv /tmp/config.db /var/lib/pve-cluster/
```
and started pve-cluster again.
```
systemctl start pve-cluster
```
These are my results before and after applying the patch:
data size written (old) amplification (old) written (new)
amplification (new)
1 48 48 45 45
2 48 24 45 23
4 82 21 80 20
8 121 15 90 11
16 217 14 146 9
32 506 16 314 10
64 1472 23 826 13
128 5585 44 3765 29
256 20424 80 10743 42
512 86715 169 43650 85
1024 369568 361 187496 183
As can be seen, my results are similar with amplification really picking
up at 128k. The patch seems to cut write amplification in half with big
files, while making virtually no difference with files up to 4k.
On 23/09/2024 11:17, Dominik Csapak wrote:
On 9/19/24 16:57, Thomas Lamprecht wrote:
Am 19/09/2024 um 14:45 schrieb Dominik Csapak:
On 9/19/24 14:01, Thomas Lamprecht wrote:
Am 19/09/2024 um 11:52 schrieb Dominik Csapak:
by default libfuse2 limits writes to 4k size, which means that on
writes
bigger than that, we do a whole write cycle for each 4k block that
comes
in. To avoid that, add the option 'big_writes' to allow writes bigger
than 4k at once.
This should improve pmxcfs performance for situations where we often
write large files (e.g. big ha status) and maybe reduce writes to
disk.
Should? Something like before/after for benchmark numbers, flamegraphs
would be really good to have, without those it's rather hard to
discuss
this, and I'd like to avoid having to do those, or check the inner
workings
of the affected fuse userspace/kernel code paths here myself.
well I mean the code change is relatively small and the result is
rather clear:
Well sure the code change is just setting an option... But the actual
change is
abstracted away and would benefit from actually looking into..
in the current case we have the following calls from pmxcfs
(shortened for e-mail)
when writing a single 128k block:
(dd if=... of=/etc/pve/test bs=128k count=1)
Better than nothing but still no actual numbers (reduced time,
reduced write amp
in combination with sqlite, ...), some basic analysis over file/write
size distribution
on a single node and (e.g. three node) cluster, ...
If that's all obvious for you then great, but as already mentioned in
the past, I
want actual data in commit messages for such stuff, and I cannot
really see a downside
of having such numbers.
Again, as is I'm not really seeing what's to discuss, you send it as
RFC after
all.
[...]
so a factor of 32 less calls to cfs_fuse_write (including memdb_pwrite)
That can be huge or not so big at all, i.e. as mentioned above, it
would we good to
measure the impact through some other metrics.
And FWIW, I used bpftrace to count [0] with an unpatched pmxcfs,
there I get
the 32 calls to cfs_fuse_write only for a new file, overwriting the
existing
file again with the same amount of data (128k) just causes a single
call.
I tried using more data (e.g. from 128k initially to 256k or 512k)
and it's
always the data divided by 128k (even if the first file has a
different size)
We do not override existing files often, but rather write to a new
file and
then rename, but still quite interesting and IMO really showing that
just
because this is 1 +-1 line change it doesn't necessarily have to be
trivial
and obvious in its effects.
[0]: bpftrace -e 'u:cfs_fuse_write /str(args->path) == "/test"/ {@ =
count();} END { print(@) }' -p "$(pidof pmxcfs)"
If we'd change to libfuse3, this would be a non-issue, since that
option
got removed and is the default there.
I'd prefer that. At least if done with the future PVE 9.0, as I do
not think
it's a good idea in the middle of a stable release cycle.
why not this change now, and the rewrite to libfuse3 later? that way
we can
have some improvements now too...
Because I want some actual data and reasoning first, even if it's
quite likely
that this improves things Somehow™, I'd like to actually know in what
metrics
and by how much (even if just an upper bound due to the benchmark or
some
measurement being rather artificial).
I mean you name the big HA status, why not measure that for real?
like, probably
among other things, in terms of bytes hitting the block layer, i.e.
the actual
backing disk from those requests as then we'd know for real if this
can reduce
the write load there, not just that it maybe should.
hi,
first i just wanted to say I'm sorry for my snarky comment about not
needing to test
performance for such code. You're right, any insight we can gain there
is good and we (I!) should take the time to do that, even if the
change looks "obvious" like it does here
so i did some benchmarks (mostly disk writes) and wrote the short
script below
(maybe we can reuse that?)
----8<----
use strict;
use warnings;
use PVE::Tools;
my $size = shift;
sub get_bytes_written {
my $fh = IO::File->new("/proc/diskstats", "r");
die if !$fh;
my $bytes = undef;
while (defined(my $line = <$fh>)) {
if ($line =~ m/sdb/) {
my @fields = split(/\s+/, $line);
$bytes = $fields[10] * 512;
}
}
return $bytes;
}
sub test_write {
my ($k) = @_;
system("rm /etc/pve/testfile");
my $data = "a"x($k*1024);
system("sync; echo -n 3> /proc/sys/vm/drop_caches");
my $bytes_before = get_bytes_written();
PVE::Tools::file_set_contents("/etc/pve/testfile", $data);
system("sync; echo -n 3> /proc/sys/vm/drop_caches");
my $bytes_after = get_bytes_written();
return $bytes_after - $bytes_before;
}
$size //= 128;
my $written = test_write($size) / 1024;
print("$written\n");
---->8----
to simulate our real write patterns with varying file sizes
i installed a fresh pve, turned off pvestatd and put
/var/lib/pve-cluster on it's own disk
(/dev/sdb), so diskstats only contains writes from the pmxcfs
the results are below (all sizes are kbytes, ran them multiple times,
but they
seem to be consistent)
data size written (old) amplification (old) written (new)
amplification (new)
1 56 56 56 56
2 72 36 76 38
4 84 21 88 22
8 144 18 104 13
16 236 14 160 10
32 532 16 324 10
64 1496 23 836 13
128 6616 51 3848 30
256 20600 80 10568 41
512 87296 170 43416 84
1024 388460 379 197032 192
for smaller writes there seems to be a minimum overhead of ~50-100 kbytes
for big files with have a massive amplification of > 100x
my patch does seem to make a difference for files >4k
but the biggest surprise here is that the write amplification
is not linear, but increases with an increase in bytes we want to write
so e.g. going from 128k -> 256k file write we don't just write double
the amount,
but 3x to 4x as much.
i also produced a flamegraph according to
https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html, but that
showed virtually no change between versions without
and with my patch (if one has a good svg hoster, i can post them ofc)
so, tl;dr
for small writes does not make that much of a difference, but
we can save ~half the writes for large files in the pmxcfs
(which e.g. a ha-state file could do)
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel