> On 18 Apr 2025, at 17:50, Philippe Mathieu-Daudé <phi...@linaro.org> wrote:
>
> Hi Nir,
>
> On 18/4/25 16:24, Nir Soffer wrote:
>> Testing with qemu-nbd shows that computing a hash of an image via
>> qemu-nbd is 5-7 times faster with this change.
>> Tested with 2 qemu-nbd processes:
>> $ ./qemu-nbd-after -r -t -e 0 -f raw -k /tmp/after.sock /var/tmp/bench/data-10g.img &
>> $ ./qemu-nbd-before -r -t -e 0 -f raw -k /tmp/before.sock /var/tmp/bench/data-10g.img &
>> With nbdcopy, using 4 NBD connections:
>> $ hyperfine -w 3 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:" \
>>                  "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null:"
>> Benchmark 1: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:
>>   Time (mean ± σ):     8.670 s ±  0.025 s    [User: 5.670 s, System: 7.113 s]
>>   Range (min … max):   8.620 s …  8.703 s    10 runs
>> Benchmark 2: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null:
>>   Time (mean ± σ):     1.839 s ±  0.008 s    [User: 4.651 s, System: 1.882 s]
>>   Range (min … max):   1.830 s …  1.853 s    10 runs
>> Summary
>>   ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null: ran
>>     4.72 ± 0.02 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' null:
>> With blksum, using one NBD connection:
>> $ hyperfine -w 3 "blksum 'nbd+unix:///?socket=/tmp/before.sock'" \
>> "blksum 'nbd+unix:///?socket=/tmp/after.sock'"
>> Benchmark 1: blksum 'nbd+unix:///?socket=/tmp/before.sock'
>>   Time (mean ± σ):     13.606 s ±  0.081 s    [User: 5.799 s, System: 6.231 s]
>>   Range (min … max):   13.516 s … 13.785 s    10 runs
>> Benchmark 2: blksum 'nbd+unix:///?socket=/tmp/after.sock'
>>   Time (mean ± σ):     1.946 s ±  0.017 s    [User: 4.541 s, System: 1.481 s]
>>   Range (min … max):   1.912 s …  1.979 s    10 runs
>> Summary
>>   blksum 'nbd+unix:///?socket=/tmp/after.sock' ran
>>     6.99 ± 0.07 times faster than blksum 'nbd+unix:///?socket=/tmp/before.sock'
>> This will improve other uses of unix domain sockets on macOS, but I tested
>> only qemu-nbd.
>> Signed-off-by: Nir Soffer <nir...@gmail.com>
>> ---
>> io/channel-socket.c | 13 +++++++++++++
>> 1 file changed, 13 insertions(+)
>> diff --git a/io/channel-socket.c b/io/channel-socket.c
>> index 608bcf066e..b858659764 100644
>> --- a/io/channel-socket.c
>> +++ b/io/channel-socket.c
>> @@ -410,6 +410,19 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
>> }
>> #endif /* WIN32 */
>> +#if __APPLE__
>> + /* On macOS we need to tune unix domain socket buffer for best performance.
>> + * Apple recommends sizing the receive buffer at 4 times the size of the
>> + * send buffer.
>> + */
>> + if (cioc->localAddr.ss_family == AF_UNIX) {
>> + const int sndbuf_size = 1024 * 1024;
>
> Please add a definition instead of magic value, i.e.:
>
> #define SOCKET_SEND_BUFSIZE (1 * MiB)
Using 1 * MiB is nicer.
Not sure about the “magic” value; do you mean:
#define SOCKET_SEND_BUFSIZE (1 * MiB)
At the top of the file, or near the definition?
const int sndbuf_size = 1 * MiB;
If we want it at the top of the file, the name may be confusing since this is
used only on macOS and only for unix sockets.
We can have:
#define MACOS_UNIX_SOCKET_SEND_BUFSIZE (1 * MiB)
Or maybe:
#if __APPLE__
#define UNIX_SOCKET_SEND_BUFSIZE (1 * MiB)
#endif
But we use this in one function, so I'm not sure it helps.
In vmnet-helper I use this in two places, so it moved to config.h:
https://github.com/nirs/vmnet-helper/blob/main/config.h.in
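For example, an (untested) sketch of how it could look with the define at the
top of the file -- assuming MiB comes from "qemu/units.h", which
io/channel-socket.c may need to include, and with UNIX_SOCKET_RECV_BUFSIZE
added only for illustration:

#ifdef __APPLE__
/*
 * On macOS we need to tune the unix domain socket buffers for best
 * performance. Apple recommends sizing the receive buffer at 4 times
 * the size of the send buffer.
 */
#define UNIX_SOCKET_SEND_BUFSIZE (1 * MiB)
#define UNIX_SOCKET_RECV_BUFSIZE (4 * UNIX_SOCKET_SEND_BUFSIZE)
#endif

and in qio_channel_socket_accept():

#ifdef __APPLE__
    if (cioc->localAddr.ss_family == AF_UNIX) {
        const int sndbuf_size = UNIX_SOCKET_SEND_BUFSIZE;
        const int rcvbuf_size = UNIX_SOCKET_RECV_BUFSIZE;

        setsockopt(cioc->fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_size,
                   sizeof(sndbuf_size));
        setsockopt(cioc->fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_size,
                   sizeof(rcvbuf_size));
    }
#endif /* __APPLE__ */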
>
> BTW in test_io_channel_set_socket_bufs() we use 64 KiB, why 1 MiB?
This test uses a small buffer size so we can see the effect of partial
reads/writes. I'm trying to improve throughput when reading image data with
qemu-nbd. This will likely also improve qemu-storage-daemon and the QEMU
built-in NBD server, but I did not test them.
I did some benchmarks with send buffer sizes from 64k to 2m, and they show
that 1m gives the best performance.
Running one qemu-nbd process with each configuration:
% ps
...
18850 ttys013    2:01.78 ./qemu-nbd-64k -r -t -e 0 -f raw -k /tmp/64k.sock /Users/nir/bench/data-10g.img
18871 ttys013    1:53.49 ./qemu-nbd-128k -r -t -e 0 -f raw -k /tmp/128k.sock /Users/nir/bench/data-10g.img
18877 ttys013    1:47.95 ./qemu-nbd-256k -r -t -e 0 -f raw -k /tmp/256k.sock /Users/nir/bench/data-10g.img
18885 ttys013    1:52.06 ./qemu-nbd-512k -r -t -e 0 -f raw -k /tmp/512k.sock /Users/nir/bench/data-10g.img
18894 ttys013    2:02.34 ./qemu-nbd-1m -r -t -e 0 -f raw -k /tmp/1m.sock /Users/nir/bench/data-10g.img
22918 ttys013    0:00.02 ./qemu-nbd-2m -r -t -e 0 -f raw -k /tmp/2m.sock /Users/nir/bench/data-10g.img
% hyperfine -w 3 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/64k.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/128k.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/256k.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/512k.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/1m.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/2m.sock' null:"
Benchmark 1: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/64k.sock' null:
Time (mean ± σ): 2.760 s ± 0.014 s [User: 4.871 s, System: 2.576 s]
Range (min … max): 2.736 s … 2.788 s 10 runs
Benchmark 2: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/128k.sock' null:
Time (mean ± σ): 2.284 s ± 0.006 s [User: 4.774 s, System: 2.044 s]
Range (min … max): 2.275 s … 2.294 s 10 runs
Benchmark 3: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/256k.sock' null:
Time (mean ± σ): 2.036 s ± 0.010 s [User: 4.734 s, System: 1.822 s]
Range (min … max): 2.021 s … 2.052 s 10 runs
Benchmark 4: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/512k.sock' null:
Time (mean ± σ): 1.763 s ± 0.005 s [User: 4.637 s, System: 1.801 s]
Range (min … max): 1.755 s … 1.771 s 10 runs
Benchmark 5: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/1m.sock' null:
Time (mean ± σ): 1.653 s ± 0.012 s [User: 4.568 s, System: 1.818 s]
Range (min … max): 1.636 s … 1.683 s 10 runs
Benchmark 6: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/2m.sock' null:
Time (mean ± σ): 1.802 s ± 0.052 s [User: 4.573 s, System: 1.918 s]
Range (min … max): 1.736 s … 1.896 s 10 runs
Summary
  ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/1m.sock' null: ran
    1.07 ± 0.01 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/512k.sock' null:
    1.09 ± 0.03 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/2m.sock' null:
    1.23 ± 0.01 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/256k.sock' null:
    1.38 ± 0.01 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/128k.sock' null:
    1.67 ± 0.02 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/64k.sock' null:
I can add a compact table showing the results in a comment (for example, see
below), or add the test output to the commit message for reference.
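Something like this (numbers taken from the runs above; the layout is just a
sketch):

    /*
     * Hashing a 10g raw image with nbdcopy --blkhash over a unix socket on
     * macOS, for different send buffer sizes:
     *
     *   sndbuf    time (mean ± σ)
     *   64k       2.760 s ± 0.014 s
     *   128k      2.284 s ± 0.006 s
     *   256k      2.036 s ± 0.010 s
     *   512k      1.763 s ± 0.005 s
     *   1m        1.653 s ± 0.012 s
     *   2m        1.802 s ± 0.052 s
     */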
>
>> + const int rcvbuf_size = 4 * sndbuf_size;
>> + setsockopt(cioc->fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_size, sizeof(sndbuf_size));
>> + setsockopt(cioc->fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_size, sizeof(rcvbuf_size));
>> + }
>> +#endif /* __APPLE__ */
>
> Thanks,
>
> Phil.