> On 18 Apr 2025, at 17:50, Philippe Mathieu-Daudé <phi...@linaro.org> wrote:
> 
> Hi Nir,
> 
> On 18/4/25 16:24, Nir Soffer wrote:
>> Testing with qemu-nbd shows that computing a hash of an image via
>> qemu-nbd is 5-7 times faster with this change.
>> Tested with 2 qemu-nbd processes:
>>     $ ./qemu-nbd-after -r -t -e 0 -f raw -k /tmp/after.sock 
>> /var/tmp/bench/data-10g.img &
>>     $ ./qemu-nbd-before -r -t -e 0 -f raw -k /tmp/before.sock 
>> /var/tmp/bench/data-10g.img &
>> With nbdcopy, using 4 NBD connections:
>>     $ hyperfine -w 3 "./nbdcopy --blkhash 
>> 'nbd+unix:///?socket=/tmp/before.sock' null:"
>>                      "./nbdcopy --blkhash 
>> 'nbd+unix:///?socket=/tmp/after.sock' null:"
>>     Benchmark 1: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/before.sock' 
>> null:
>>       Time (mean ± σ):      8.670 s ±  0.025 s    [User: 5.670 s, System: 
>> 7.113 s]
>>       Range (min … max):    8.620 s …  8.703 s    10 runs
>>     Benchmark 2: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' 
>> null:
>>       Time (mean ± σ):      1.839 s ±  0.008 s    [User: 4.651 s, System: 
>> 1.882 s]
>>       Range (min … max):    1.830 s …  1.853 s    10 runs
>>     Summary
>>       ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/after.sock' null: ran
>>         4.72 ± 0.02 times faster than ./nbdcopy --blkhash 
>> 'nbd+unix:///?socket=/tmp/before.sock' null:
>> With blksum, using one NBD connection:
>>     $ hyperfine -w 3 "blksum 'nbd+unix:///?socket=/tmp/before.sock'" \
>>                      "blksum 'nbd+unix:///?socket=/tmp/after.sock'"
>>     Benchmark 1: blksum 'nbd+unix:///?socket=/tmp/before.sock'
>>       Time (mean ± σ):     13.606 s ±  0.081 s    [User: 5.799 s, System: 
>> 6.231 s]
>>       Range (min … max):   13.516 s … 13.785 s    10 runs
>>     Benchmark 2: blksum 'nbd+unix:///?socket=/tmp/after.sock'
>>       Time (mean ± σ):      1.946 s ±  0.017 s    [User: 4.541 s, System: 
>> 1.481 s]
>>       Range (min … max):    1.912 s …  1.979 s    10 runs
>>     Summary
>>       blksum 'nbd+unix:///?socket=/tmp/after.sock' ran
>>         6.99 ± 0.07 times faster than blksum 
>> 'nbd+unix:///?socket=/tmp/before.sock'
>> This will improve other usage of unix domain sockets on macOS, I tested
>> only qemu-nbd.
>> Signed-off-by: Nir Soffer <nir...@gmail.com>
>> ---
>>  io/channel-socket.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>> diff --git a/io/channel-socket.c b/io/channel-socket.c
>> index 608bcf066e..b858659764 100644
>> --- a/io/channel-socket.c
>> +++ b/io/channel-socket.c
>> @@ -410,6 +410,19 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
>>      }
>>  #endif /* WIN32 */
>>  +#if __APPLE__
>> +    /* On macOS we need to tune unix domain socket buffer for best 
>> performance.
>> +     * Apple recommends sizing the receive buffer at 4 times the size of the
>> +     * send buffer.
>> +     */
>> +    if (cioc->localAddr.ss_family == AF_UNIX) {
>> +        const int sndbuf_size = 1024 * 1024;
> 
> Please add a definition instead of magic value, i.e.:
> 
>  #define SOCKET_SEND_BUFSIZE  (1 * MiB)

Using 1 * MiB is nicer.

I’m not sure what you mean by “magic” value. Do you mean:

    #define SOCKET_SEND_BUFSIZE  (1 * MiB)

At the top of the file, or near the definition?

    const int sndbuf_size = 1 * MiB;

If we put it at the top of the file, the name may be confusing, since it is
used only on macOS and only for unix sockets.

We can have:

    #define MACOS_UNIX_SOCKET_SEND_BUFSIZE (1 * MiB)

Or maybe:

    #if __APPLE__
    #define UNIX_SOCKET_SEND_BUFSIZE (1 * MiB)
    #endif

But we use this in only one function, so I’m not sure it helps.

In vmnet-helper I use this in 2 places, so I moved it to config.h:
https://github.com/nirs/vmnet-helper/blob/main/config.h.in
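
For illustration, a minimal standalone sketch of the named-constant variant
discussed above. The macro and function names are hypothetical and not from
the patch, and 1 MiB is spelled out as 1024 * 1024 so the sketch builds
without QEMU's headers. It applies the same 1:4 send/receive ratio as the
patch and reads the values back, since the kernel may adjust or clamp what
was requested:

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical name; the thread has not settled on one. */
#define UNIX_SOCKET_SEND_BUFSIZE (1024 * 1024) /* 1 MiB */

/*
 * Set the send buffer to sndbuf_size and the receive buffer to 4 times
 * that, the ratio used in the patch. Returns 0 on success, or -1 if
 * either setsockopt() call fails (errno is set by setsockopt()).
 */
static int unix_socket_set_bufsize(int fd, int sndbuf_size)
{
    assert(sndbuf_size > 0);
    const int rcvbuf_size = 4 * sndbuf_size;

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &sndbuf_size, sizeof(sndbuf_size)) < 0) {
        return -1;
    }
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &rcvbuf_size, sizeof(rcvbuf_size)) < 0) {
        return -1;
    }
    return 0;
}

/*
 * Read back the effective size for optname (SO_SNDBUF or SO_RCVBUF).
 * Returns the size reported by the kernel, or -1 on error.
 */
static int socket_get_bufsize(int fd, int optname)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, SOL_SOCKET, optname, &value, &len) < 0) {
        return -1;
    }
    return value;
}
```

Calling unix_socket_set_bufsize(fd, UNIX_SOCKET_SEND_BUFSIZE) on an accepted
AF_UNIX socket and then socket_get_bufsize(fd, SO_SNDBUF) shows what the
kernel actually granted, which may differ from the request.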

> 
> BTW in test_io_channel_set_socket_bufs() we use 64 KiB, why 1 MiB?

This test uses a small buffer size so we can see the effect of partial
reads/writes. I’m trying to improve throughput when reading image data with
qemu-nbd. This will likely also improve qemu-storage-daemon and the qemu
builtin nbd server, but I did not test them.
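
To illustrate why a small buffer exposes partial writes, here is a minimal
standalone sketch, not taken from the QEMU test. The function name and the
sizes are made up for the example. It shrinks the send buffer of one end of
a unix socket pair, makes it non-blocking, and attempts one large write()
while nothing reads from the peer:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Try to write request_size bytes in a single non-blocking write() to a
 * unix stream socket whose send buffer was set to sndbuf_size.
 * Returns the number of bytes the kernel accepted, or -1 on error.
 */
static ssize_t write_with_small_sndbuf(int sndbuf_size, size_t request_size)
{
    assert(sndbuf_size > 0 && request_size > 0);
    int fds[2];
    int flags;
    char *buf = NULL;
    ssize_t n = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) {
        return -1;
    }

    /* Zero-filled payload; the content does not matter here. */
    buf = calloc(1, request_size);
    if (buf == NULL) {
        goto out;
    }

    if (setsockopt(fds[0], SOL_SOCKET, SO_SNDBUF,
                   &sndbuf_size, sizeof(sndbuf_size)) < 0) {
        goto out;
    }

    /* Non-blocking, so write() returns instead of waiting for a reader. */
    flags = fcntl(fds[0], F_GETFL);
    if (flags < 0 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) < 0) {
        goto out;
    }

    n = write(fds[0], buf, request_size);

out:
    free(buf);
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

With a 64 KiB send buffer and an 8 MiB request, write() is expected to
accept only part of the data; the exact count depends on the platform.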

I ran some benchmarks with send buffer sizes from 64 KiB to 2 MiB, and 1 MiB
gives the best performance.

Running one qemu-nbd process with each configuration:

% ps
...
18850 ttys013    2:01.78 ./qemu-nbd-64k -r -t -e 0 -f raw -k /tmp/64k.sock /Users/nir/bench/data-10g.img
18871 ttys013    1:53.49 ./qemu-nbd-128k -r -t -e 0 -f raw -k /tmp/128k.sock /Users/nir/bench/data-10g.img
18877 ttys013    1:47.95 ./qemu-nbd-256k -r -t -e 0 -f raw -k /tmp/256k.sock /Users/nir/bench/data-10g.img
18885 ttys013    1:52.06 ./qemu-nbd-512k -r -t -e 0 -f raw -k /tmp/512k.sock /Users/nir/bench/data-10g.img
18894 ttys013    2:02.34 ./qemu-nbd-1m -r -t -e 0 -f raw -k /tmp/1m.sock /Users/nir/bench/data-10g.img
22918 ttys013    0:00.02 ./qemu-nbd-2m -r -t -e 0 -f raw -k /tmp/2m.sock /Users/nir/bench/data-10g.img

% hyperfine -w 3 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/64k.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/128k.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/256k.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/512k.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/1m.sock' null:" \
                 "./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/2m.sock' null:"
Benchmark 1: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/64k.sock' null:
  Time (mean ± σ):      2.760 s ±  0.014 s    [User: 4.871 s, System: 2.576 s]
  Range (min … max):    2.736 s …  2.788 s    10 runs

Benchmark 2: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/128k.sock' null:
  Time (mean ± σ):      2.284 s ±  0.006 s    [User: 4.774 s, System: 2.044 s]
  Range (min … max):    2.275 s …  2.294 s    10 runs

Benchmark 3: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/256k.sock' null:
  Time (mean ± σ):      2.036 s ±  0.010 s    [User: 4.734 s, System: 1.822 s]
  Range (min … max):    2.021 s …  2.052 s    10 runs

Benchmark 4: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/512k.sock' null:
  Time (mean ± σ):      1.763 s ±  0.005 s    [User: 4.637 s, System: 1.801 s]
  Range (min … max):    1.755 s …  1.771 s    10 runs

Benchmark 5: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/1m.sock' null:
  Time (mean ± σ):      1.653 s ±  0.012 s    [User: 4.568 s, System: 1.818 s]
  Range (min … max):    1.636 s …  1.683 s    10 runs

Benchmark 6: ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/2m.sock' null:
  Time (mean ± σ):      1.802 s ±  0.052 s    [User: 4.573 s, System: 1.918 s]
  Range (min … max):    1.736 s …  1.896 s    10 runs

Summary
  ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/1m.sock' null: ran
    1.07 ± 0.01 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/512k.sock' null:
    1.09 ± 0.03 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/2m.sock' null:
    1.23 ± 0.01 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/256k.sock' null:
    1.38 ± 0.01 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/128k.sock' null:
    1.67 ± 0.02 times faster than ./nbdcopy --blkhash 'nbd+unix:///?socket=/tmp/64k.sock' null:

I can add a compact table showing the results in a comment, or add the test
output to the commit message for reference.

> 
>> +        const int rcvbuf_size = 4 * sndbuf_size;
>> +        setsockopt(cioc->fd, SOL_SOCKET, SO_SNDBUF, &sndbuf_size, 
>> sizeof(sndbuf_size));
>> +        setsockopt(cioc->fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf_size, 
>> sizeof(rcvbuf_size));
>> +    }
>> +#endif /* __APPLE__ */
> 
> Thanks,
> 
> Phil.


