On 31/05/21 15:59, Kevin Wolf wrote:
>>> Apparently the motivation for Maxim's patch was, if I'm reading the
>>> description correctly, that it affected non-sg cases by imposing
>>> unnecessary restrictions. I see that patch 1 changed the max_iov part so
>>> that it won't affect non-sg cases any more, but max_transfer could still
>>> be more restricted than necessary, no?
>>
>> Indeed the kernel puts no limit at all, but especially with O_DIRECT we
>> probably benefit from avoiding the moral equivalent of "bufferbloat".
>
> Yeah, that sounds plausible, but on the other hand the bug report Maxim
> addressed was about performance issues related to buffer sizes being too
> small. So even if we want to have some limit, max_transfer of the host
> device is probably not the right one for the general case.
Yeah, for a simple dd with O_DIRECT there is no real max_transfer, as long
as you are willing to allocate a big enough buffer. Quick test on my
laptop, reading 12.5 GiB:
  buffer size (bytes)    time
             163840      9.46777s
             327680      9.41480s
             520192      9.39520s   (max_iov * 4K)
             614400      9.06289s
             655360      8.85762s
            1310720      8.75502s
            2621440      8.26522s
            5242880      7.88319s
           10485760      7.66751s
           20971520      7.42627s
In practice, blktrace shows that the virtual address space is fragmented
enough that the cap on I/O operations is not max_transfer but
max_iov * 4096 (as it was before this series)... and yet the benefit
effectively *begins* there, because that is where the cost of the system
calls starts to be amortized over multiple kernel<->disk communications.
Things are probably more complicated with more than one I/O in flight,
and with async I/O instead of read/write, but a huge part of the
performance still seems to come down to the cost of system calls (not
just the context switch, but also pinning the I/O buffer and all the
other ancillary costs).
So the solution is probably to add a max_hw_transfer limit in addition
to max_transfer, and have max_hw_iov instead of max_iov to match.
Paolo