On Tue, Mar 25, 2025 at 05:06:55PM +0100, Hanna Czenczek wrote:
> We probably want to support larger write sizes than just 4k; 64k seems
> nice. However, we cannot read partial requests from the FUSE FD, we
> always have to read requests in full; so our read buffer must be large
> enough to accommodate potential 64k writes if we want to support that.
>
> Always allocating FuseRequest objects with 64k buffers in them seems
> wasteful, though. But we can get around the issue by splitting the
> buffer into two and using readv(): One part will hold all normal (up to
> 4k) write requests and all other requests, and a second part (the
> "spill-over buffer") will be used only for larger write requests. Each
> FuseQueue has its own spill-over buffer, and only if we find it used
> when reading a request will we move its ownership into the FuseRequest
> object and allocate a new spill-over buffer for the queue.
>
> This way, we get to support "large" write sizes without having to
> allocate big buffers when they aren't used.
>
> Also, this even reduces the size of the FuseRequest objects because the
> read buffer has to have at least FUSE_MIN_READ_BUFFER (8192) bytes; but
> the requests we support are not quite so large (except for >4k writes),
> so until now, we basically had to have useless padding in there.
>
> With the spill-over buffer added, the FUSE_MIN_READ_BUFFER requirement
> is easily met and we can decrease the size of the buffer portion that is
> right inside of FuseRequest.
>
> As for benchmarks, the benefit of this patch can be shown easily by
> writing a 4G image (with qemu-img convert) to a FUSE export:
> - Before this patch: Takes 25.6 s (14.4 s with -t none)
> - After this patch: Takes 4.5 s (5.5 s with -t none)
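
Just to check my understanding of the mechanism, the read path would end
up looking roughly like this (a sketch only - the names q->spillover_buf,
req->spillover_buf, and FUSE_SPILLOVER_BUF_SIZE are my own shorthand, not
necessarily what the patch uses):

    /*
     * First iovec holds the request header and any write payload up to
     * FUSE_IN_PLACE_WRITE_BYTES; the second iovec only receives data for
     * larger writes.
     */
    struct iovec iov[] = {
        { .iov_base = q->request_buf,   .iov_len = sizeof(q->request_buf) },
        { .iov_base = q->spillover_buf, .iov_len = FUSE_SPILLOVER_BUF_SIZE },
    };

    ssize_t ret = RETRY_ON_EINTR(readv(fuse_fd, iov, ARRAY_SIZE(iov)));

    if (ret > (ssize_t)sizeof(q->request_buf)) {
        /*
         * The request spilled over: hand the buffer to the request object
         * and give the queue a fresh, block-aligned one for the next read.
         */
        req->spillover_buf = q->spillover_buf;
        q->spillover_buf = blk_blockalign(exp->common.blk,
                                          FUSE_SPILLOVER_BUF_SIZE);
    }

If that's the shape of it, small requests keep paying only for the
in-place buffer, and a big allocation happens only when a large write
actually consumed the previous spill-over buffer.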
Hmm - 64k is still small for some tasks; with qcow2, you can have
clusters up to 2M, and writing a cluster at a time seems to be a
reasonable desire. Or in other storage areas, nbdcopy (from libnbd)
defaults to 256k and allows up to 32M, or we can even look at lvm, which
defaults to an extent size of 4M (although with lvm thin volumes, the
behavior is more like qcow2 with subclusters in that you don't have to
allocate the entire extent at once). Maybe it makes sense to have this as
a hierarchical spillover (the first level spills over to 64k, the second
level spills over to 2M or whatever ACTUAL maximum you are willing to
support) - see the rough sketch at the end of this mail. Or you can feel
free to ignore my comments on this patch - allowing 64k instead of 4k is
ALREADY a nice change.

> @@ -501,7 +547,20 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
>          goto no_request;
>      }
>
> -    ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
> +    /*
> +     * If handling the last request consumed the spill-over buffer, allocate a
> +     * new one. Align it to the block device's alignment, which admittedly is
> +     * only useful if FUSE_IN_PLACE_WRITE_BYTES is aligned, too.

What are typical block device minimum alignments? If 4k is typical, then
it should be easy to make FUSE_IN_PLACE_WRITE_BYTES be 4k. If 64k is
typical, we're already doomed as far as alignment goes. Is being
unaligned going to cause the block layer to add bounce buffers on us, if
the block layer has a larger alignment than what we use here?

> @@ -915,17 +983,25 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
>  }
>
>  /**
> - * Handle client writes to the exported image. @buf has the data to be written
> - * and will be copied to a bounce buffer before yielding for the first time.
> + * Handle client writes to the exported image. @in_place_buf has the first
> + * FUSE_IN_PLACE_WRITE_BYTES bytes of the data to be written, @spillover_buf
> + * contains the rest (if any; NULL otherwise).
> + * Data in @in_place_buf is assumed to be overwritten after yielding, so will
> + * be copied to a bounce buffer beforehand. @spillover_buf in contrast is
> + * assumed to be exclusively owned and will be used as-is.

Makes sense, although it leads to some interesting bookkeeping. (I'm
wondering if nbd would benefit from this split-level buffering, since it
supports reads and writes up to 32M.)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
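
The hierarchical spillover I was hand-waving about above, as a very rough
sketch. All names and sizes here are hypothetical; the point is only that
the large second-level buffer changes hands (and gets re-allocated) just
for writes that actually needed it:

    /*
     * Hypothetical two-level spill-over: iov[1] covers "medium" writes up
     * to 64k, iov[2] is only touched by writes larger than that, up to
     * whatever maximum we decide to support (2M in this sketch).
     */
    #define FUSE_SPILL_L1_BYTES (64 * 1024)
    #define FUSE_SPILL_L2_BYTES (2 * 1024 * 1024)

    struct iovec iov[] = {
        { .iov_base = q->request_buf,  .iov_len = sizeof(q->request_buf) },
        { .iov_base = q->spill_l1_buf, .iov_len = FUSE_SPILL_L1_BYTES },
        { .iov_base = q->spill_l2_buf, .iov_len = FUSE_SPILL_L2_BYTES },
    };

    ssize_t ret = RETRY_ON_EINTR(readv(fuse_fd, iov, ARRAY_SIZE(iov)));

    if (ret > (ssize_t)(sizeof(q->request_buf) + FUSE_SPILL_L1_BYTES)) {
        /* Really large write: only now do we pay for a new 2M buffer */
        req->spill_l2_buf = q->spill_l2_buf;
        q->spill_l2_buf = blk_blockalign(exp->common.blk, FUSE_SPILL_L2_BYTES);
    } else if (ret > (ssize_t)sizeof(q->request_buf)) {
        /* Medium write: only the cheaper 64k buffer changes hands */
        req->spill_l1_buf = q->spill_l1_buf;
        q->spill_l1_buf = blk_blockalign(exp->common.blk, FUSE_SPILL_L1_BYTES);
    }

The obvious downside is that each queue permanently pins a 2M buffer even
if nobody ever issues large writes, so it may only be worth doing behind
an export option.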