Re: [PATCH] [for-10.1] qapi/block-core: deprecate some block-job- APIs
Vladimir Sementsov-Ogievskiy writes: > On 04.04.25 17:13, Markus Armbruster wrote: [...] >> So, auto-finalize=true is silently ignored when another job in the same >> transaction has auto-finalize=false? > > Yes, at least, it looks like so: > > static void job_completed_txn_success_locked(Job *job) > { > > [...] > > /* If no jobs need manual finalization, automatically do so */ > if (job_txn_apply_locked(job, job_needs_finalize_locked) == 0) { > job_do_finalize_locked(job); > } > } Silently ignoring what the user specified is not okay whether the user's instructions make sense or not. Fixing this UI bug would break usage that relies on it. Should we care? [...]
[PATCH] block/file-posix.c: Use pwritev2() with RWF_DSYNC for FUA
Force Unit Access (FUA) is an optimization where a disk write with the flag set will be persisted to disk immediately instead of potentially remaining in the disk's write cache. This commit addresses the TODO about using pwritev2() with RWF_DSYNC in the thread pool section of raw_co_prw(), if pwritev2() with RWF_DSYNC is available on the host, which is always the case for Linux kernels >= 4.7. The intent for FUA is indicated with the BDRV_REQ_FUA flag. The old code paths are preserved in case BDRV_REQ_FUA is off or pwritev2() with RWF_DSYNC is not available. During testing, I observed that the BDRV_REQ_FUA flag is always turned on when blk->enable_write_cache is not set in block/block-backend.c, so I commented this section out during testing: https://gitlab.com/qemu-project/qemu/-/blob/master/block/block-backend.c?ref_type=heads#L1432-1434 Signed-off-by: Pinku Deb Nath --- block/file-posix.c | 42 ++ 1 file changed, 34 insertions(+), 8 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 56d1972d15..34de816eab 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -229,6 +229,7 @@ typedef struct RawPosixAIOData { unsigned long op; } zone_mgmt; }; +BdrvRequestFlags flags; } RawPosixAIOData; #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -1674,6 +1675,16 @@ qemu_pwritev(int fd, const struct iovec *iov, int nr_iov, off_t offset) return pwritev(fd, iov, nr_iov, offset); } +static ssize_t +qemu_pwrite_fua(int fd, const struct iovec *iov, int nr_iov, off_t offset) +{ +#ifdef RWF_DSYNC +return pwritev2(fd, iov, nr_iov, offset, RWF_DSYNC); +#else +return pwritev2(fd, iov, nr_iov, offset, 0); +#endif +} + #else static bool preadv_present = false; @@ -1698,10 +1709,15 @@ static ssize_t handle_aiocb_rw_vector(RawPosixAIOData *aiocb) len = RETRY_ON_EINTR( (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) ? -qemu_pwritev(aiocb->aio_fildes, - aiocb->io.iov, - aiocb->io.niov, - aiocb->aio_offset) : +(aiocb->flags & BDRV_REQ_FUA) ? +qemu_pwrite_fua(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset) : +qemu_pwritev(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset) : qemu_preadv(aiocb->aio_fildes, aiocb->io.iov, aiocb->io.niov, @@ -1727,10 +1743,17 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf) while (offset < aiocb->aio_nbytes) { if (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) { -len = pwrite(aiocb->aio_fildes, - (const char *)buf + offset, - aiocb->aio_nbytes - offset, - aiocb->aio_offset + offset); +if (aiocb->flags & BDRV_REQ_FUA) { +len = qemu_pwrite_fua(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset); +} else { +len = pwrite(aiocb->aio_fildes, +(const char *)buf + offset, +aiocb->aio_nbytes - offset, +aiocb->aio_offset + offset); +} } else { len = pread(aiocb->aio_fildes, buf + offset, @@ -2539,14 +2562,17 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr, .iov= qiov->iov, .niov = qiov->niov, }, +.flags = flags, }; assert(qiov->size == bytes); ret = raw_thread_pool_submit(handle_aiocb_rw, &acb); +#ifndef RWF_DSYNC if (ret == 0 && (flags & BDRV_REQ_FUA)) { /* TODO Use pwritev2() instead if it's available */ ret = raw_co_flush_to_disk(bs); } +#endif goto out; /* Avoid the compiler err of unused label */ out: -- 2.43.0
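For readers unfamiliar with the interface: the following minimal, standalone C sketch (not QEMU code, and not part of the patch) illustrates the pwritev2()/RWF_DSYNC semantics the commit builds on. The helper name is made up and the error handling is simplified; RWF_DSYNC requires Linux >= 4.7, and the fallback mirrors what raw_co_prw() already does today, namely a plain write followed by an explicit flush.

#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>
#include <errno.h>

/* Write iov at offset and make it durable, like a FUA disk write. */
static ssize_t write_fua(int fd, const struct iovec *iov, int iovcnt,
                         off_t offset)
{
#ifdef RWF_DSYNC
    ssize_t len = pwritev2(fd, iov, iovcnt, offset, RWF_DSYNC);
    if (len >= 0 || (errno != ENOSYS && errno != ENOTSUP)) {
        return len;
    }
    /* Kernel too old for pwritev2(): fall back to write + fdatasync */
#endif
    ssize_t ret = pwritev(fd, iov, iovcnt, offset);
    if (ret >= 0 && fdatasync(fd) != 0) {
        return -1;
    }
    return ret;
}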
[PATCH 07/11] docs/sphinx/qmp_lexer: Highlight elisions like comments, not prompts
Signed-off-by: Markus Armbruster --- docs/sphinx/qmp_lexer.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/sphinx/qmp_lexer.py b/docs/sphinx/qmp_lexer.py index 1bd1b81b70..7b3b808d12 100644 --- a/docs/sphinx/qmp_lexer.py +++ b/docs/sphinx/qmp_lexer.py @@ -24,7 +24,7 @@ class QMPExampleMarkersLexer(RegexLexer): 'root': [ (r'-> ', token.Generic.Prompt), (r'<- ', token.Generic.Prompt), -(r'\.{3}( .* \.{3})?', token.Generic.Prompt), +(r'\.{3}( .* \.{3})?', token.Comment.Multiline), ] } -- 2.48.1
Re: [PATCH v3 0/3] Enable QEMU NVMe userspace driver on s390x
On Tue, Apr 01, 2025 at 10:22:43AM -0700, Farhan Ali wrote: > Hi, > > Recently on s390x we have enabled mmap support for vfio-pci devices [1]. Hi Alex, I wanted to bring this to your attention. Feel free to merge it through the VFIO tree, otherwise I will merge it once you have taken a look. Thanks, Stefan > This allows us to take advantage and use userspace drivers on s390x. However, > on s390x we have special instructions for MMIO access. Starting with z15 > (and newer platforms) we have new PCI Memory I/O (MIO) instructions which > operate on virtually mapped PCI memory spaces, and can be used from userspace. > On older platforms we would fall back to using existing system calls for MMIO > access. > > This patch series introduces support for the PCI MIO instructions, and enables > s390x > support for the userspace NVMe driver on s390x. I would appreciate any > review/feedback > on the patches. > > Thanks > Farhan > > [1] > https://lore.kernel.org/linux-s390/20250226-vfio_pci_mmap-v7-0-c5c0f1d26...@linux.ibm.com/ > > ChangeLog > - > v2 series https://mail.gnu.org/archive/html/qemu-devel/2025-03/msg06847.html > v2 -> v3 > - Update the PCI MMIO APIs to reflect that it's PCI MMIO access on the host > as suggested by Stefan (patch 2) > - Move s390x ifdef check to s390x_pci_mmio.h as suggested by Philippe > (patch 1) > - Add R-bs for the respective patches. > > v1 series https://mail.gnu.org/archive/html/qemu-devel/2025-03/msg06596.html > v1 -> v2 > - Add 8 and 16 bit reads/writes for completeness (patch 1) > - Introduce new QEMU PCI MMIO read/write API as suggested by Stefan > (patch 2) > - Update NVMe userspace driver to use QEMU PCI MMIO functions (patch 3) > > Farhan Ali (3): > util: Add functions for s390x mmio read/write > include: Add a header to define host PCI MMIO functions > block/nvme: Use host PCI MMIO API > > block/nvme.c | 37 + > include/qemu/host-pci-mmio.h | 116 ++ > include/qemu/s390x_pci_mmio.h | 24 ++ > util/meson.build | 2 + > util/s390x_pci_mmio.c | 148 ++ > 5 files changed, 311 insertions(+), 16 deletions(-) > create mode 100644 include/qemu/host-pci-mmio.h > create mode 100644 include/qemu/s390x_pci_mmio.h > create mode 100644 util/s390x_pci_mmio.c > > -- > 2.43.0
Re: [PATCH v3 0/2] qapi/throttle: Fix qmp_block_set_io_throttle blocked for too long
On Wed, Mar 26, 2025 at 05:26:32PM +0800, zoudongjie wrote: > From: Zhu Yangyang > > Calling qmp_block_set_io_throttle() can be blocked for a long time > when a network disk is configured and the network failure is just about > to occur. > > This series adds a timeout parameter for qmp_block_set_io_throttle to control > its execution duration. > > Changelog > v3 --- > Unify AIO_WAIT_WHILE_{TIMEOUT/INTERNAL} by replacing > AIO_WAIT_WHILE_INTERNAL() with > AIO_WAIT_WHILE_TIMEOUT(..., 0). > > v2 > 1. Support 0 in BDRV_POLL_WHILE_TIMEOUT(), 0 means infinite. > 2. Use uint64_t timeout_ns instead of int64 timeout to name variables. > 3. Use timer_pending() to check for expiry instead of explicitly checking > against the deadline for BDRV_POLL_WHILE_TIMEOUT(). > 4. Add documentation for bdrv_drained_begin_timeout(), note that > bdrv_drained_end() > must be called when -ETIMEDOUT is returned. > 5. Add a timeout parameter to the qmp_block_set_io_throttle() instead of > hardcoding > the timeout, and the default value is 0, meaning an infinite timeout. > > v1 patch link: > https://lore.kernel.org/qemu-devel/20250308101618.721954-1-zoudong...@huawei.com/ > > Zhu Yangyang (2): > io/block: Refactoring the bdrv_drained_begin() function and implement > a timeout mechanism. > qapi/throttle: add timeout parameter for qmp_block_set_io_throttle() > > block/block-backend.c | 14 - > block/io.c | 58 + > block/qapi-system.c | 10 +++- > include/block/aio-wait.h| 47 - > include/block/block-io.h| 22 +++- > include/system/block-backend-global-state.h | 1 + > qapi/block-core.json| 5 +- > util/aio-wait.c | 5 ++ > 8 files changed, 135 insertions(+), 27 deletions(-) > > -- > 2.33.0 > Aside from Markus' comments: Reviewed-by: Stefan Hajnoczi
Re: [PATCH v2 2/4] docs, qapi: generate undocumented return sections
On Tue, Apr 1, 2025 at 2:07 AM Markus Armbruster wrote: > John Snow writes: > > > On Thu, Mar 27, 2025 at 5:11 AM Markus Armbruster > wrote: > > > >> John Snow writes: > >> > >> > This patch changes the qapidoc transmogrifier to generate Return value > >> > documentation for any command that has a return value but hasn't > >> > explicitly documented that return value. > >> > > >> > Signed-off-by: John Snow > >> > >> Might want to briefly explain placement of the auto-generated return > >> value documentation. But before we discuss that any further, let's > >> review the actual changes to the generated docs. > >> > >> This patch adds auto-generated return value documentation where we have > >> none. > >> > >> The next patch replaces handwritten by auto-generated return value > >> documentation where these are at least as good. Moves the return value > >> docs in some cases. > >> > >> First the additions: > >> > >> * x-debug-query-block-graph > >> > >> Title, intro, features, return > >> > >> * query-tpm > >> > >> Title, intro, return, example > >> > >> * query-dirty-rate > >> > >> Title, intro, arguments, return, examples > >> > >> * query-vcpu-dirty-limit > >> > >> Title, intro, return, example > >> > >> * query-vm-generation-id > >> > >> Title, return > >> > >> * query-memory-size-summary > >> > >> Title, intro, example, return > >> > >> * query-memory-devices > >> > >> Title, intro, return, example > >> > >> * query-acpi-ospm-status > >> > >> Title, intro, return, example > >> > >> * query-stats-schemas > >> > >> Title, intro, arguments, note, return > >> > >> Undesirable: > >> > >> * query-memory-size-summary has returns after the example instead of > >> before. I figure it needs the TODO hack to separate intro and example > >> (see announce-self). > >> > > > > Yes > > > > > >> > >> * query-stats-schemas has a note between arguments and return. I think > >> this demonstrates that the placement algorithm is too simplistic. > >> > > > > Yeah ... It's just that *glances at length of email* it's been a > sensitive > > topic without a lot of certainty in desire, so I've tried to keep things > on > > the stupid/simple side ... > > When the best solution is unclear, starting discussion with a simplistic > solution is smart. Beats starting with a complicated solution that gets > shot down. > > >> Debatable: > >> > >> * x-debug-query-block-graph has returns after features. I'd prefer > >> returns before features. No need to debate this now. > >> > > > > Willing to do this for you if you'd like to enforce it globally. I'd be > > happy with any mandated order of sections, really. > > Could a more rigid order help the inliner, too? > Oh, absolutely. My series that adds it still assumes a rigid ordering. It just never enforced it. I don't really care what the order even is, just as long as there is one.
> > >> Next the movements: > >> > >> * x-debug-block-dirty-bitmap-sha256 > >> > >> From right before errors to right after > >> > >> * blockdev-snapshot-delete-internal-sync > >> > >> From right before errors to right after > >> > >> * query-xen-replication-status > >> > >> From between intro and example to the end > >> > >> * query-colo-status > >> > >> From between intro and example to the end > >> > >> * query-balloon > >> > >> From right before errors to right after > >> > >> * query-hv-balloon-status-report > >> > >> From right before errors to right after > >> > >> * query-yank > >> > >> From between intro and example to the end > >> > >> * add-fd > >> > >> From between arguments and errors to between last note and example > >> > >> I don't like any of these :) > >> > > > > So it goes ... > > > > > >> > >> Undesirable: > >> > >> * query-xen-replication-status, query-yank, and query-colo-status now > >> have return after the example instead of before. I figure they now > >> need the TODO hack to separate intro and example. > >> > > > > Yes > > > > > >> > >> * add-fd now has a note between arguments and return. Same placement > >> weakness as for query-stats above. > >> > >> Debatable: > >> > >> * x-debug-block-dirty-bitmap-sha256, > >> blockdev-snapshot-delete-internal-sync, query-colo-status, and > >> query-hv-balloon-status-report now have return after errors instead of > >> before. I'd prefer before. > >> > >> What's the stupidest acceptable placement algorithm? Maybe this one: > >> > >> 1. If we have arguments, return goes right after them. > >> > >> 2. Else if we have errors, return goes right before them. > >> > >> 3. Else if we have features, return goes right before them. > >> > >> 4. Else return goes right after the intro (to make this work, we need > >>a few more TODO hacks). > >> > >> Would you be willing to give this a try? > >> > > > > Probably ... > > > > So this algorithm seems to imply a preference for this ordering: > > > > 1. Intro > > 2. Arguments > > 3. Return > > 4. Errors > > 5. Features > > 6. Details > > > > Do
[PATCH 08/15] fuse: Introduce fuse_{at,de}tach_handlers()
Pull setting up and tearing down the AIO context handlers into two dedicated functions. Signed-off-by: Hanna Czenczek --- block/export/fuse.c | 32 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/block/export/fuse.c b/block/export/fuse.c index 2df6297d61..bd98809d71 100644 --- a/block/export/fuse.c +++ b/block/export/fuse.c @@ -78,27 +78,34 @@ static void read_from_fuse_export(void *opaque); static bool is_regular_file(const char *path, Error **errp); -static void fuse_export_drained_begin(void *opaque) +static void fuse_attach_handlers(FuseExport *exp) { -FuseExport *exp = opaque; +aio_set_fd_handler(exp->common.ctx, + fuse_session_fd(exp->fuse_session), + read_from_fuse_export, NULL, NULL, NULL, exp); +exp->fd_handler_set_up = true; +} +static void fuse_detach_handlers(FuseExport *exp) +{ aio_set_fd_handler(exp->common.ctx, fuse_session_fd(exp->fuse_session), NULL, NULL, NULL, NULL, NULL); exp->fd_handler_set_up = false; } +static void fuse_export_drained_begin(void *opaque) +{ +fuse_detach_handlers(opaque); +} + static void fuse_export_drained_end(void *opaque) { FuseExport *exp = opaque; /* Refresh AioContext in case it changed */ exp->common.ctx = blk_get_aio_context(exp->common.blk); - -aio_set_fd_handler(exp->common.ctx, - fuse_session_fd(exp->fuse_session), - read_from_fuse_export, NULL, NULL, NULL, exp); -exp->fd_handler_set_up = true; +fuse_attach_handlers(exp); } static bool fuse_export_drained_poll(void *opaque) @@ -209,11 +216,7 @@ static int fuse_export_create(BlockExport *blk_exp, g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL); -aio_set_fd_handler(exp->common.ctx, - fuse_session_fd(exp->fuse_session), - read_from_fuse_export, NULL, NULL, NULL, exp); -exp->fd_handler_set_up = true; - +fuse_attach_handlers(exp); return 0; fail: @@ -329,10 +332,7 @@ static void fuse_export_shutdown(BlockExport *blk_exp) fuse_session_exit(exp->fuse_session); if (exp->fd_handler_set_up) { -aio_set_fd_handler(exp->common.ctx, - fuse_session_fd(exp->fuse_session), - NULL, NULL, NULL, NULL, NULL); -exp->fd_handler_set_up = false; +fuse_detach_handlers(exp); } } -- 2.48.1
Re: [PATCH v9 4/7] qapi: add blockdev-replace command
On 18.10.24 16:59, Kevin Wolf wrote: If we want to get rid of the union, I think the best course of action would be unifying the namespaces (so that nodes, exports and devices can't share the same ID) and then we could just accept a universal 'id' along with 'child'. Maybe we can go this way even without an explicit restriction (which would somehow have to go through a deprecation period, etc.), but simply look the ID up among nodes, devices and exports, and fail if more than one parent is found. And we document that the ID must not be ambiguous, i.e. must not match more than one parent object. Then those who want to use the new command will take care to use unique IDs. -- Best regards, Vladimir
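For concreteness, here is a hedged sketch of the unified lookup Vladimir describes, based on lookup helpers that exist in QEMU today (bdrv_find_node(), blk_by_name(), blk_exp_find()); the function name and error wording are illustrative only, and the necessary headers are omitted:

static bool parent_id_is_unique(const char *id, Error **errp)
{
    /* Each helper returns NULL when the ID is not in its namespace */
    BlockDriverState *bs  = bdrv_find_node(id);
    BlockBackend     *blk = blk_by_name(id);
    BlockExport      *exp = blk_exp_find(id);

    if (!!bs + !!blk + !!exp > 1) {
        error_setg(errp, "ID '%s' matches more than one parent object", id);
        return false;
    }
    return true;
}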
[PATCH v2 4/4] qapi: rephrase return docs to avoid type name
Well, I tried. Maybe not very hard. Sorry! Signed-off-by: John Snow --- qapi/block-core.json | 6 +++--- qapi/block-export.json | 2 +- qapi/block.json | 2 +- qapi/control.json| 5 ++--- qapi/dump.json | 5 ++--- qapi/introspect.json | 6 +++--- qapi/job.json| 2 +- qapi/machine-target.json | 7 +++ qapi/misc-target.json| 2 +- qapi/misc.json | 5 ++--- qapi/net.json| 2 +- qapi/pci.json| 2 +- qapi/qdev.json | 3 +-- qapi/qom.json| 8 +++- qapi/stats.json | 2 +- qapi/trace.json | 2 +- qapi/ui.json | 2 +- qapi/virtio.json | 6 +++--- 18 files changed, 31 insertions(+), 38 deletions(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index 92b2e368b72..eb97b70cd80 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -763,7 +763,7 @@ # # Get a list of BlockInfo for all virtual block devices. # -# Returns: a list of @BlockInfo describing each virtual block device. +# Returns: a list describing each virtual block device. # Filter nodes that were created implicitly are skipped over. # # Since: 0.14 @@ -1168,7 +1168,7 @@ # nodes that were created implicitly are skipped over in this # mode. (Since 2.3) # -# Returns: A list of @BlockStats for each virtual block devices. +# Returns: A list of statistics for each virtual block device. # # Since: 0.14 # @@ -1440,7 +1440,7 @@ # # Return information about long-running block device operations. # -# Returns: a list of @BlockJobInfo for each active block job +# Returns: a list of job info for each active block job # # Since: 1.1 ## diff --git a/qapi/block-export.json b/qapi/block-export.json index c783e01a532..84852606e52 100644 --- a/qapi/block-export.json +++ b/qapi/block-export.json @@ -472,7 +472,7 @@ ## # @query-block-exports: # -# Returns: A list of BlockExportInfo describing all block exports +# Returns: A list describing all block exports # # Since: 5.2 ## diff --git a/qapi/block.json b/qapi/block.json index e6f5c64..bdbbe78854f 100644 --- a/qapi/block.json +++ b/qapi/block.json @@ -86,7 +86,7 @@ # Returns a list of information about each persistent reservation # manager. # -# Returns: a list of @PRManagerInfo for each persistent reservation +# Returns: a list of manager info for each persistent reservation # manager # # Since: 3.0 diff --git a/qapi/control.json b/qapi/control.json index 336386f79e1..2e45bf25df8 100644 --- a/qapi/control.json +++ b/qapi/control.json @@ -93,8 +93,7 @@ # # Returns the current version of QEMU. # -# Returns: A @VersionInfo object describing the current version of -# QEMU. +# Returns: An object describing the current version of QEMU. # # Since: 0.14 # @@ -131,7 +130,7 @@ # # Return a list of supported QMP commands by this server # -# Returns: A list of @CommandInfo for all supported commands +# Returns: A list of all supported commands # # Since: 0.14 # diff --git a/qapi/dump.json b/qapi/dump.json index d7826c0e323..1bd6bacc5ce 100644 --- a/qapi/dump.json +++ b/qapi/dump.json @@ -146,7 +146,7 @@ # # Query latest dump status. # -# Returns: A @DumpStatus object showing the dump status. +# Returns: An object showing the dump status. 
# # Since: 2.6 # @@ -197,8 +197,7 @@ # # Returns the available formats for dump-guest-memory # -# Returns: A @DumpGuestMemoryCapability object listing available -# formats for dump-guest-memory +# Returns: An object listing available formats for dump-guest-memory # # Since: 2.0 # diff --git a/qapi/introspect.json b/qapi/introspect.json index 01bb242947c..7daec5045fb 100644 --- a/qapi/introspect.json +++ b/qapi/introspect.json @@ -34,10 +34,10 @@ # string into a specific enum or from one specific type into an # alternate that includes the original type alongside something else. # -# Returns: array of @SchemaInfo, where each element describes an -# entity in the ABI: command, event, type, ... +# Returns: an array where each element describes an entity in the ABI: +# command, event, type, ... # -# The order of the various SchemaInfo is unspecified; however, all +# The order of the various elements is unspecified; however, all # names are guaranteed to be unique (no name will be duplicated # with different meta-types). # diff --git a/qapi/job.json b/qapi/job.json index cfc3beedd21..856dd688f95 100644 --- a/qapi/job.json +++ b/qapi/job.json @@ -269,7 +269,7 @@ # # Return information about jobs. # -# Returns: a list with a @JobInfo for each active job +# Returns: a list with info for each active job # # Since: 3.0 ## diff --git a/qapi/machine-target.json b/qapi/machine-target.json index 37e75520094..6d8a6e53436 100644 --- a/qapi/machine-target.json +++ b/qapi/machine-target.json @@ -162,8 +162,7 @@ # @modelb: description of the second CPU model to compare, referred to # as "model B" in CpuModelCompareResult # -#
[PATCH v2 1/2] block/file-posix.c: Use pwritev2() with RWF_DSYNC for FUA
Force Unit Access (FUA) is an optimization where a disk write with the flag set will be persisted to disk immediately instead of potentially remaining in the disk's write cache. This commit addresses the TODO about using pwritev2() with RWF_DSYNC in the thread pool section of raw_co_prw(), if pwritev2() with RWF_DSYNC is available on the host, which is always the case for Linux kernels >= 4.7. The intent for FUA is indicated with the BDRV_REQ_FUA flag. The old code paths are preserved in case BDRV_REQ_FUA is off or pwritev2() with RWF_DSYNC is not available. During testing, I observed that the BDRV_REQ_FUA flag is always turned on when blk->enable_write_cache is not set in block/block-backend.c, so I commented this section out during testing: https://gitlab.com/qemu-project/qemu/-/blob/master/block/block-backend.c?ref_type=heads#L1432-1434 Signed-off-by: Pinku Deb Nath --- block/file-posix.c | 42 ++ 1 file changed, 34 insertions(+), 8 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 56d1972d15..34de816eab 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -229,6 +229,7 @@ typedef struct RawPosixAIOData { unsigned long op; } zone_mgmt; }; +BdrvRequestFlags flags; } RawPosixAIOData; #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -1674,6 +1675,16 @@ qemu_pwritev(int fd, const struct iovec *iov, int nr_iov, off_t offset) return pwritev(fd, iov, nr_iov, offset); } +static ssize_t +qemu_pwrite_fua(int fd, const struct iovec *iov, int nr_iov, off_t offset) +{ +#ifdef RWF_DSYNC +return pwritev2(fd, iov, nr_iov, offset, RWF_DSYNC); +#else +return pwritev2(fd, iov, nr_iov, offset, 0); +#endif +} + #else static bool preadv_present = false; @@ -1698,10 +1709,15 @@ static ssize_t handle_aiocb_rw_vector(RawPosixAIOData *aiocb) len = RETRY_ON_EINTR( (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) ? -qemu_pwritev(aiocb->aio_fildes, - aiocb->io.iov, - aiocb->io.niov, - aiocb->aio_offset) : +(aiocb->flags & BDRV_REQ_FUA) ? +qemu_pwrite_fua(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset) : +qemu_pwritev(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset) : qemu_preadv(aiocb->aio_fildes, aiocb->io.iov, aiocb->io.niov, @@ -1727,10 +1743,17 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf) while (offset < aiocb->aio_nbytes) { if (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) { -len = pwrite(aiocb->aio_fildes, - (const char *)buf + offset, - aiocb->aio_nbytes - offset, - aiocb->aio_offset + offset); +if (aiocb->flags & BDRV_REQ_FUA) { +len = qemu_pwrite_fua(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset); +} else { +len = pwrite(aiocb->aio_fildes, +(const char *)buf + offset, +aiocb->aio_nbytes - offset, +aiocb->aio_offset + offset); +} } else { len = pread(aiocb->aio_fildes, buf + offset, @@ -2539,14 +2562,17 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr, .iov= qiov->iov, .niov = qiov->niov, }, +.flags = flags, }; assert(qiov->size == bytes); ret = raw_thread_pool_submit(handle_aiocb_rw, &acb); +#ifndef RWF_DSYNC if (ret == 0 && (flags & BDRV_REQ_FUA)) { /* TODO Use pwritev2() instead if it's available */ ret = raw_co_flush_to_disk(bs); } +#endif goto out; /* Avoid the compiler err of unused label */ out: -- 2.43.0
Re: [PATCH 01/15] fuse: Copy write buffer content before polling
On Tue, Mar 25, 2025 at 05:06:35PM +0100, Hanna Czenczek wrote: > Polling in I/O functions can lead to nested read_from_fuse_export() > calls, overwriting the request buffer's content. The only function > affected by this is fuse_write(), which therefore must use a bounce > buffer or corruption may occur. > > Note that in addition we do not know whether libfuse-internal structures > can cope with this nesting, and even if we did, we probably cannot rely > on it in the future. This is the main reason why we want to remove > libfuse from the I/O path. > > @@ -624,6 +630,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, > const char *buf, > size_t size, off_t offset, struct fuse_file_info *fi) > { > FuseExport *exp = fuse_req_userdata(req); > +void *copied; Do we have a good way to annotate variables that require qemu_vfree() if non-NULL for automatic cleanup? If so, should this be annotated and set to NULL here,... > int64_t length; > int ret; > > @@ -638,6 +645,14 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, > const char *buf, > return; > } > > +/* > + * Heed the note on read_from_fuse_export(): If we poll (which any > blk_*() > + * I/O function may do), read_from_fuse_export() may be nested, > overwriting > + * the request buffer content. Therefore, we must copy it here. > + */ > +copied = blk_blockalign(exp->common.blk, size); > +memcpy(copied, buf, size); > + > /** > * Clients will expect short writes at EOF, so we have to limit > * offset+size to the image length. > @@ -645,7 +660,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, > const char *buf, > length = blk_getlength(exp->common.blk); > if (length < 0) { > fuse_reply_err(req, -length); > -return; > +goto free_buffer; ...so that this and similar hunks are not needed? -- Eric Blake, Principal Software Engineer Red Hat, Inc. Virtualization: qemu.org | libguestfs.org
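On the annotation question: the usual mechanism is the same __attribute__((cleanup)) trick behind GLib's g_autofree. A hedged sketch of what a qemu_vfree() variant could look like (the AUTO_VFREE name and helper are hypothetical, not existing QEMU API):

/* Called automatically when an AUTO_VFREE variable goes out of scope */
static inline void auto_vfree_cleanup(void *ptr)
{
    void **p = ptr;
    if (*p) {
        qemu_vfree(*p);
    }
}
#define AUTO_VFREE __attribute__((cleanup(auto_vfree_cleanup)))

With such a macro, fuse_write() could declare "AUTO_VFREE void *copied = NULL;" and drop the free_buffer label entirely, since every return path would then free the bounce buffer automatically.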
[PATCH v4 0/3] qapi: Fix some command blocked for too long
From: Zhu Yangyang QMP may be blocked for a long time on bdrv_drained_begin() when a network disk is configured and the network failure is just about to occur. In theory, any command may be blocked if it calls bdrv_drained_begin(). This series adds a timeout parameter for qmp_block_set_io_throttle() and qmp_block_resize() to control their execution duration. Changelog v4 --- 1. Corrected the added version of the timeout parameter from 10.0 to 10.1 2. Add a patch to add the timeout parameter for qmp_block_resize(). v3 --- Unify AIO_WAIT_WHILE_{TIMEOUT/INTERNAL} by replacing AIO_WAIT_WHILE_INTERNAL() with AIO_WAIT_WHILE_TIMEOUT(..., 0). v2 --- 1. Support 0 in BDRV_POLL_WHILE_TIMEOUT(), 0 means infinite. 2. Use uint64_t timeout_ns instead of int64 timeout to name variables. 3. Use timer_pending() to check for expiry instead of explicitly checking against the deadline for BDRV_POLL_WHILE_TIMEOUT(). 4. Add documentation for bdrv_drained_begin_timeout(), note that bdrv_drained_end() must be called when -ETIMEDOUT is returned. 5. Add a timeout parameter to the qmp_block_set_io_throttle() instead of hardcoding the timeout, and the default value is 0, meaning an infinite timeout. v1 patch link: https://lore.kernel.org/qemu-devel/20250308101618.721954-1-zoudong...@huawei.com/ Zhu Yangyang (3): io/block: Refactoring the bdrv_drained_begin() function and implement a timeout mechanism. qapi/throttle: add timeout parameter for qmp_block_set_io_throttle() qapi/resize: add timeout parameter for qmp_block_resize() block/block-backend.c | 14 - block/io.c | 58 + block/monitor/block-hmp-cmds.c | 2 +- block/qapi-system.c | 10 +++- blockdev.c | 16 +- include/block/aio-wait.h| 47 - include/block/block-io.h| 22 +++- include/system/block-backend-global-state.h | 1 + qapi/block-core.json| 11 +++- util/aio-wait.c | 5 ++ 10 files changed, 155 insertions(+), 31 deletions(-) -- 2.33.0
Re: [PATCH 02/15] fuse: Ensure init clean-up even with error_fatal
On Tue, Mar 25, 2025 at 05:06:42PM +0100, Hanna Czenczek wrote: > When exports are created on the command line (with the storage daemon), > errp is going to point to error_fatal. Without ERRP_GUARD, we would > exit immediately when *errp is set, i.e. skip the clean-up code under > the `fail` label. Use ERRP_GUARD so we always run that code. > > As far as I know, this has no actual impact right now[1], but it is > still better to make this right. > > [1] Not cleaning up the mount point is the only thing I can imagine > would be problematic, but that is the last thing we attempt, so if > it fails, it will clean itself up. > > Signed-off-by: Hanna Czenczek > --- > block/export/fuse.c | 1 + > 1 file changed, 1 insertion(+) Reviewed-by: Stefan Hajnoczi
[PULL 1/1] Revert "iotests: Stop NBD server in test 162 before starting the next one"
From: Thomas Huth This reverts commit e2668ba1ed44ad56f2f1653ff5f53b277d534fac. This commit made test 162 fail occasionally with: 162 fail [13:06:40] [13:06:40] 0.2s (last: 0.2s) output mismatch --- tests/qemu-iotests/162.out +++ tests/qemu-iotests/scratch/qcow2-file-162/162.out.bad @@ -3,6 +3,7 @@ === NBD === qemu-img: Could not open 'json:{"driver": "nbd", "host": -1}': address resolution failed for -1:10809: Name or service not known image: nbd://localhost:PORT +./common.rc: line 371: kill: (891116) - No such process image: nbd+unix://?socket=42 The NBD server should normally terminate automatically, so trying to kill it here causes a race: the test fails when the server has already terminated before the kill command executes. The "Stop NBD server" patch was originally written to solve another problem with a hanging NBD server, but since that problem has been properly solved by commit 3e1683485656, we now don't need the "_stop_nbd_server" here anymore. Reviewed-by: Hanna Czenczek Signed-off-by: Thomas Huth Message-ID: <20250326143533.932899-1-th...@redhat.com> Signed-off-by: Eric Blake --- tests/qemu-iotests/162 | 1 - 1 file changed, 1 deletion(-) diff --git a/tests/qemu-iotests/162 b/tests/qemu-iotests/162 index 956c2c5f339..94dae60d304 100755 --- a/tests/qemu-iotests/162 +++ b/tests/qemu-iotests/162 @@ -65,7 +65,6 @@ done $QEMU_IMG info "json:{'driver': 'nbd', 'host': 'localhost', 'port': $port}" \ | grep '^image' | sed -e "s/$port/PORT/" -_stop_nbd_server # This is a test for NBD's bdrv_refresh_filename() implementation: It expects # either host or path to be set, but it must not assume that they are set to -- 2.48.1
[PATCH v2 1/3] util: Add functions for s390x mmio read/write
Starting with z15 (or newer) we can execute mmio instructions from userspace. On older platforms where we don't have these instructions available we can fallback to using system calls to access the PCI mapped resources. This patch adds helper functions for mmio reads and writes for s390x. Reviewed-by: Stefan Hajnoczi Signed-off-by: Farhan Ali --- include/qemu/s390x_pci_mmio.h | 23 ++ util/meson.build | 2 + util/s390x_pci_mmio.c | 148 ++ 3 files changed, 173 insertions(+) create mode 100644 include/qemu/s390x_pci_mmio.h create mode 100644 util/s390x_pci_mmio.c diff --git a/include/qemu/s390x_pci_mmio.h b/include/qemu/s390x_pci_mmio.h new file mode 100644 index 00..aead791475 --- /dev/null +++ b/include/qemu/s390x_pci_mmio.h @@ -0,0 +1,23 @@ +/* + * s390x PCI MMIO definitions + * + * Copyright 2025 IBM Corp. + * Author(s): Farhan Ali + * + * SPDX-License-Identifier: GPL-2.0-or-later + */ +#ifndef S390X_PCI_MMIO_H +#define S390X_PCI_MMIO_H + +uint8_t s390x_pci_mmio_read_8(const void *ioaddr); +uint16_t s390x_pci_mmio_read_16(const void *ioaddr); +uint32_t s390x_pci_mmio_read_32(const void *ioaddr); +uint64_t s390x_pci_mmio_read_64(const void *ioaddr); + +void s390x_pci_mmio_write_8(void *ioaddr, uint8_t val); +void s390x_pci_mmio_write_16(void *ioaddr, uint16_t val); +void s390x_pci_mmio_write_32(void *ioaddr, uint32_t val); +void s390x_pci_mmio_write_64(void *ioaddr, uint64_t val); + + +#endif diff --git a/util/meson.build b/util/meson.build index 780b5977a8..acb21592f9 100644 --- a/util/meson.build +++ b/util/meson.build @@ -131,4 +131,6 @@ elif cpu in ['ppc', 'ppc64'] util_ss.add(files('cpuinfo-ppc.c')) elif cpu in ['riscv32', 'riscv64'] util_ss.add(files('cpuinfo-riscv.c')) +elif cpu == 's390x' + util_ss.add(files('s390x_pci_mmio.c')) endif diff --git a/util/s390x_pci_mmio.c b/util/s390x_pci_mmio.c new file mode 100644 index 00..820458a026 --- /dev/null +++ b/util/s390x_pci_mmio.c @@ -0,0 +1,148 @@ +/* + * s390x PCI MMIO definitions + * + * Copyright 2025 IBM Corp. 
+ * Author(s): Farhan Ali + * + * SPDX-License-Identifier: GPL-2.0-or-later + */ + +#include "qemu/osdep.h" +#include +#include +#include "qemu/s390x_pci_mmio.h" +#include "elf.h" + +union register_pair { +unsigned __int128 pair; +struct { +uint64_t even; +uint64_t odd; +}; +}; + +static bool is_mio_supported; + +static __attribute__((constructor)) void check_is_mio_supported(void) +{ +is_mio_supported = !!(qemu_getauxval(AT_HWCAP) & HWCAP_S390_PCI_MIO); +} + +static uint64_t s390x_pcilgi(const void *ioaddr, size_t len) +{ +union register_pair ioaddr_len = { .even = (uint64_t)ioaddr, + .odd = len }; +uint64_t val; +int cc; + +asm volatile( +/* pcilgi */ +".insn rre,0xb9d6,%[val],%[ioaddr_len]\n" +"ipm %[cc]\n" +"srl %[cc],28\n" +: [cc] "=d"(cc), [val] "=d"(val), +[ioaddr_len] "+&d"(ioaddr_len.pair) :: "cc"); + +if (cc) { +val = -1ULL; +} + +return val; +} + +static void s390x_pcistgi(void *ioaddr, uint64_t val, size_t len) +{ +union register_pair ioaddr_len = {.even = (uint64_t)ioaddr, .odd = len}; + +asm volatile ( +/* pcistgi */ +".insn rre,0xb9d4,%[val],%[ioaddr_len]\n" +: [ioaddr_len] "+&d" (ioaddr_len.pair) +: [val] "d" (val) +: "cc", "memory"); +} + +uint8_t s390x_pci_mmio_read_8(const void *ioaddr) +{ +uint8_t val = 0; + +if (is_mio_supported) { +val = s390x_pcilgi(ioaddr, sizeof(val)); +} else { +syscall(__NR_s390_pci_mmio_read, ioaddr, &val, sizeof(val)); +} +return val; +} + +uint16_t s390x_pci_mmio_read_16(const void *ioaddr) +{ +uint16_t val = 0; + +if (is_mio_supported) { +val = s390x_pcilgi(ioaddr, sizeof(val)); +} else { +syscall(__NR_s390_pci_mmio_read, ioaddr, &val, sizeof(val)); +} +return val; +} + +uint32_t s390x_pci_mmio_read_32(const void *ioaddr) +{ +uint32_t val = 0; + +if (is_mio_supported) { +val = s390x_pcilgi(ioaddr, sizeof(val)); +} else { +syscall(__NR_s390_pci_mmio_read, ioaddr, &val, sizeof(val)); +} +return val; +} + +uint64_t s390x_pci_mmio_read_64(const void *ioaddr) +{ +uint64_t val = 0; + +if (is_mio_supported) { +val = s390x_pcilgi(ioaddr, sizeof(val)); +} else { +syscall(__NR_s390_pci_mmio_read, ioaddr, &val, sizeof(val)); +} +return val; +} + +void s390x_pci_mmio_write_8(void *ioaddr, uint8_t val) +{ +if (is_mio_supported) { +s390x_pcistgi(ioaddr, val, sizeof(val)); +} else { +syscall(__NR_s390_pci_mmio_write, ioaddr, &val, sizeof(val)); +} +} + +void s390x_pci_mmio_write_16(void *ioaddr, uint16_t val) +{ +if (is_mio_supported) { +s390x_pcistgi(ioaddr, val, sizeof(val)); +} else { +syscall(__NR_s3
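As a hedged usage sketch, this is roughly how a userspace driver could touch device registers through the helpers declared in include/qemu/s390x_pci_mmio.h above. The BAR pointer would come from the VFIO mmap mentioned in the cover letter; the register offsets (NVMe CSTS at 0x1c, CC at 0x14) are used purely as examples, and the function names are made up:

#include "qemu/s390x_pci_mmio.h"

#define NVME_REG_CC   0x14   /* controller configuration, as an example */
#define NVME_REG_CSTS 0x1c   /* controller status, as an example */

static uint32_t nvme_read_csts(const void *bar0)
{
    /* Uses the PCILGI instruction on z15+, the mmio-read syscall otherwise */
    return s390x_pci_mmio_read_32((const char *)bar0 + NVME_REG_CSTS);
}

static void nvme_write_cc(void *bar0, uint32_t val)
{
    /* Uses the PCISTGI instruction on z15+, the mmio-write syscall otherwise */
    s390x_pci_mmio_write_32((char *)bar0 + NVME_REG_CC, val);
}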
[PATCH v2 1/2] io/block: Refactoring the bdrv_drained_begin() function and implement a timeout mechanism.
From: Zhu Yangyang The bdrv_drained_begin() function is a blocking function. In scenarios where network storage is used and network links fail, it may block for a long time. Therefore, we add a timeout parameter to control the duration of the block. Since bdrv_drained_begin() has been widely adopted, both bdrv_drained_begin() and bdrv_drained_begin_timeout() will be retained. Signed-off-by: Zhu Yangyang --- block/io.c | 58 +--- include/block/aio-wait.h | 49 + include/block/block-io.h | 22 ++- util/aio-wait.c | 5 4 files changed, 123 insertions(+), 11 deletions(-) diff --git a/block/io.c b/block/io.c index 1ba8d1aeea..912b76c4a4 100644 --- a/block/io.c +++ b/block/io.c @@ -255,6 +255,8 @@ typedef struct { bool begin; bool poll; BdrvChild *parent; +uint64_t timeout_ns; +int ret; } BdrvCoDrainData; /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */ @@ -283,6 +285,10 @@ static bool bdrv_drain_poll_top_level(BlockDriverState *bs, return bdrv_drain_poll(bs, ignore_parent, false); } +static int bdrv_do_drained_begin_timeout(BlockDriverState *bs, + BdrvChild *parent, + bool poll, + uint64_t timeout_ns); static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent, bool poll); static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent); @@ -296,7 +302,9 @@ static void bdrv_co_drain_bh_cb(void *opaque) if (bs) { bdrv_dec_in_flight(bs); if (data->begin) { -bdrv_do_drained_begin(bs, data->parent, data->poll); +data->ret = bdrv_do_drained_begin_timeout(bs, data->parent, + data->poll, + data->timeout_ns); } else { assert(!data->poll); bdrv_do_drained_end(bs, data->parent); @@ -310,10 +318,11 @@ static void bdrv_co_drain_bh_cb(void *opaque) aio_co_wake(co); } -static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs, -bool begin, -BdrvChild *parent, -bool poll) +static int coroutine_fn bdrv_co_yield_to_drain_timeout(BlockDriverState *bs, + bool begin, + BdrvChild *parent, + bool poll, + uint64_t timeout_ns) { BdrvCoDrainData data; Coroutine *self = qemu_coroutine_self(); @@ -329,6 +338,8 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs, .begin = begin, .parent = parent, .poll = poll, +.timeout_ns = timeout_ns, +.ret = 0 }; if (bs) { @@ -342,16 +353,27 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs, /* If we are resumed from some other event (such as an aio completion or a * timer callback), it is a bug in the caller that should be fixed. */ assert(data.done); +return data.ret; } -static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent, - bool poll) +static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs, +bool begin, +BdrvChild *parent, +bool poll) +{ +bdrv_co_yield_to_drain_timeout(bs, begin, parent, poll, 0); +} + +static int bdrv_do_drained_begin_timeout(BlockDriverState *bs, + BdrvChild *parent, + bool poll, + uint64_t timeout_ns) { IO_OR_GS_CODE(); if (qemu_in_coroutine()) { -bdrv_co_yield_to_drain(bs, true, parent, poll); -return; +return bdrv_co_yield_to_drain_timeout(bs, true, parent, poll, + timeout_ns); } GLOBAL_STATE_CODE(); @@ -375,8 +397,17 @@ static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent, * nodes. */ if (poll) { -BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent)); +return BDRV_POLL_WHILE_TIMEOUT(bs, + bdrv_drain_poll_top_level(bs, parent), + timeout_ns); } +return 0; +} + +static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent, + bool
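Reduced to a standalone sketch, the mechanism has this shape (an illustration of the pattern, not the patch's code): poll a condition until it clears or a deadline passes, with timeout_ns == 0 meaning wait forever. In QEMU the busy loop below corresponds to the blocking aio_poll() inside BDRV_POLL_WHILE_TIMEOUT().

#include <stdbool.h>
#include <stdint.h>
#include <time.h>
#include <errno.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Returns 0 once cond() is false, or -ETIMEDOUT if the deadline passes */
static int poll_until(bool (*cond)(void *), void *opaque, uint64_t timeout_ns)
{
    uint64_t deadline = timeout_ns ? now_ns() + timeout_ns : 0;

    while (cond(opaque)) {
        if (deadline && now_ns() >= deadline) {
            return -ETIMEDOUT;
        }
        /* in QEMU this is where aio_poll(ctx, true) runs event handlers */
    }
    return 0;
}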
Re: [PATCH 04/15] fuse: Explicitly set inode ID to 1
On Tue, Mar 25, 2025 at 05:06:44PM +0100, Hanna Czenczek wrote: > Setting .st_ino to the FUSE inode ID is kind of arbitrary. While in > practice it is going to be fixed (to FUSE_ROOT_ID, which is 1) because > we only have the root inode, that is not obvious in fuse_getattr(). > > Just explicitly set it to 1 (i.e. no functional change). > > Signed-off-by: Hanna Czenczek > --- > block/export/fuse.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Stefan Hajnoczi
[PATCH v1 0/2] Enable QEMU NVMe userspace driver on s390x
Hi, Recently on s390x we have enabled mmap support for vfio-pci devices [1]. This allows us to take advantage of that and use userspace drivers on s390x. However, on s390x we have special instructions for MMIO access. Starting with z15 (and newer platforms) we have new PCI Memory I/O (MIO) instructions which operate on virtually mapped PCI memory spaces, and can be used from userspace. On older platforms we fall back to using existing system calls for MMIO access. This patch series introduces support for the PCI MIO instructions, and enables s390x support for the userspace NVMe driver. I would appreciate any review/feedback on the patches. Thanks Farhan [1] https://lore.kernel.org/linux-s390/20250226-vfio_pci_mmap-v7-0-c5c0f1d26...@linux.ibm.com/ Farhan Ali (2): util: Add functions for s390x mmio read/write block/nvme: Enable NVMe userspace driver for s390x block/nvme.c | 95 -- include/qemu/s390x_pci_mmio.h | 17 ++ util/meson.build | 2 + util/s390x_pci_mmio.c | 105 ++ 4 files changed, 201 insertions(+), 18 deletions(-) create mode 100644 include/qemu/s390x_pci_mmio.h create mode 100644 util/s390x_pci_mmio.c -- 2.43.0
Re: [PATCH v2 1/2] io/block: Refactoring the bdrv_drained_begin() function and implement a timeout mechanism.
On Fri, Mar 21, 2025 at 03:09:16PM +0800, zoudongjie wrote: > From: Zhu Yangyang > > The bdrv_drained_begin() function is a blocking function. In scenarios where > network storage > is used and network links fail, it may block for a long time. > Therefore, we add a timeout parameter to control the duration of the block. > > Since bdrv_drained_begin() has been widely adopted, both bdrv_drained_begin() > and bdrv_drained_begin_timeout() will be retained. > > Signed-off-by: Zhu Yangyang > --- > block/io.c | 58 +--- > include/block/aio-wait.h | 49 + > include/block/block-io.h | 22 ++- > util/aio-wait.c | 5 > 4 files changed, 123 insertions(+), 11 deletions(-) > > diff --git a/block/io.c b/block/io.c > index 1ba8d1aeea..912b76c4a4 100644 > --- a/block/io.c > +++ b/block/io.c > @@ -255,6 +255,8 @@ typedef struct { > bool begin; > bool poll; > BdrvChild *parent; > +uint64_t timeout_ns; > +int ret; > } BdrvCoDrainData; > > /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */ > @@ -283,6 +285,10 @@ static bool bdrv_drain_poll_top_level(BlockDriverState > *bs, > return bdrv_drain_poll(bs, ignore_parent, false); > } > > +static int bdrv_do_drained_begin_timeout(BlockDriverState *bs, > + BdrvChild *parent, > + bool poll, > + uint64_t timeout_ns); > static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent, >bool poll); > static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent); > @@ -296,7 +302,9 @@ static void bdrv_co_drain_bh_cb(void *opaque) > if (bs) { > bdrv_dec_in_flight(bs); > if (data->begin) { > -bdrv_do_drained_begin(bs, data->parent, data->poll); > +data->ret = bdrv_do_drained_begin_timeout(bs, data->parent, > + data->poll, > + data->timeout_ns); > } else { > assert(!data->poll); > bdrv_do_drained_end(bs, data->parent); > @@ -310,10 +318,11 @@ static void bdrv_co_drain_bh_cb(void *opaque) > aio_co_wake(co); > } > > -static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs, > -bool begin, > -BdrvChild *parent, > -bool poll) > +static int coroutine_fn bdrv_co_yield_to_drain_timeout(BlockDriverState *bs, > + bool begin, > + BdrvChild *parent, > + bool poll, > + uint64_t timeout_ns) > { > BdrvCoDrainData data; > Coroutine *self = qemu_coroutine_self(); > @@ -329,6 +338,8 @@ static void coroutine_fn > bdrv_co_yield_to_drain(BlockDriverState *bs, > .begin = begin, > .parent = parent, > .poll = poll, > +.timeout_ns = timeout_ns, > +.ret = 0 > }; > > if (bs) { > @@ -342,16 +353,27 @@ static void coroutine_fn > bdrv_co_yield_to_drain(BlockDriverState *bs, > /* If we are resumed from some other event (such as an aio completion or > a > * timer callback), it is a bug in the caller that should be fixed. */ > assert(data.done); > +return data.ret; > } > > -static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent, > - bool poll) > +static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs, > +bool begin, > +BdrvChild *parent, > +bool poll) > +{ > +bdrv_co_yield_to_drain_timeout(bs, begin, parent, poll, 0); > +} > + > +static int bdrv_do_drained_begin_timeout(BlockDriverState *bs, > + BdrvChild *parent, > + bool poll, > + uint64_t timeout_ns) > { > IO_OR_GS_CODE(); > > if (qemu_in_coroutine()) { > -bdrv_co_yield_to_drain(bs, true, parent, poll); > -return; > +return bdrv_co_yield_to_drain_timeout(bs, true, parent, poll, > + timeout_ns); > } > > GLOBAL_STATE_CODE(); > @@ -375,8 +397,17 @@ static void bdrv_do_drained_begin(BlockDriverState *bs, > BdrvChild *parent, > * nodes. 
> */ > if (poll) { > -BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent)); > +retur
[PATCH v5] block/file-posix.c: Use pwritev2() with RWF_DSYNC for FUA
Force Unit Access (FUA) is an optimization where a disk write with the flag set will be persisted to disk immediately instead of potentially remaining in the disk's write cache. This commit addresses the TODO about using pwritev2() with RWF_DSYNC in the thread pool section of raw_co_prw(), if pwritev2() with RWF_DSYNC is available on the host, which is always the case for Linux kernels >= 4.7. The intent for FUA is indicated with the BDRV_REQ_FUA flag. The old code paths are preserved in case BDRV_REQ_FUA is off or pwritev2() with RWF_DSYNC is not available. Support for disk writes with FUA is handled in qemu_pwritev_fua(), which uses pwritev2() with RWF_DSYNC if available, otherwise falls back to pwritev2() with no flags followed by a flush using handle_aiocb_flush(). If pwritev2() is not implemented, then the disk write in the linear FUA path will fall back to pwrite() + handle_aiocb_flush(). Signed-off-by: Pinku Deb Nath --- v4: - Add fallback when qemu_pwritev_fua() returns ENOSYS - Similar fallback was not added for handle_aiocb_rw_vector() since there is a preadv_present check in handle_aiocb_rw() v3: - Changed signature to add fd, iov, nr_iov - Return -ENOSYS for non-Linux hosts v2: - Moved handle_aiocb_flush() into qemu_pwritev_fua() - In handle_aiocb_rw_linear(), iovec with iovcnt=1 is created based on the assumption that there will be only one buffer --- block/file-posix.c | 68 ++ 1 file changed, 56 insertions(+), 12 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 56d1972d15..59bed7866a 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -229,6 +229,7 @@ typedef struct RawPosixAIOData { unsigned long op; } zone_mgmt; }; +BdrvRequestFlags flags; } RawPosixAIOData; #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -1674,6 +1675,20 @@ qemu_pwritev(int fd, const struct iovec *iov, int nr_iov, off_t offset) return pwritev(fd, iov, nr_iov, offset); } +static ssize_t +qemu_pwritev_fua(int fd, struct iovec *iov, int nr_iov, off_t offset, const RawPosixAIOData *aiocb) +{ +#ifdef RWF_DSYNC +return pwritev2(fd, iov, nr_iov, offset, RWF_DSYNC); +#else +ssize_t len = pwritev2(fd, iov, nr_iov, offset, 0); +if (len == 0) { +len = handle_aiocb_flush(aiocb); +} +return len; +#endif +} + #else static bool preadv_present = false; @@ -1690,6 +1705,11 @@ qemu_pwritev(int fd, const struct iovec *iov, int nr_iov, off_t offset) return -ENOSYS; } +static ssize_t +qemu_pwritev_fua(int fd, struct iovec *iov, int nr_iov, off_t offset, const RawPosixAIOData *aiocb) +{ +return -ENOSYS; +} #endif static ssize_t handle_aiocb_rw_vector(RawPosixAIOData *aiocb) @@ -1698,10 +1718,16 @@ static ssize_t handle_aiocb_rw_vector(RawPosixAIOData *aiocb) len = RETRY_ON_EINTR( (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) ? -qemu_pwritev(aiocb->aio_fildes, - aiocb->io.iov, - aiocb->io.niov, - aiocb->aio_offset) : +(aiocb->flags & BDRV_REQ_FUA) ?
+qemu_pwritev_fua(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset, +aiocb) : +qemu_pwritev(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset) : qemu_preadv(aiocb->aio_fildes, aiocb->io.iov, aiocb->io.niov, @@ -1727,10 +1753,31 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf) while (offset < aiocb->aio_nbytes) { if (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) { -len = pwrite(aiocb->aio_fildes, - (const char *)buf + offset, - aiocb->aio_nbytes - offset, - aiocb->aio_offset + offset); +if (aiocb->flags & BDRV_REQ_FUA) { +struct iovec iov = { +.iov_base = buf + offset, +.iov_len = aiocb->aio_nbytes - offset, +}; +len = qemu_pwritev_fua(aiocb->aio_fildes, +&iov, +1, +aiocb->aio_offset + offset, +aiocb); +if (len == -ENOSYS) { +len = pwrite(aiocb->aio_fildes, +(const char *)buf + offset, +aiocb->aio_nbytes - offset, +aiocb->aio_offset + offset); +if (len == 0) { +len = handle_aiocb_flush(aiocb); +
Re: Can I contribute Vitastor block driver? Or maybe introduce a QAPI plugin system?
Ok, thank you very much for the information! So I should first try to make my library available in one of the distros used for CI. As I understand these are Alpine, Ubuntu, Debian, Fedora, CentOS and OpenSUSE. OK, I'll try it first :) Regarding QAPI, I'd say it would be cool if it was allowed... I'm not sure if GPL circumvention is a real risk, because in some way it's already possible via the vhost-user protocol %). But it's up to you to decide of course :) > I'm not speaking for the QEMU project. I hope to be helpful anyway. I > am the QAPI maintainer, so my thoughts carry a bit more weight there. > > I understand your block driver depends on your libvitastor_client > library. > > Dependencies that aren't available on at least one supported build > platform (and thus covered by CI) are problematic. For Linux, > "available" generally means "provided by the distribution". I doubt > there's a way around getting your library packaged by distributions. > > The QAPI schema is (for better or worse) fixed at compile time by > design. Changing that would be a major undertaking. Whether the > benefits could justify the costs and risks seems rather doubtful to me. > > In my experience, the project invites contributions, not out-of-tree > extensions. The latter require external interfaces, which can only be > harder to maintain than internal ones. There's also the risk of abuse > to circumvent the GPL (I have absolutely no reason to assume you'd try > that!).
Re: [PATCH 14/15] fuse: Implement multi-threading
On Tue, Apr 01, 2025 at 03:36:40PM -0500, Eric Blake wrote: > On Thu, Mar 27, 2025 at 11:55:57AM -0400, Stefan Hajnoczi wrote: > > On Tue, Mar 25, 2025 at 05:06:54PM +0100, Hanna Czenczek wrote: > > > FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs > > > (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)). > > > > > > We can use this to implement multi-threading. > > > > > > Note that the interface presented here differs from the multi-queue > > > interface of virtio-blk: The latter maps virtqueues to iothreads, which > > > allows processing multiple virtqueues in a single iothread. The > > > equivalent (processing multiple FDs in a single iothread) would not make > > > sense for FUSE because those FDs are used in a round-robin fashion by > > > the FUSE kernel driver. Putting two of them into a single iothread will > > > just create a bottleneck. > > > > This text might be outdated. virtio-blk's new iothread-vq-mapping > > parameter provides the "array of iothreads" mentioned below and a way to > > assign virtqueues to those IOThreads. > > > > > > +++ b/qapi/block-export.json > > > @@ -179,12 +179,18 @@ > > > # mount the export with allow_other, and if that fails, try again > > > # without. (since 6.1; default: auto) > > > # > > > +# @iothreads: Enables multi-threading: Handle requests in each of the > > > +# given iothreads (instead of the block device's iothread, or the > > > +# export's "main" iothread). For this, the FUSE FD is duplicated so > > > +# there is one FD per iothread. (since 10.1) > > > > This option isn't FUSE-specific but FUSE is the first export type to > > support it. Please add it to BlockExportOptions instead and refuse > > export creation when the export type only supports 1 IOThread. > > > > Eric: Are you interested in implementing support for multiple IOThreads > > in the NBD export? I remember some time ago we talked about NBD > > multi-conn support, although maybe that was for the client rather than > > the server. > > The NBD server already supports clients that make requests through > multiple TCP sockets, but right now, that server is not taking > advantage of iothreads to spread the TCP load. > > And yes, I am in the middle of working on adding client NBD multi-conn > support (reviving Rich Jones' preliminary patches on what it would > take to have qemu open parallel TCP sockets to a supporting NBD > server), which also will use a round-robin approach (but here, the > round-robin is something we would code up in qemu, rather than the > behavior handed to us by the FUSE kernel layer). Pinning specific > iothreads to a specific TCP socket may or may not make sense, but I > definitely want to have support for handing a pool of iothreads to an > NBD client that will be using multi-conn. Here I'm asking Hanna to make the "iothreads" export parameter generic for all export types (not just for FUSE). Do you think that the NBD export will be able to use the generic parameter or would you prefer an NBD-specific export parameter? Stefan
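For reference, the FD cloning the commit message describes boils down to the following hedged sketch of the kernel interface (FUSE_DEV_IOC_CLONE comes from <linux/fuse.h>; the function name is made up and error handling is abbreviated):

#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fuse.h>

/*
 * Clone an existing FUSE session FD; the kernel then distributes
 * incoming requests round-robin across the original FD and its clones,
 * so each worker thread can read from its own FD.
 */
static int fuse_clone_session_fd(int session_fd)
{
    uint32_t src = session_fd;
    int clone_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC);

    if (clone_fd < 0) {
        return -1;
    }
    if (ioctl(clone_fd, FUSE_DEV_IOC_CLONE, &src) < 0) {
        close(clone_fd);
        return -1;
    }
    return clone_fd;
}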
Re: [PATCH 10/15] fuse: Add halted flag
On Tue, Mar 25, 2025 at 05:06:50PM +0100, Hanna Czenczek wrote: > This is a flag that we will want when processing FUSE requests > ourselves: When the kernel sends us e.g. a truncated request (i.e. we > receive less data than the request's indicated length), we cannot rely > on subsequent data to be valid. Then, we are going to set this flag, > halting all FUSE request processing. > > We plan to only use this flag in cases that would effectively be kernel > bugs. > > (Right now, the flag is unused because libfuse still does our request > processing.) > > Signed-off-by: Hanna Czenczek > --- > block/export/fuse.c | 30 ++ > 1 file changed, 30 insertions(+) Reviewed-by: Stefan Hajnoczi
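A hedged sketch of how such a flag would be consulted once QEMU reads requests itself; the halted field and fuse_detach_handlers() are taken from elsewhere in this series (see patch 08), while the function name here is illustrative only:

#include <linux/fuse.h>

static void fuse_check_request(FuseExport *exp, ssize_t read_len)
{
    if (read_len < (ssize_t)sizeof(struct fuse_in_header)) {
        /* Truncated request: effectively a kernel bug, so stop all
         * further FUSE request processing for this export. */
        exp->halted = true;
        fuse_detach_handlers(exp);
    }
}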