Re: [PATCH] [for-10.1] qapi/block-core: deprecate some block-job- APIs
Vladimir Sementsov-Ogievskiy writes: > On 04.04.25 17:13, Markus Armbruster wrote: [...] >> So, auto-finalize=true is silently ignored when another job in the same >> transaction has auto-finalize=false? > > Yes, at least, it looks like so: > > static void job_completed_txn_success_locked(Job *job) > { > > [...] > > /* If no jobs need manual finalization, automatically do so */ > if (job_txn_apply_locked(job, job_needs_finalize_locked) == 0) { > job_do_finalize_locked(job); > } > } Silently ignoring what the user specified is not okay whether the user's instructions make sense or not. Fixing this UI bug would break usage that relies on it. Should we care? [...]
[PATCH] block/file-posix.c: Use pwritev2() with RWF_DSYNC for FUA
Force Unit Access (FUA) is an optimization where a disk write with the flag set will be persisted to disk immediately instead of potentially remaining in the disk's write cache. This commit addresses the TODO about using pwritev2() with RWF_DSYNC in the thread pool section of raw_co_prw(), if pwritev2() with RWF_DSYNC is available on the host, which is always the case for Linux kernels >= 4.7. The intent for FUA is indicated with the BDRV_REQ_FUA flag. The old code paths are preserved in case BDRV_REQ_FUA is off or pwritev2() with RWF_DSYNC is not available. During testing, I observed that the BDRV_REQ_FUA flag is always turned on when blk->enable_write_cache is not set in block/block-backend.c, so I commented this section out during testing: https://gitlab.com/qemu-project/qemu/-/blob/master/block/block-backend.c?ref_type=heads#L1432-1434 Signed-off-by: Pinku Deb Nath --- block/file-posix.c | 42 ++ 1 file changed, 34 insertions(+), 8 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 56d1972d15..34de816eab 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -229,6 +229,7 @@ typedef struct RawPosixAIOData { unsigned long op; } zone_mgmt; }; +BdrvRequestFlags flags; } RawPosixAIOData; #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -1674,6 +1675,16 @@ qemu_pwritev(int fd, const struct iovec *iov, int nr_iov, off_t offset) return pwritev(fd, iov, nr_iov, offset); } +static ssize_t +qemu_pwrite_fua(int fd, const struct iovec *iov, int nr_iov, off_t offset) +{ +#ifdef RWF_DSYNC +return pwritev2(fd, iov, nr_iov, offset, RWF_DSYNC); +#else +return pwritev2(fd, iov, nr_iov, offset, 0); +#endif +} + #else static bool preadv_present = false; @@ -1698,10 +1709,15 @@ static ssize_t handle_aiocb_rw_vector(RawPosixAIOData *aiocb) len = RETRY_ON_EINTR( (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) ? -qemu_pwritev(aiocb->aio_fildes, - aiocb->io.iov, - aiocb->io.niov, - aiocb->aio_offset) : +(aiocb->flags & BDRV_REQ_FUA) ? +qemu_pwrite_fua(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset) : +qemu_pwritev(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset) : qemu_preadv(aiocb->aio_fildes, aiocb->io.iov, aiocb->io.niov, @@ -1727,10 +1743,17 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf) while (offset < aiocb->aio_nbytes) { if (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) { -len = pwrite(aiocb->aio_fildes, - (const char *)buf + offset, - aiocb->aio_nbytes - offset, - aiocb->aio_offset + offset); +if (aiocb->flags & BDRV_REQ_FUA) { +len = qemu_pwrite_fua(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset); +} else { +len = pwrite(aiocb->aio_fildes, +(const char *)buf + offset, +aiocb->aio_nbytes - offset, +aiocb->aio_offset + offset); +} } else { len = pread(aiocb->aio_fildes, buf + offset, @@ -2539,14 +2562,17 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr, .iov= qiov->iov, .niov = qiov->niov, }, +.flags = flags, }; assert(qiov->size == bytes); ret = raw_thread_pool_submit(handle_aiocb_rw, &acb); +#ifndef RWF_DSYNC if (ret == 0 && (flags & BDRV_REQ_FUA)) { /* TODO Use pwritev2() instead if it's available */ ret = raw_co_flush_to_disk(bs); } +#endif goto out; /* Avoid the compiler err of unused label */ out: -- 2.43.0
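For readers unfamiliar with the interface: the following minimal, standalone C sketch (not QEMU code, and not part of the patch) illustrates the pwritev2()/RWF_DSYNC semantics the commit builds on. The helper name is made up and the error handling is simplified; RWF_DSYNC requires Linux >= 4.7, and the fallback mirrors what raw_co_prw() already does today, namely a plain write followed by an explicit flush.

#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>
#include <errno.h>

/* Write iov at offset and make it durable, like a FUA disk write. */
static ssize_t write_fua(int fd, const struct iovec *iov, int iovcnt,
                         off_t offset)
{
#ifdef RWF_DSYNC
    ssize_t len = pwritev2(fd, iov, iovcnt, offset, RWF_DSYNC);
    if (len >= 0 || (errno != ENOSYS && errno != ENOTSUP)) {
        return len;
    }
    /* Kernel too old for pwritev2(): fall back to write + fdatasync */
#endif
    ssize_t ret = pwritev(fd, iov, iovcnt, offset);
    if (ret >= 0 && fdatasync(fd) != 0) {
        return -1;
    }
    return ret;
}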
[PATCH 07/11] docs/sphinx/qmp_lexer: Highlight elisions like comments, not prompts
Signed-off-by: Markus Armbruster --- docs/sphinx/qmp_lexer.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/sphinx/qmp_lexer.py b/docs/sphinx/qmp_lexer.py index 1bd1b81b70..7b3b808d12 100644 --- a/docs/sphinx/qmp_lexer.py +++ b/docs/sphinx/qmp_lexer.py @@ -24,7 +24,7 @@ class QMPExampleMarkersLexer(RegexLexer): 'root': [ (r'-> ', token.Generic.Prompt), (r'<- ', token.Generic.Prompt), -(r'\.{3}( .* \.{3})?', token.Generic.Prompt), +(r'\.{3}( .* \.{3})?', token.Comment.Multiline), ] } -- 2.48.1
Re: [PATCH v3 0/3] Enable QEMU NVMe userspace driver on s390x
On Tue, Apr 01, 2025 at 10:22:43AM -0700, Farhan Ali wrote: > Hi, > > Recently on s390x we have enabled mmap support for vfio-pci devices [1]. Hi Alex, I wanted to bring this to your attention. Feel free to merge it through the VFIO tree, otherwise I will merge it once you have taken a look. Thanks, Stefan > This allows us to take advantage and use userspace drivers on s390x. However, > on s390x we have special instructions for MMIO access. Starting with z15 > (and newer platforms) we have new PCI Memory I/O (MIO) instructions which > operate on virtually mapped PCI memory spaces, and can be used from userspace. > On older platforms we would fall back to using existing system calls for MMIO > access. > > This patch series introduces support for the PCI MIO instructions, and enables > s390x > support for the userspace NVMe driver on s390x. I would appreciate any > review/feedback > on the patches. > > Thanks > Farhan > > [1] > https://lore.kernel.org/linux-s390/20250226-vfio_pci_mmap-v7-0-c5c0f1d26...@linux.ibm.com/ > > ChangeLog > - > v2 series https://mail.gnu.org/archive/html/qemu-devel/2025-03/msg06847.html > v2 -> v3 > - Update the PCI MMIO APIs to reflect that it's PCI MMIO access on the host > as suggested by Stefan (patch 2) > - Move s390x ifdef check to s390x_pci_mmio.h as suggested by Philippe > (patch 1) > - Add R-bs for the respective patches. > > v1 series https://mail.gnu.org/archive/html/qemu-devel/2025-03/msg06596.html > v1 -> v2 > - Add 8 and 16 bit reads/writes for completeness (patch 1) > - Introduce new QEMU PCI MMIO read/write API as suggested by Stefan > (patch 2) > - Update NVMe userspace driver to use QEMU PCI MMIO functions (patch 3) > > Farhan Ali (3): > util: Add functions for s390x mmio read/write > include: Add a header to define host PCI MMIO functions > block/nvme: Use host PCI MMIO API > > block/nvme.c | 37 + > include/qemu/host-pci-mmio.h | 116 ++ > include/qemu/s390x_pci_mmio.h | 24 ++ > util/meson.build | 2 + > util/s390x_pci_mmio.c | 148 ++ > 5 files changed, 311 insertions(+), 16 deletions(-) > create mode 100644 include/qemu/host-pci-mmio.h > create mode 100644 include/qemu/s390x_pci_mmio.h > create mode 100644 util/s390x_pci_mmio.c > > -- > 2.43.0
Re: [PATCH v3 0/2] qapi/throttle: Fix qmp_block_set_io_throttle blocked for too long
On Wed, Mar 26, 2025 at 05:26:32PM +0800, zoudongjie wrote: > From: Zhu Yangyang > > Calling qmp_block_set_io_throttle() can be blocked for a long time > when a network disk is configured and the network failure is just about > to occur. > > This series adds a timeout parameter for qmp_block_set_io_throttle to control > its execution duration. > > Changelog > v3 --- > Unify AIO_WAIT_WHILE_{TIMEOUT/INTERNAL} by replacing > AIO_WAIT_WHILE_INTERNAL() with > AIO_WAIT_WHILE_TIMEOUT(..., 0). > > v2 > 1. Support 0 in BDRV_POLL_WHILE_TIMEOUT(), 0 means infinite. > 2. Use uint64_t timeout_ns instead of int64 timeout to name variables. > 3. Use timer_pending() to check for expiry instead of explicitly checking > against the deadline for BDRV_POLL_WHILE_TIMEOUT(). > 4. Add documentation for bdrv_drained_begin_timeout(), note that > bdrv_drained_end() > must be called when -ETIMEDOUT is returned. > 5. Add a timeout parameter to the qmp_block_set_io_throttle() instead of > hardcoding > the timeout, and the default value is 0, meaning an infinite timeout. > > v1 patch link: > https://lore.kernel.org/qemu-devel/20250308101618.721954-1-zoudong...@huawei.com/ > > Zhu Yangyang (2): > io/block: Refactoring the bdrv_drained_begin() function and implement > a timeout mechanism. > qapi/throttle: add timeout parameter for qmp_block_set_io_throttle() > > block/block-backend.c | 14 - > block/io.c | 58 + > block/qapi-system.c | 10 +++- > include/block/aio-wait.h| 47 - > include/block/block-io.h| 22 +++- > include/system/block-backend-global-state.h | 1 + > qapi/block-core.json| 5 +- > util/aio-wait.c | 5 ++ > 8 files changed, 135 insertions(+), 27 deletions(-) > > -- > 2.33.0 > Aside from Markus' comments: Reviewed-by: Stefan Hajnoczi
Re: [PATCH v2 2/4] docs, qapi: generate undocumented return sections
On Tue, Apr 1, 2025 at 2:07 AM Markus Armbruster wrote: > John Snow writes: > > > On Thu, Mar 27, 2025 at 5:11 AM Markus Armbruster > wrote: > > > >> John Snow writes: > >> > >> > This patch changes the qapidoc transmogrifier to generate Return value > >> > documentation for any command that has a return value but hasn't > >> > explicitly documented that return value. > >> > > >> > Signed-off-by: John Snow > >> > >> Might want to briefly explain placement of the auto-generated return > >> value documentation. But before we discuss that any further, let's > >> review the actual changes to the generated docs. > >> > >> This patch adds auto-generated return value documentation where we have > >> none. > >> > >> The next patch replaces handwritten by auto-generated return value > >> documentation where these are at least as good. Moves the return value > >> docs in some cases. > >> > >> First the additions: > >> > >> * x-debug-query-block-graph > >> > >> Title, intro, features, return > >> > >> * query-tpm > >> > >> Title, intro, return, example > >> > >> * query-dirty-rate > >> > >> Title, intro, arguments, return, examples > >> > >> * query-vcpu-dirty-limit > >> > >> Title, intro, return, example > >> > >> * query-vm-generation-id > >> > >> Title, return > >> > >> * query-memory-size-summary > >> > >> Title, intro, example, return > >> > >> * query-memory-devices > >> > >> Title, intro, return, example > >> > >> * query-acpi-ospm-status > >> > >> Title, intro, return, example > >> > >> * query-stats-schemas > >> > >> Title, intro, arguments, note, return > >> > >> Undesirable: > >> > >> * query-memory-size-summary has returns after the example instead of > >> before. I figure it needs the TODO hack to separate intro and example > >> (see announce-self). > >> > > > > Yes > > > > > >> > >> * query-stats-schemas has a note between arguments and return. I think > >> this demonstrates that the placement algorithm is too simplistic. > >> > > > > Yeah ... It's just that *glances at length of email* it's been a > sensitive > > topic without a lot of certainty in desire, so I've tried to keep things > on > > the stupid/simple side ... > > When the best solution is unclear, starting discussion with a simplistic > solution is smart. Beats starting with a complicated solution that gets > shot down. > > >> Debatable: > >> > >> * x-debug-query-block-graph has returns after features. I'd prefer > >> returns before features. No need to debate this now. > >> > > > > Willing to do this for you if you'd like to enforce it globally. I'd be > > happy with any mandated order of sections, really. > > Could a more rigid order help the inliner, too? > Oh, absolutely. My series that adds it still assumes a rigid ordering. It just never enforced it. I don't really care what the order even is, just as long as there is one.
> > >> Next the movements: > >> > >> * x-debug-block-dirty-bitmap-sha256 > >> > >> From right before errors to right after > >> > >> * blockdev-snapshot-delete-internal-sync > >> > >> From right before errors to right after > >> > >> * query-xen-replication-status > >> > >> From between intro and example to the end > >> > >> * query-colo-status > >> > >> From between intro and example to the end > >> > >> * query-balloon > >> > >> From right before errors to right after > >> > >> * query-hv-balloon-status-report > >> > >> From right before errors to right after > >> > >> * query-yank > >> > >> From between intro and example to the end > >> > >> * add-fd > >> > >> From between arguments and errors to between last note and example > >> > >> I don't like any of these :) > >> > > > > So it goes ... > > > > > >> > >> Undesirable: > >> > >> * query-xen-replication-status, query-yank, and query-colo-status now > >> have return after the example instead of before. I figure they now > >> need the TODO hack to separate intro and example. > >> > > > > Yes > > > > > >> > >> * add-fd now has a note between arguments and return. Same placement > >> weakness as for query-stats above. > >> > >> Debatable: > >> > >> * x-debug-block-dirty-bitmap-sha256, > >> blockdev-snapshot-delete-internal-sync, query-colo-status, and > >> query-hv-balloon-status-report now have return after errors instead of > >> before. I'd prefer before. > >> > >> What's the stupidest acceptable placement algorithm? Maybe this one: > >> > >> 1. If we have arguments, return goes right after them. > >> > >> 2. Else if we have errors, return goes right before them. > >> > >> 3. Else if we have features, return goes right before them. > >> > >> 4. Else return goes right after the intro (to make this work, we need > >>a few more TODO hacks). > >> > >> Would you be willing to give this a try? > >> > > > > Probably ... > > > > So this algorithm seems to imply a preference for this ordering: > > > > 1. Intro > > 2. Arguments > > 3. Return > > 4. Errors > > 5. Features > > 6. Details > > > > Do
[PATCH 08/15] fuse: Introduce fuse_{at,de}tach_handlers()
Pull setting up and tearing down the AIO context handlers into two dedicated functions. Signed-off-by: Hanna Czenczek --- block/export/fuse.c | 32 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/block/export/fuse.c b/block/export/fuse.c index 2df6297d61..bd98809d71 100644 --- a/block/export/fuse.c +++ b/block/export/fuse.c @@ -78,27 +78,34 @@ static void read_from_fuse_export(void *opaque); static bool is_regular_file(const char *path, Error **errp); -static void fuse_export_drained_begin(void *opaque) +static void fuse_attach_handlers(FuseExport *exp) { -FuseExport *exp = opaque; +aio_set_fd_handler(exp->common.ctx, + fuse_session_fd(exp->fuse_session), + read_from_fuse_export, NULL, NULL, NULL, exp); +exp->fd_handler_set_up = true; +} +static void fuse_detach_handlers(FuseExport *exp) +{ aio_set_fd_handler(exp->common.ctx, fuse_session_fd(exp->fuse_session), NULL, NULL, NULL, NULL, NULL); exp->fd_handler_set_up = false; } +static void fuse_export_drained_begin(void *opaque) +{ +fuse_detach_handlers(opaque); +} + static void fuse_export_drained_end(void *opaque) { FuseExport *exp = opaque; /* Refresh AioContext in case it changed */ exp->common.ctx = blk_get_aio_context(exp->common.blk); - -aio_set_fd_handler(exp->common.ctx, - fuse_session_fd(exp->fuse_session), - read_from_fuse_export, NULL, NULL, NULL, exp); -exp->fd_handler_set_up = true; +fuse_attach_handlers(exp); } static bool fuse_export_drained_poll(void *opaque) @@ -209,11 +216,7 @@ static int fuse_export_create(BlockExport *blk_exp, g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL); -aio_set_fd_handler(exp->common.ctx, - fuse_session_fd(exp->fuse_session), - read_from_fuse_export, NULL, NULL, NULL, exp); -exp->fd_handler_set_up = true; - +fuse_attach_handlers(exp); return 0; fail: @@ -329,10 +332,7 @@ static void fuse_export_shutdown(BlockExport *blk_exp) fuse_session_exit(exp->fuse_session); if (exp->fd_handler_set_up) { -aio_set_fd_handler(exp->common.ctx, - fuse_session_fd(exp->fuse_session), - NULL, NULL, NULL, NULL, NULL); -exp->fd_handler_set_up = false; +fuse_detach_handlers(exp); } } -- 2.48.1
Re: [PATCH v9 4/7] qapi: add blockdev-replace command
On 18.10.24 16:59, Kevin Wolf wrote: If we want to get rid of the union, I think the best course of action would be unifying the namespaces (so that nodes, exports and devices can't share the same ID) and then we could just accept a universal 'id' along with 'child'. Maybe we can go this way even without an explicit restriction (which would somehow have to go through a deprecation period, etc.), but simply look the ID up among nodes, devices and exports, and fail if more than one parent is found. And we document that the ID must not be ambiguous, i.e. must not match more than one parent object. Then those who want to use the new command will take care to use unique IDs. -- Best regards, Vladimir
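For concreteness, here is a hedged sketch of the unified lookup Vladimir describes, based on lookup helpers that exist in QEMU today (bdrv_find_node(), blk_by_name(), blk_exp_find()); the function name and error wording are illustrative only, and the necessary headers are omitted:

static bool parent_id_is_unique(const char *id, Error **errp)
{
    /* Each helper returns NULL when the ID is not in its namespace */
    BlockDriverState *bs  = bdrv_find_node(id);
    BlockBackend     *blk = blk_by_name(id);
    BlockExport      *exp = blk_exp_find(id);

    if (!!bs + !!blk + !!exp > 1) {
        error_setg(errp, "ID '%s' matches more than one parent object", id);
        return false;
    }
    return true;
}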
[PATCH v2 4/4] qapi: rephrase return docs to avoid type name
Well, I tried. Maybe not very hard. Sorry! Signed-off-by: John Snow --- qapi/block-core.json | 6 +++--- qapi/block-export.json | 2 +- qapi/block.json | 2 +- qapi/control.json| 5 ++--- qapi/dump.json | 5 ++--- qapi/introspect.json | 6 +++--- qapi/job.json| 2 +- qapi/machine-target.json | 7 +++ qapi/misc-target.json| 2 +- qapi/misc.json | 5 ++--- qapi/net.json| 2 +- qapi/pci.json| 2 +- qapi/qdev.json | 3 +-- qapi/qom.json| 8 +++- qapi/stats.json | 2 +- qapi/trace.json | 2 +- qapi/ui.json | 2 +- qapi/virtio.json | 6 +++--- 18 files changed, 31 insertions(+), 38 deletions(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index 92b2e368b72..eb97b70cd80 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -763,7 +763,7 @@ # # Get a list of BlockInfo for all virtual block devices. # -# Returns: a list of @BlockInfo describing each virtual block device. +# Returns: a list describing each virtual block device. # Filter nodes that were created implicitly are skipped over. # # Since: 0.14 @@ -1168,7 +1168,7 @@ # nodes that were created implicitly are skipped over in this # mode. (Since 2.3) # -# Returns: A list of @BlockStats for each virtual block devices. +# Returns: A list of statistics for each virtual block device. # # Since: 0.14 # @@ -1440,7 +1440,7 @@ # # Return information about long-running block device operations. # -# Returns: a list of @BlockJobInfo for each active block job +# Returns: a list of job info for each active block job # # Since: 1.1 ## diff --git a/qapi/block-export.json b/qapi/block-export.json index c783e01a532..84852606e52 100644 --- a/qapi/block-export.json +++ b/qapi/block-export.json @@ -472,7 +472,7 @@ ## # @query-block-exports: # -# Returns: A list of BlockExportInfo describing all block exports +# Returns: A list describing all block exports # # Since: 5.2 ## diff --git a/qapi/block.json b/qapi/block.json index e6f5c64..bdbbe78854f 100644 --- a/qapi/block.json +++ b/qapi/block.json @@ -86,7 +86,7 @@ # Returns a list of information about each persistent reservation # manager. # -# Returns: a list of @PRManagerInfo for each persistent reservation +# Returns: a list of manager info for each persistent reservation # manager # # Since: 3.0 diff --git a/qapi/control.json b/qapi/control.json index 336386f79e1..2e45bf25df8 100644 --- a/qapi/control.json +++ b/qapi/control.json @@ -93,8 +93,7 @@ # # Returns the current version of QEMU. # -# Returns: A @VersionInfo object describing the current version of -# QEMU. +# Returns: An object describing the current version of QEMU. # # Since: 0.14 # @@ -131,7 +130,7 @@ # # Return a list of supported QMP commands by this server # -# Returns: A list of @CommandInfo for all supported commands +# Returns: A list of all supported commands # # Since: 0.14 # diff --git a/qapi/dump.json b/qapi/dump.json index d7826c0e323..1bd6bacc5ce 100644 --- a/qapi/dump.json +++ b/qapi/dump.json @@ -146,7 +146,7 @@ # # Query latest dump status. # -# Returns: A @DumpStatus object showing the dump status. +# Returns: An object showing the dump status. 
# # Since: 2.6 # @@ -197,8 +197,7 @@ # # Returns the available formats for dump-guest-memory # -# Returns: A @DumpGuestMemoryCapability object listing available -# formats for dump-guest-memory +# Returns: An object listing available formats for dump-guest-memory # # Since: 2.0 # diff --git a/qapi/introspect.json b/qapi/introspect.json index 01bb242947c..7daec5045fb 100644 --- a/qapi/introspect.json +++ b/qapi/introspect.json @@ -34,10 +34,10 @@ # string into a specific enum or from one specific type into an # alternate that includes the original type alongside something else. # -# Returns: array of @SchemaInfo, where each element describes an -# entity in the ABI: command, event, type, ... +# Returns: an array where each element describes an entity in the ABI: +# command, event, type, ... # -# The order of the various SchemaInfo is unspecified; however, all +# The order of the various elements is unspecified; however, all # names are guaranteed to be unique (no name will be duplicated # with different meta-types). # diff --git a/qapi/job.json b/qapi/job.json index cfc3beedd21..856dd688f95 100644 --- a/qapi/job.json +++ b/qapi/job.json @@ -269,7 +269,7 @@ # # Return information about jobs. # -# Returns: a list with a @JobInfo for each active job +# Returns: a list with info for each active job # # Since: 3.0 ## diff --git a/qapi/machine-target.json b/qapi/machine-target.json index 37e75520094..6d8a6e53436 100644 --- a/qapi/machine-target.json +++ b/qapi/machine-target.json @@ -162,8 +162,7 @@ # @modelb: description of the second CPU model to compare, referred to # as "model B" in CpuModelCompareResult # -#
[PATCH v2 1/2] block/file-posix.c: Use pwritev2() with RWF_DSYNC for FUA
Force Unit Access (FUA) is an optimization where a disk write with the flag set will be persisted to disk immediately instead of potentially remaining in the disk's write cache. This commit addresses the TODO about using pwritev2() with RWF_DSYNC in the thread pool section of raw_co_prw(), if pwritev2() with RWF_DSYNC is available on the host, which is always the case for Linux kernels >= 4.7. The intent for FUA is indicated with the BDRV_REQ_FUA flag. The old code paths are preserved in case BDRV_REQ_FUA is off or pwritev2() with RWF_DSYNC is not available. During testing, I observed that the BDRV_REQ_FUA flag is always turned on when blk->enable_write_cache is not set in block/block-backend.c, so I commented this section out during testing: https://gitlab.com/qemu-project/qemu/-/blob/master/block/block-backend.c?ref_type=heads#L1432-1434 Signed-off-by: Pinku Deb Nath --- block/file-posix.c | 42 ++ 1 file changed, 34 insertions(+), 8 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 56d1972d15..34de816eab 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -229,6 +229,7 @@ typedef struct RawPosixAIOData { unsigned long op; } zone_mgmt; }; +BdrvRequestFlags flags; } RawPosixAIOData; #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -1674,6 +1675,16 @@ qemu_pwritev(int fd, const struct iovec *iov, int nr_iov, off_t offset) return pwritev(fd, iov, nr_iov, offset); } +static ssize_t +qemu_pwrite_fua(int fd, const struct iovec *iov, int nr_iov, off_t offset) +{ +#ifdef RWF_DSYNC +return pwritev2(fd, iov, nr_iov, offset, RWF_DSYNC); +#else +return pwritev2(fd, iov, nr_iov, offset, 0); +#endif +} + #else static bool preadv_present = false; @@ -1698,10 +1709,15 @@ static ssize_t handle_aiocb_rw_vector(RawPosixAIOData *aiocb) len = RETRY_ON_EINTR( (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) ? -qemu_pwritev(aiocb->aio_fildes, - aiocb->io.iov, - aiocb->io.niov, - aiocb->aio_offset) : +(aiocb->flags & BDRV_REQ_FUA) ? +qemu_pwrite_fua(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset) : +qemu_pwritev(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset) : qemu_preadv(aiocb->aio_fildes, aiocb->io.iov, aiocb->io.niov, @@ -1727,10 +1743,17 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf) while (offset < aiocb->aio_nbytes) { if (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) { -len = pwrite(aiocb->aio_fildes, - (const char *)buf + offset, - aiocb->aio_nbytes - offset, - aiocb->aio_offset + offset); +if (aiocb->flags & BDRV_REQ_FUA) { +len = qemu_pwrite_fua(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset); +} else { +len = pwrite(aiocb->aio_fildes, +(const char *)buf + offset, +aiocb->aio_nbytes - offset, +aiocb->aio_offset + offset); +} } else { len = pread(aiocb->aio_fildes, buf + offset, @@ -2539,14 +2562,17 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, int64_t *offset_ptr, .iov= qiov->iov, .niov = qiov->niov, }, +.flags = flags, }; assert(qiov->size == bytes); ret = raw_thread_pool_submit(handle_aiocb_rw, &acb); +#ifndef RWF_DSYNC if (ret == 0 && (flags & BDRV_REQ_FUA)) { /* TODO Use pwritev2() instead if it's available */ ret = raw_co_flush_to_disk(bs); } +#endif goto out; /* Avoid the compiler err of unused label */ out: -- 2.43.0
Re: [PATCH 01/15] fuse: Copy write buffer content before polling
On Tue, Mar 25, 2025 at 05:06:35PM +0100, Hanna Czenczek wrote: > Polling in I/O functions can lead to nested read_from_fuse_export() > calls, overwriting the request buffer's content. The only function > affected by this is fuse_write(), which therefore must use a bounce > buffer or corruption may occur. > > Note that in addition we do not know whether libfuse-internal structures > can cope with this nesting, and even if we did, we probably cannot rely > on it in the future. This is the main reason why we want to remove > libfuse from the I/O path. > > @@ -624,6 +630,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, > const char *buf, > size_t size, off_t offset, struct fuse_file_info *fi) > { > FuseExport *exp = fuse_req_userdata(req); > +void *copied; Do we have a good way to annotate variables that require qemu_vfree() if non-NULL for automatic cleanup? If so, should this be annotated and set to NULL here,... > int64_t length; > int ret; > > @@ -638,6 +645,14 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, > const char *buf, > return; > } > > +/* > + * Heed the note on read_from_fuse_export(): If we poll (which any > blk_*() > + * I/O function may do), read_from_fuse_export() may be nested, > overwriting > + * the request buffer content. Therefore, we must copy it here. > + */ > +copied = blk_blockalign(exp->common.blk, size); > +memcpy(copied, buf, size); > + > /** > * Clients will expect short writes at EOF, so we have to limit > * offset+size to the image length. > @@ -645,7 +660,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, > const char *buf, > length = blk_getlength(exp->common.blk); > if (length < 0) { > fuse_reply_err(req, -length); > -return; > +goto free_buffer; ...so that this and similar hunks are not needed? -- Eric Blake, Principal Software Engineer Red Hat, Inc. Virtualization: qemu.org | libguestfs.org
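On the annotation question: the usual mechanism is the same __attribute__((cleanup)) trick behind GLib's g_autofree. A hedged sketch of what a qemu_vfree() variant could look like (the AUTO_VFREE name and helper are hypothetical, not existing QEMU API):

/* Called automatically when an AUTO_VFREE variable goes out of scope */
static inline void auto_vfree_cleanup(void *ptr)
{
    void **p = ptr;
    if (*p) {
        qemu_vfree(*p);
    }
}
#define AUTO_VFREE __attribute__((cleanup(auto_vfree_cleanup)))

With such a macro, fuse_write() could declare "AUTO_VFREE void *copied = NULL;" and drop the free_buffer label entirely, since every return path would then free the bounce buffer automatically.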
[PATCH v4 0/3] qapi: Fix some command blocked for too long
From: Zhu Yangyang QMP may be blocked for a long time on bdrv_drained_begin() when a network disk is configured and the network failure is just about to occur. In theory, any command may be blocked if it calls bdrv_drained_begin(). This series adds a timeout parameter for qmp_block_set_io_throttle() and qmp_block_resize() to control their execution duration. Changelog v4 --- 1. Corrected the added version of the timeout parameter from 10.0 to 10.1 2. Add a patch to add the timeout parameter for qmp_block_resize(). v3 --- Unify AIO_WAIT_WHILE_{TIMEOUT/INTERNAL} by replacing AIO_WAIT_WHILE_INTERNAL() with AIO_WAIT_WHILE_TIMEOUT(..., 0). v2 --- 1. Support 0 in BDRV_POLL_WHILE_TIMEOUT(), 0 means infinite. 2. Use uint64_t timeout_ns instead of int64 timeout to name variables. 3. Use timer_pending() to check for expiry instead of explicitly checking against the deadline for BDRV_POLL_WHILE_TIMEOUT(). 4. Add documentation for bdrv_drained_begin_timeout(), note that bdrv_drained_end() must be called when -ETIMEDOUT is returned. 5. Add a timeout parameter to the qmp_block_set_io_throttle() instead of hardcoding the timeout, and the default value is 0, meaning an infinite timeout. v1 patch link: https://lore.kernel.org/qemu-devel/20250308101618.721954-1-zoudong...@huawei.com/ Zhu Yangyang (3): io/block: Refactoring the bdrv_drained_begin() function and implement a timeout mechanism. qapi/throttle: add timeout parameter for qmp_block_set_io_throttle() qapi/resize: add timeout parameter for qmp_block_resize() block/block-backend.c | 14 - block/io.c | 58 + block/monitor/block-hmp-cmds.c | 2 +- block/qapi-system.c | 10 +++- blockdev.c | 16 +- include/block/aio-wait.h| 47 - include/block/block-io.h| 22 +++- include/system/block-backend-global-state.h | 1 + qapi/block-core.json| 11 +++- util/aio-wait.c | 5 ++ 10 files changed, 155 insertions(+), 31 deletions(-) -- 2.33.0
Re: [PATCH 02/15] fuse: Ensure init clean-up even with error_fatal
On Tue, Mar 25, 2025 at 05:06:42PM +0100, Hanna Czenczek wrote: > When exports are created on the command line (with the storage daemon), > errp is going to point to error_fatal. Without ERRP_GUARD, we would > exit immediately when *errp is set, i.e. skip the clean-up code under > the `fail` label. Use ERRP_GUARD so we always run that code. > > As far as I know, this has no actual impact right now[1], but it is > still better to make this right. > > [1] Not cleaning up the mount point is the only thing I can imagine > would be problematic, but that is the last thing we attempt, so if > it fails, it will clean itself up. > > Signed-off-by: Hanna Czenczek > --- > block/export/fuse.c | 1 + > 1 file changed, 1 insertion(+) Reviewed-by: Stefan Hajnoczi
[PULL 1/1] Revert "iotests: Stop NBD server in test 162 before starting the next one"
From: Thomas Huth This reverts commit e2668ba1ed44ad56f2f1653ff5f53b277d534fac. This commit made test 162 fail occasionally with: 162 fail [13:06:40] [13:06:40] 0.2s (last: 0.2s) output mismatch --- tests/qemu-iotests/162.out +++ tests/qemu-iotests/scratch/qcow2-file-162/162.out.bad @@ -3,6 +3,7 @@ === NBD === qemu-img: Could not open 'json:{"driver": "nbd", "host": -1}': address resolution failed for -1:10809: Name or service not known image: nbd://localhost:PORT +./common.rc: line 371: kill: (891116) - No such process image: nbd+unix://?socket=42 The NBD server should normally terminate automatically, so trying to kill it here causes a race: the test fails when the server has already terminated before the kill command executes. The "Stop NBD server" patch was originally written to solve another problem with a hanging NBD server, but since that problem has been properly solved by commit 3e1683485656, we now don't need the "_stop_nbd_server" here anymore. Reviewed-by: Hanna Czenczek Signed-off-by: Thomas Huth Message-ID: <20250326143533.932899-1-th...@redhat.com> Signed-off-by: Eric Blake --- tests/qemu-iotests/162 | 1 - 1 file changed, 1 deletion(-) diff --git a/tests/qemu-iotests/162 b/tests/qemu-iotests/162 index 956c2c5f339..94dae60d304 100755 --- a/tests/qemu-iotests/162 +++ b/tests/qemu-iotests/162 @@ -65,7 +65,6 @@ done $QEMU_IMG info "json:{'driver': 'nbd', 'host': 'localhost', 'port': $port}" \ | grep '^image' | sed -e "s/$port/PORT/" -_stop_nbd_server # This is a test for NBD's bdrv_refresh_filename() implementation: It expects # either host or path to be set, but it must not assume that they are set to -- 2.48.1
[PATCH v2 1/3] util: Add functions for s390x mmio read/write
Starting with z15 (or newer) we can execute mmio instructions from userspace. On older platforms where we don't have these instructions available we can fallback to using system calls to access the PCI mapped resources. This patch adds helper functions for mmio reads and writes for s390x. Reviewed-by: Stefan Hajnoczi Signed-off-by: Farhan Ali --- include/qemu/s390x_pci_mmio.h | 23 ++ util/meson.build | 2 + util/s390x_pci_mmio.c | 148 ++ 3 files changed, 173 insertions(+) create mode 100644 include/qemu/s390x_pci_mmio.h create mode 100644 util/s390x_pci_mmio.c diff --git a/include/qemu/s390x_pci_mmio.h b/include/qemu/s390x_pci_mmio.h new file mode 100644 index 00..aead791475 --- /dev/null +++ b/include/qemu/s390x_pci_mmio.h @@ -0,0 +1,23 @@ +/* + * s390x PCI MMIO definitions + * + * Copyright 2025 IBM Corp. + * Author(s): Farhan Ali + * + * SPDX-License-Identifier: GPL-2.0-or-later + */ +#ifndef S390X_PCI_MMIO_H +#define S390X_PCI_MMIO_H + +uint8_t s390x_pci_mmio_read_8(const void *ioaddr); +uint16_t s390x_pci_mmio_read_16(const void *ioaddr); +uint32_t s390x_pci_mmio_read_32(const void *ioaddr); +uint64_t s390x_pci_mmio_read_64(const void *ioaddr); + +void s390x_pci_mmio_write_8(void *ioaddr, uint8_t val); +void s390x_pci_mmio_write_16(void *ioaddr, uint16_t val); +void s390x_pci_mmio_write_32(void *ioaddr, uint32_t val); +void s390x_pci_mmio_write_64(void *ioaddr, uint64_t val); + + +#endif diff --git a/util/meson.build b/util/meson.build index 780b5977a8..acb21592f9 100644 --- a/util/meson.build +++ b/util/meson.build @@ -131,4 +131,6 @@ elif cpu in ['ppc', 'ppc64'] util_ss.add(files('cpuinfo-ppc.c')) elif cpu in ['riscv32', 'riscv64'] util_ss.add(files('cpuinfo-riscv.c')) +elif cpu == 's390x' + util_ss.add(files('s390x_pci_mmio.c')) endif diff --git a/util/s390x_pci_mmio.c b/util/s390x_pci_mmio.c new file mode 100644 index 00..820458a026 --- /dev/null +++ b/util/s390x_pci_mmio.c @@ -0,0 +1,148 @@ +/* + * s390x PCI MMIO definitions + * + * Copyright 2025 IBM Corp. 
+ * Author(s): Farhan Ali + * + * SPDX-License-Identifier: GPL-2.0-or-later + */ + +#include "qemu/osdep.h" +#include +#include +#include "qemu/s390x_pci_mmio.h" +#include "elf.h" + +union register_pair { +unsigned __int128 pair; +struct { +uint64_t even; +uint64_t odd; +}; +}; + +static bool is_mio_supported; + +static __attribute__((constructor)) void check_is_mio_supported(void) +{ +is_mio_supported = !!(qemu_getauxval(AT_HWCAP) & HWCAP_S390_PCI_MIO); +} + +static uint64_t s390x_pcilgi(const void *ioaddr, size_t len) +{ +union register_pair ioaddr_len = { .even = (uint64_t)ioaddr, + .odd = len }; +uint64_t val; +int cc; + +asm volatile( +/* pcilgi */ +".insn rre,0xb9d6,%[val],%[ioaddr_len]\n" +"ipm %[cc]\n" +"srl %[cc],28\n" +: [cc] "=d"(cc), [val] "=d"(val), +[ioaddr_len] "+&d"(ioaddr_len.pair) :: "cc"); + +if (cc) { +val = -1ULL; +} + +return val; +} + +static void s390x_pcistgi(void *ioaddr, uint64_t val, size_t len) +{ +union register_pair ioaddr_len = {.even = (uint64_t)ioaddr, .odd = len}; + +asm volatile ( +/* pcistgi */ +".insn rre,0xb9d4,%[val],%[ioaddr_len]\n" +: [ioaddr_len] "+&d" (ioaddr_len.pair) +: [val] "d" (val) +: "cc", "memory"); +} + +uint8_t s390x_pci_mmio_read_8(const void *ioaddr) +{ +uint8_t val = 0; + +if (is_mio_supported) { +val = s390x_pcilgi(ioaddr, sizeof(val)); +} else { +syscall(__NR_s390_pci_mmio_read, ioaddr, &val, sizeof(val)); +} +return val; +} + +uint16_t s390x_pci_mmio_read_16(const void *ioaddr) +{ +uint16_t val = 0; + +if (is_mio_supported) { +val = s390x_pcilgi(ioaddr, sizeof(val)); +} else { +syscall(__NR_s390_pci_mmio_read, ioaddr, &val, sizeof(val)); +} +return val; +} + +uint32_t s390x_pci_mmio_read_32(const void *ioaddr) +{ +uint32_t val = 0; + +if (is_mio_supported) { +val = s390x_pcilgi(ioaddr, sizeof(val)); +} else { +syscall(__NR_s390_pci_mmio_read, ioaddr, &val, sizeof(val)); +} +return val; +} + +uint64_t s390x_pci_mmio_read_64(const void *ioaddr) +{ +uint64_t val = 0; + +if (is_mio_supported) { +val = s390x_pcilgi(ioaddr, sizeof(val)); +} else { +syscall(__NR_s390_pci_mmio_read, ioaddr, &val, sizeof(val)); +} +return val; +} + +void s390x_pci_mmio_write_8(void *ioaddr, uint8_t val) +{ +if (is_mio_supported) { +s390x_pcistgi(ioaddr, val, sizeof(val)); +} else { +syscall(__NR_s390_pci_mmio_write, ioaddr, &val, sizeof(val)); +} +} + +void s390x_pci_mmio_write_16(void *ioaddr, uint16_t val) +{ +if (is_mio_supported) { +s390x_pcistgi(ioaddr, val, sizeof(val)); +} else { +syscall(__NR_s3
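As a hedged usage sketch, this is roughly how a userspace driver could touch device registers through the helpers declared in include/qemu/s390x_pci_mmio.h above. The BAR pointer would come from the VFIO mmap mentioned in the cover letter; the register offsets (NVMe CSTS at 0x1c, CC at 0x14) are used purely as examples, and the function names are made up:

#include "qemu/s390x_pci_mmio.h"

#define NVME_REG_CC   0x14   /* controller configuration, as an example */
#define NVME_REG_CSTS 0x1c   /* controller status, as an example */

static uint32_t nvme_read_csts(const void *bar0)
{
    /* Uses the PCILGI instruction on z15+, the mmio-read syscall otherwise */
    return s390x_pci_mmio_read_32((const char *)bar0 + NVME_REG_CSTS);
}

static void nvme_write_cc(void *bar0, uint32_t val)
{
    /* Uses the PCISTGI instruction on z15+, the mmio-write syscall otherwise */
    s390x_pci_mmio_write_32((char *)bar0 + NVME_REG_CC, val);
}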
[PATCH v2 1/2] io/block: Refactoring the bdrv_drained_begin() function and implement a timeout mechanism.
From: Zhu Yangyang The bdrv_drained_begin() function is a blocking function. In scenarios where network storage is used and network links fail, it may block for a long time. Therefore, we add a timeout parameter to control the duration of the block. Since bdrv_drained_begin() has been widely adopted, both bdrv_drained_begin() and bdrv_drained_begin_timeout() will be retained. Signed-off-by: Zhu Yangyang --- block/io.c | 58 +--- include/block/aio-wait.h | 49 + include/block/block-io.h | 22 ++- util/aio-wait.c | 5 4 files changed, 123 insertions(+), 11 deletions(-) diff --git a/block/io.c b/block/io.c index 1ba8d1aeea..912b76c4a4 100644 --- a/block/io.c +++ b/block/io.c @@ -255,6 +255,8 @@ typedef struct { bool begin; bool poll; BdrvChild *parent; +uint64_t timeout_ns; +int ret; } BdrvCoDrainData; /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */ @@ -283,6 +285,10 @@ static bool bdrv_drain_poll_top_level(BlockDriverState *bs, return bdrv_drain_poll(bs, ignore_parent, false); } +static int bdrv_do_drained_begin_timeout(BlockDriverState *bs, + BdrvChild *parent, + bool poll, + uint64_t timeout_ns); static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent, bool poll); static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent); @@ -296,7 +302,9 @@ static void bdrv_co_drain_bh_cb(void *opaque) if (bs) { bdrv_dec_in_flight(bs); if (data->begin) { -bdrv_do_drained_begin(bs, data->parent, data->poll); +data->ret = bdrv_do_drained_begin_timeout(bs, data->parent, + data->poll, + data->timeout_ns); } else { assert(!data->poll); bdrv_do_drained_end(bs, data->parent); @@ -310,10 +318,11 @@ static void bdrv_co_drain_bh_cb(void *opaque) aio_co_wake(co); } -static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs, -bool begin, -BdrvChild *parent, -bool poll) +static int coroutine_fn bdrv_co_yield_to_drain_timeout(BlockDriverState *bs, + bool begin, + BdrvChild *parent, + bool poll, + uint64_t timeout_ns) { BdrvCoDrainData data; Coroutine *self = qemu_coroutine_self(); @@ -329,6 +338,8 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs, .begin = begin, .parent = parent, .poll = poll, +.timeout_ns = timeout_ns, +.ret = 0 }; if (bs) { @@ -342,16 +353,27 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs, /* If we are resumed from some other event (such as an aio completion or a * timer callback), it is a bug in the caller that should be fixed. */ assert(data.done); +return data.ret; } -static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent, - bool poll) +static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs, +bool begin, +BdrvChild *parent, +bool poll) +{ +bdrv_co_yield_to_drain_timeout(bs, begin, parent, poll, 0); +} + +static int bdrv_do_drained_begin_timeout(BlockDriverState *bs, + BdrvChild *parent, + bool poll, + uint64_t timeout_ns) { IO_OR_GS_CODE(); if (qemu_in_coroutine()) { -bdrv_co_yield_to_drain(bs, true, parent, poll); -return; +return bdrv_co_yield_to_drain_timeout(bs, true, parent, poll, + timeout_ns); } GLOBAL_STATE_CODE(); @@ -375,8 +397,17 @@ static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent, * nodes. */ if (poll) { -BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent)); +return BDRV_POLL_WHILE_TIMEOUT(bs, + bdrv_drain_poll_top_level(bs, parent), + timeout_ns); } +return 0; +} + +static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent, + bool
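Reduced to a standalone sketch, the mechanism has this shape (an illustration of the pattern, not the patch's code): poll a condition until it clears or a deadline passes, with timeout_ns == 0 meaning wait forever. In QEMU the busy loop below corresponds to the blocking aio_poll() inside BDRV_POLL_WHILE_TIMEOUT().

#include <stdbool.h>
#include <stdint.h>
#include <time.h>
#include <errno.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Returns 0 once cond() is false, or -ETIMEDOUT if the deadline passes */
static int poll_until(bool (*cond)(void *), void *opaque, uint64_t timeout_ns)
{
    uint64_t deadline = timeout_ns ? now_ns() + timeout_ns : 0;

    while (cond(opaque)) {
        if (deadline && now_ns() >= deadline) {
            return -ETIMEDOUT;
        }
        /* in QEMU this is where aio_poll(ctx, true) runs event handlers */
    }
    return 0;
}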
Re: [PATCH 04/15] fuse: Explicitly set inode ID to 1
On Tue, Mar 25, 2025 at 05:06:44PM +0100, Hanna Czenczek wrote: > Setting .st_ino to the FUSE inode ID is kind of arbitrary. While in > practice it is going to be fixed (to FUSE_ROOT_ID, which is 1) because > we only have the root inode, that is not obvious in fuse_getattr(). > > Just explicitly set it to 1 (i.e. no functional change). > > Signed-off-by: Hanna Czenczek > --- > block/export/fuse.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Stefan Hajnoczi
[PATCH v1 0/2] Enable QEMU NVMe userspace driver on s390x
Hi, Recently on s390x we have enabled mmap support for vfio-pci devices [1]. This allows us to take advantage of that and use userspace drivers on s390x. However, on s390x we have special instructions for MMIO access. Starting with z15 (and newer platforms) we have new PCI Memory I/O (MIO) instructions which operate on virtually mapped PCI memory spaces, and can be used from userspace. On older platforms we fall back to using existing system calls for MMIO access. This patch series introduces support for the PCI MIO instructions, and enables s390x support for the userspace NVMe driver. I would appreciate any review/feedback on the patches. Thanks Farhan [1] https://lore.kernel.org/linux-s390/20250226-vfio_pci_mmap-v7-0-c5c0f1d26...@linux.ibm.com/ Farhan Ali (2): util: Add functions for s390x mmio read/write block/nvme: Enable NVMe userspace driver for s390x block/nvme.c | 95 -- include/qemu/s390x_pci_mmio.h | 17 ++ util/meson.build | 2 + util/s390x_pci_mmio.c | 105 ++ 4 files changed, 201 insertions(+), 18 deletions(-) create mode 100644 include/qemu/s390x_pci_mmio.h create mode 100644 util/s390x_pci_mmio.c -- 2.43.0
Re: [PATCH v2 1/2] io/block: Refactoring the bdrv_drained_begin() function and implement a timeout mechanism.
On Fri, Mar 21, 2025 at 03:09:16PM +0800, zoudongjie wrote: > From: Zhu Yangyang > > The bdrv_drained_begin() function is a blocking function. In scenarios where > network storage > is used and network links fail, it may block for a long time. > Therefore, we add a timeout parameter to control the duration of the block. > > Since bdrv_drained_begin() has been widely adopted, both bdrv_drained_begin() > and bdrv_drained_begin_timeout() will be retained. > > Signed-off-by: Zhu Yangyang > --- > block/io.c | 58 +--- > include/block/aio-wait.h | 49 + > include/block/block-io.h | 22 ++- > util/aio-wait.c | 5 > 4 files changed, 123 insertions(+), 11 deletions(-) > > diff --git a/block/io.c b/block/io.c > index 1ba8d1aeea..912b76c4a4 100644 > --- a/block/io.c > +++ b/block/io.c > @@ -255,6 +255,8 @@ typedef struct { > bool begin; > bool poll; > BdrvChild *parent; > +uint64_t timeout_ns; > +int ret; > } BdrvCoDrainData; > > /* Returns true if BDRV_POLL_WHILE() should go into a blocking aio_poll() */ > @@ -283,6 +285,10 @@ static bool bdrv_drain_poll_top_level(BlockDriverState > *bs, > return bdrv_drain_poll(bs, ignore_parent, false); > } > > +static int bdrv_do_drained_begin_timeout(BlockDriverState *bs, > + BdrvChild *parent, > + bool poll, > + uint64_t timeout_ns); > static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent, >bool poll); > static void bdrv_do_drained_end(BlockDriverState *bs, BdrvChild *parent); > @@ -296,7 +302,9 @@ static void bdrv_co_drain_bh_cb(void *opaque) > if (bs) { > bdrv_dec_in_flight(bs); > if (data->begin) { > -bdrv_do_drained_begin(bs, data->parent, data->poll); > +data->ret = bdrv_do_drained_begin_timeout(bs, data->parent, > + data->poll, > + data->timeout_ns); > } else { > assert(!data->poll); > bdrv_do_drained_end(bs, data->parent); > @@ -310,10 +318,11 @@ static void bdrv_co_drain_bh_cb(void *opaque) > aio_co_wake(co); > } > > -static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs, > -bool begin, > -BdrvChild *parent, > -bool poll) > +static int coroutine_fn bdrv_co_yield_to_drain_timeout(BlockDriverState *bs, > + bool begin, > + BdrvChild *parent, > + bool poll, > + uint64_t timeout_ns) > { > BdrvCoDrainData data; > Coroutine *self = qemu_coroutine_self(); > @@ -329,6 +338,8 @@ static void coroutine_fn > bdrv_co_yield_to_drain(BlockDriverState *bs, > .begin = begin, > .parent = parent, > .poll = poll, > +.timeout_ns = timeout_ns, > +.ret = 0 > }; > > if (bs) { > @@ -342,16 +353,27 @@ static void coroutine_fn > bdrv_co_yield_to_drain(BlockDriverState *bs, > /* If we are resumed from some other event (such as an aio completion or > a > * timer callback), it is a bug in the caller that should be fixed. */ > assert(data.done); > +return data.ret; > } > > -static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent, > - bool poll) > +static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs, > +bool begin, > +BdrvChild *parent, > +bool poll) > +{ > +bdrv_co_yield_to_drain_timeout(bs, begin, parent, poll, 0); > +} > + > +static int bdrv_do_drained_begin_timeout(BlockDriverState *bs, > + BdrvChild *parent, > + bool poll, > + uint64_t timeout_ns) > { > IO_OR_GS_CODE(); > > if (qemu_in_coroutine()) { > -bdrv_co_yield_to_drain(bs, true, parent, poll); > -return; > +return bdrv_co_yield_to_drain_timeout(bs, true, parent, poll, > + timeout_ns); > } > > GLOBAL_STATE_CODE(); > @@ -375,8 +397,17 @@ static void bdrv_do_drained_begin(BlockDriverState *bs, > BdrvChild *parent, > * nodes. 
> */ > if (poll) { > -BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent)); > +retur
[PATCH v5] block/file-posix.c: Use pwritev2() with RWF_DSYNC for FUA
Force Unit Access (FUA) is an optimization where a disk write with the flag set will be persisted to disk immediately instead of potentially remaining in the disk's write cache. This commit addresses the TODO about using pwritev2() with RWF_DSYNC in the thread pool section of raw_co_prw(), if pwritev2() with RWF_DSYNC is available on the host, which is always the case for Linux kernels >= 4.7. The intent for FUA is indicated with the BDRV_REQ_FUA flag. The old code paths are preserved in case BDRV_REQ_FUA is off or pwritev2() with RWF_DSYNC is not available. Support for disk writes with FUA is handled in qemu_pwritev_fua(), which uses pwritev2() with RWF_DSYNC if available, otherwise falls back to pwritev2() with no flags followed by a flush using handle_aiocb_flush(). If pwritev2() is not implemented, then the disk write in the linear FUA path will fall back to pwrite() + handle_aiocb_flush(). Signed-off-by: Pinku Deb Nath --- v4: - Add fallback when qemu_pwritev_fua() returns ENOSYS - Similar fallback was not added for handle_aiocb_rw_vector() since there is a preadv_present check in handle_aiocb_rw() v3: - Changed signature to add fd, iov, nr_iov - Return -ENOSYS for non-Linux hosts v2: - Moved handle_aiocb_flush() into qemu_pwritev_fua() - In handle_aiocb_rw_linear(), iovec with iovcnt=1 is created based on the assumption that there will be only one buffer --- block/file-posix.c | 68 ++ 1 file changed, 56 insertions(+), 12 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 56d1972d15..59bed7866a 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -229,6 +229,7 @@ typedef struct RawPosixAIOData { unsigned long op; } zone_mgmt; }; +BdrvRequestFlags flags; } RawPosixAIOData; #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) @@ -1674,6 +1675,20 @@ qemu_pwritev(int fd, const struct iovec *iov, int nr_iov, off_t offset) return pwritev(fd, iov, nr_iov, offset); } +static ssize_t +qemu_pwritev_fua(int fd, struct iovec *iov, int nr_iov, off_t offset, const RawPosixAIOData *aiocb) +{ +#ifdef RWF_DSYNC +return pwritev2(fd, iov, nr_iov, offset, RWF_DSYNC); +#else +ssize_t len = pwritev2(fd, iov, nr_iov, offset, 0); +if (len == 0) { +len = handle_aiocb_flush(aiocb); +} +return len; +#endif +} + #else static bool preadv_present = false; @@ -1690,6 +1705,11 @@ qemu_pwritev(int fd, const struct iovec *iov, int nr_iov, off_t offset) return -ENOSYS; } +static ssize_t +qemu_pwritev_fua(int fd, struct iovec *iov, int nr_iov, off_t offset, const RawPosixAIOData *aiocb) +{ +return -ENOSYS; +} #endif static ssize_t handle_aiocb_rw_vector(RawPosixAIOData *aiocb) @@ -1698,10 +1718,16 @@ static ssize_t handle_aiocb_rw_vector(RawPosixAIOData *aiocb) len = RETRY_ON_EINTR( (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) ? -qemu_pwritev(aiocb->aio_fildes, - aiocb->io.iov, - aiocb->io.niov, - aiocb->aio_offset) : +(aiocb->flags & BDRV_REQ_FUA) ?
+qemu_pwritev_fua(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset, +aiocb) : +qemu_pwritev(aiocb->aio_fildes, +aiocb->io.iov, +aiocb->io.niov, +aiocb->aio_offset) : qemu_preadv(aiocb->aio_fildes, aiocb->io.iov, aiocb->io.niov, @@ -1727,10 +1753,31 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf) while (offset < aiocb->aio_nbytes) { if (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) { -len = pwrite(aiocb->aio_fildes, - (const char *)buf + offset, - aiocb->aio_nbytes - offset, - aiocb->aio_offset + offset); +if (aiocb->flags & BDRV_REQ_FUA) { +struct iovec iov = { +.iov_base = buf + offset, +.iov_len = aiocb->aio_nbytes - offset, +}; +len = qemu_pwritev_fua(aiocb->aio_fildes, +&iov, +1, +aiocb->aio_offset + offset, +aiocb); +if (len == -ENOSYS) { +len = pwrite(aiocb->aio_fildes, +(const char *)buf + offset, +aiocb->aio_nbytes - offset, +aiocb->aio_offset + offset); +if (len == 0) { +len = handle_aiocb_flush(aiocb); +
Re: Can I contribute Vitastor block driver? Or maybe introduce a QAPI plugin system?
Ok, thank you very much for the information! So I should first try to make my library available in one of the distros used for CI. As I understand these are Alpine, Ubuntu, Debian, Fedora, CentOS and OpenSUSE. OK, I'll try it first :) Regarding QAPI, I'd say it would be cool if it was allowed... I'm not sure if GPL circumvention is a real risk, because in some way it's already possible via the vhost-user protocol %). But it's up to you to decide of course :) > I'm not speaking for the QEMU project. I hope to be helpful anyway. I > am the QAPI maintainer, so my thoughts carry a bit more weight there. > > I understand your block driver depends on your libvitastor_client > library. > > Dependencies that aren't available on at least one supported build > platform (and thus covered by CI) are problematic. For Linux, > "available" generally means "provided by the distribution". I doubt > there's a way around getting your library packaged by distributions. > > The QAPI schema is (for better or worse) fixed at compile time by > design. Changing that would be a major undertaking. Whether the > benefits could justify the costs and risks seems rather doubtful to me. > > In my experience, the project invites contributions, not out-of-tree > extensions. The latter require external interfaces, which can only be > harder to maintain than internal ones. There's also the risk of abuse > to circumvent the GPL (I have absolutely no reason to assume you'd try > that!).
Re: [PATCH 14/15] fuse: Implement multi-threading
On Tue, Apr 01, 2025 at 03:36:40PM -0500, Eric Blake wrote: > On Thu, Mar 27, 2025 at 11:55:57AM -0400, Stefan Hajnoczi wrote: > > On Tue, Mar 25, 2025 at 05:06:54PM +0100, Hanna Czenczek wrote: > > > FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs > > > (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)). > > > > > > We can use this to implement multi-threading. > > > > > > Note that the interface presented here differs from the multi-queue > > > interface of virtio-blk: The latter maps virtqueues to iothreads, which > > > allows processing multiple virtqueues in a single iothread. The > > > equivalent (processing multiple FDs in a single iothread) would not make > > > sense for FUSE because those FDs are used in a round-robin fashion by > > > the FUSE kernel driver. Putting two of them into a single iothread will > > > just create a bottleneck. > > > > This text might be outdated. virtio-blk's new iothread-vq-mapping > > parameter provides the "array of iothreads" mentioned below and a way to > > assign virtqueues to those IOThreads. > > > > > > +++ b/qapi/block-export.json > > > @@ -179,12 +179,18 @@ > > > # mount the export with allow_other, and if that fails, try again > > > # without. (since 6.1; default: auto) > > > # > > > +# @iothreads: Enables multi-threading: Handle requests in each of the > > > +# given iothreads (instead of the block device's iothread, or the > > > +# export's "main" iothread). For this, the FUSE FD is duplicated so > > > +# there is one FD per iothread. (since 10.1) > > > > This option isn't FUSE-specific but FUSE is the first export type to > > support it. Please add it to BlockExportOptions instead and refuse > > export creation when the export type only supports 1 IOThread. > > > > Eric: Are you interested in implementing support for multiple IOThreads > > in the NBD export? I remember some time ago we talked about NBD > > multi-conn support, although maybe that was for the client rather than > > the server. > > The NBD server already supports clients that make requests through > multiple TCP sockets, but right now, that server is not taking > advantage of iothreads to spread the TCP load. > > And yes, I am in the middle of working on adding client NBD multi-conn > support (reviving Rich Jones' preliminary patches on what it would > take to have qemu open parallel TCP sockets to a supporting NBD > server), which also will use a round-robin approach (but here, the > round-robin is something we would code up in qemu, rather than the > behavior handed to us by the FUSE kernel layer). Pinning specific > iothreads to a specific TCP socket may or may not make sense, but I > definitely want to have support for handing a pool of iothreads to an > NBD client that will be using multi-conn. Here I'm asking Hanna to make the "iothreads" export parameter generic for all export types (not just for FUSE). Do you think that the NBD export will be able to use the generic parameter or would you prefer an NBD-specific export parameter? Stefan
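For reference, the FD cloning the commit message describes boils down to the following hedged sketch of the kernel interface (FUSE_DEV_IOC_CLONE comes from <linux/fuse.h>; the function name is made up and error handling is abbreviated):

#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fuse.h>

/*
 * Clone an existing FUSE session FD; the kernel then distributes
 * incoming requests round-robin across the original FD and its clones,
 * so each worker thread can read from its own FD.
 */
static int fuse_clone_session_fd(int session_fd)
{
    uint32_t src = session_fd;
    int clone_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC);

    if (clone_fd < 0) {
        return -1;
    }
    if (ioctl(clone_fd, FUSE_DEV_IOC_CLONE, &src) < 0) {
        close(clone_fd);
        return -1;
    }
    return clone_fd;
}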
Re: [PATCH 10/15] fuse: Add halted flag
On Tue, Mar 25, 2025 at 05:06:50PM +0100, Hanna Czenczek wrote: > This is a flag that we will want when processing FUSE requests > ourselves: When the kernel sends us e.g. a truncated request (i.e. we > receive less data than the request's indicated length), we cannot rely > on subsequent data to be valid. Then, we are going to set this flag, > halting all FUSE request processing. > > We plan to only use this flag in cases that would effectively be kernel > bugs. > > (Right now, the flag is unused because libfuse still does our request > processing.) > > Signed-off-by: Hanna Czenczek > --- > block/export/fuse.c | 30 ++ > 1 file changed, 30 insertions(+) Reviewed-by: Stefan Hajnoczi
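A hedged sketch of how such a flag would be consulted once QEMU reads requests itself; the halted field and fuse_detach_handlers() are taken from elsewhere in this series (see patch 08), while the function name here is illustrative only:

#include <linux/fuse.h>

static void fuse_check_request(FuseExport *exp, ssize_t read_len)
{
    if (read_len < (ssize_t)sizeof(struct fuse_in_header)) {
        /* Truncated request: effectively a kernel bug, so stop all
         * further FUSE request processing for this export. */
        exp->halted = true;
        fuse_detach_handlers(exp);
    }
}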