Re: [PATCH v3 0/3] Enable QEMU NVMe userspace driver on s390x

2025-04-03 Thread Niklas Schnelle
On Wed, 2025-04-02 at 11:51 -0400, Stefan Hajnoczi wrote:
> On Tue, Apr 01, 2025 at 10:22:43AM -0700, Farhan Ali wrote:
> > Hi,
> > 
> > Recently on s390x we have enabled mmap support for vfio-pci devices [1].
> 
> Hi Alex,
> I wanted to bring this to your attention. Feel free to merge it through
> the VFIO tree, otherwise I will merge it once you have taken a look.
> 
> Thanks,
> Stefan
> 
> > This allows us to take advantage and use userspace drivers on s390x. 
> > However,
> > on s390x we have special instructions for MMIO access. Starting with z15 
> > (and newer platforms) we have new PCI Memory I/O (MIO) instructions which 
> > operate on virtually mapped PCI memory spaces, and can be used from 
> > userspace.
> > On older platforms we would fall back to using existing system calls for 
> > MMIO access.
> > 
> > This patch series introduces support for the PCI MIO instructions, and enables 
> > s390x
> > support for the userspace NVMe driver on s390x. I would appreciate any 
> > review/feedback
> > on the patches.
> > 
> > Thanks
> > Farhan

Hi Stefan,

the kernel patch actually made it into Linus' tree for v6.15 already as
commit aa9f168d55dc ("s390/pci: Support mmap() of PCI resources except
for ISM devices") plus prerequisites. This went via the PCI tree
because they included a change to struct pci_dev and also enabled
mmap() on PCI resource files. Alex reviewed an earlier version and was
the one who suggested to also enable mmap() on PCI resources.

Thanks,
Niklas

> > 
> > [1] 
> > https://lore.kernel.org/linux-s390/20250226-vfio_pci_mmap-v7-0-c5c0f1d26...@linux.ibm.com/
> > 
> > ChangeLog
> > -
> > v2 series https://mail.gnu.org/archive/html/qemu-devel/2025-03/msg06847.html
> > v2 -> v3
> > - Update the PCI MMIO APIs to reflect that it's PCI MMIO access on the host, 
> > as suggested by Stefan (patch 2)
> > - Move s390x ifdef check to s390x_pci_mmio.h as suggested by Philippe 
> > (patch 1)
> > - Add R-bs for the respective patches.
> > 
> > v1 series https://mail.gnu.org/archive/html/qemu-devel/2025-03/msg06596.html
> > v1 -> v2
> > - Add 8 and 16 bit reads/writes for completeness (patch 1)
> > - Introduce new QEMU PCI MMIO read/write API as suggested by Stefan 
> > (patch 2)
> > - Update NVMe userspace driver to use QEMU PCI MMIO functions (patch 3)
> > 
> > Farhan Ali (3):
> >   util: Add functions for s390x mmio read/write
> >   include: Add a header to define host PCI MMIO functions
> >   block/nvme: Use host PCI MMIO API
> > 
> >  block/nvme.c  |  37 +
> >  include/qemu/host-pci-mmio.h  | 116 ++
> >  include/qemu/s390x_pci_mmio.h |  24 ++
> >  util/meson.build  |   2 +
> >  util/s390x_pci_mmio.c | 148 ++
> >  5 files changed, 311 insertions(+), 16 deletions(-)
> >  create mode 100644 include/qemu/host-pci-mmio.h
> >  create mode 100644 include/qemu/s390x_pci_mmio.h
> >  create mode 100644 util/s390x_pci_mmio.c
> > 
> > -- 
> > 2.43.0
> > 




[PATCH v2 0/2] [PATCH] block/file-posix.c: Use pwritev2() with RWF_DSYNC for FUA - update

2025-04-03 Thread Pinku Deb Nath
The testing with "-t writeback" works for turning on enable_write_cache.
I renamed the function to qemu_pwritev_fua() and fixed any typos.

I moved the handle_aiocb_flush() call into qemu_pwritev_fua() and
removed it from the previous TODO section. Initially I thought
of only passing aiocb, but then I was not sure whether I could
derive buf from aiocb, so I added arguments for iovec and iovcnt
to qemu_pwritev_fua().

For handling buf in handle_aiocb_rw_linear(), I created an iovec
and passed a reference to it. I assumed that there will be only one
buffer/iovec, so I passed 1 for iovcnt.

Signed-off-by: Pinku Deb Nath 

Pinku Deb Nath (2):
  block/file-posix.c: Use pwritev2() with RWF_DSYNC for FUA
  block/file-posix.c: Use pwritev2() with RWF_DSYNC for FUA - update

 block/file-posix.c | 54 +++---
 1 file changed, 42 insertions(+), 12 deletions(-)

-- 
2.43.0




Re: [PATCH] [for-10.1] qapi/block-core: deprecate some block-job- APIs

2025-04-03 Thread Markus Armbruster
Vladimir Sementsov-Ogievskiy  writes:

> For the change, pause, resume, complete, dismiss and finalize actions the
> corresponding job- and block-job commands are almost equal. The
> difference is in the find_block_job_locked() vs find_job_locked()
> functions. What's different?
>
> 1. find_block_job_locked() checks that the found job is a block job. This
>    is OK when moving to the more generic API; no need to document this change.
>
> 2. find_block_job_locked() reports DeviceNotActive on failure, whereas
>    find_job_locked() reports GenericError. So, let's document this
>    difference in deprecated.txt. Still, for dismiss and finalize, errors
>    are not documented at all, so be silent in deprecated.txt as well.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>
> Hi all!
>
> This is a continuation of my "[RFC 00/15] block job API"[1] - specifically,
> the simplest part of it: deprecating the block-job-* commands which simply
> duplicate job-* ones.
>
> Note that the old series started out trying to introduce a job-change
> command as a substitute for both block-job-change (which can only change
> the mirror copy-mode parameter) and block-job-set-speed. It was a rather
> complicated and controversial attempt, whose latest implementation was
> "[PATCH v3 0/7] introduce job-change qmp command"[2].
>
> In [2] Kevin noted that we'd better follow the existing blockdev-reopen
> approach of changing options (i.e. specify all options) rather than
> introduce a new one. But, on the other hand, I'm now afraid that rewriting
> a simple call to (good old) block-job-set-speed in third-party tools into
> job-change(_all_options_ + changed speed) is too much work just for
> "having a nice interface". And too much for the only two options we want
> to change.
>
> So, let's just start with something more obvious. Finally,
> we can simply keep block-job-set-speed and block-job-change as is,
> as they (unlike the other block-job-* commands) are not duplicated by
> similar job-* commands.
>
> [1] 
> https://patchew.org/QEMU/20240313150907.623462-1-vsement...@yandex-team.ru/
> [2] 
> https://patchew.org/QEMU/20241002140616.561652-1-vsement...@yandex-team.ru/

Thank you for your efforts at making the interfaces cleaner, simpler,
and less redundant.

>  docs/about/deprecated.rst | 31 +++
>  qapi/block-core.json  | 30 ++
>  2 files changed, 61 insertions(+)
>
> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
> index e2b4f077d4..eed3356359 100644
> --- a/docs/about/deprecated.rst
> +++ b/docs/about/deprecated.rst
> @@ -148,6 +148,37 @@ options are removed in favor of using explicit 
> ``blockdev-create`` and
>  ``blockdev-add`` calls. See :doc:`/interop/live-block-operations` for
>  details.
>  
> +``block-job-pause`` (since 10.1)
> +'''
> +
> +Use ``job-pause`` instead. The only difference is that ``job-pause``
> +always reports GenericError on failure when ``block-job-pause`` reports
> +DeviceNotActive when block-job is not found.

block-job-pause's doc comment:

# The operation will pause as soon as possible.  No event is emitted
# when the operation is actually paused.  [...]

job-pause's:

# The job will pause as soon as possible, which means transitioning
# into the PAUSED state if it was RUNNING, or into STANDBY if it was
# READY.  The corresponding JOB_STATUS_CHANGE event will be emitted.

Either one of them is incorrect about event emission, or there is
another difference.

> +
> +``block-job-resume`` (since 10.1)
> +
> +
> +Use ``job-resume`` instead. The only difference is that ``job-resume``
> +always reports GenericError on failure when ``block-job-resume`` reports
> +DeviceNotActive when block-job is not found.

block-job-resume's doc comment:

# This command also clears the error status of the job.

Nothing like that in job-resume's.  Either one of them is incorrect /
incomplete about the error status clearance, or there is another
difference.

> +
> +``block-job-complete`` (since 10.1)
> +''
> +
> +Use ``job-complete`` instead. The only difference is that ``job-complete``
> +always reports GenericError on failure when ``block-job-complete`` reports
> +DeviceNotActive when block-job is not found.

block-job-complete's doc comment:

# Manually trigger completion of an active background block operation.
# This is supported for drive mirroring, where it also switches the
# device to write to the target path only.  The ability to complete is
# signaled with a BLOCK_JOB_READY event.
#
# This command completes an active background block operation
# synchronously.  The ordering of this command's return with the
# BLOCK_JOB_COMPLETED event is not defined.  Note that if an I/O error
# occurs during the processing of this command: 1) the command itself
# will fail; 2) the error will be processed according to the
# rerror/werror arguments that we

Re: [PATCH v2 2/2] [PATCH] block/file-posix.c: Use pwritev2() with RWF_DSYNC for FUA - update

2025-04-03 Thread Stefan Hajnoczi
On Thu, Apr 03, 2025 at 01:16:33AM -0700, Pinku Deb Nath wrote:
> The testing with "-t writeback" works for turning on enable_write_cache.
> I renamed the function to qemu_pwritev_fua() and fixed any typos.
> 
> I moved the handle_aiocb_flush() call into qemu_pwritev_fua() and
> removed it from the previous TODO section. Initially I thought
> of only passing aiocb, but then I was not sure whether I could
> derive buf from aiocb, so I added arguments for iovec and iovcnt
> to qemu_pwritev_fua().
> 
> For handling buf in handle_aiocb_rw_linear(), I created an iovec
> and passed a reference to it. I assumed that there will be only one
> buffer/iovec, so I passed 1 for iovcnt.
> 
> Signed-off-by: Pinku Deb Nath 
> ---
>  block/file-posix.c | 38 +-
>  1 file changed, 21 insertions(+), 17 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 34de816eab..4fffd49318 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1676,12 +1676,24 @@ qemu_pwritev(int fd, const struct iovec *iov, int 
> nr_iov, off_t offset)
>  }
>  
>  static ssize_t
> -qemu_pwrite_fua(int fd, const struct iovec *iov, int nr_iov, off_t offset)
> +qemu_pwritev_fua(const RawPosixAIOData *aiocb, struct iovec *iov, int iovcnt)
>  {
>  #ifdef RWF_DSYNC
> -return pwritev2(fd, iov, nr_iov, offset, RWF_DSYNC);
> +return pwritev2(aiocb->aio_fildes,
> +iov,
> +iovcnt,
> +aiocb->aio_offset,
> +RWF_DSYNC);
>  #else
> -return pwritev2(fd, iov, nr_iov, offset, 0);
> +ssize_t len = pwritev2(aiocb->aio_fildes,
> +iov,
> +iovcnt,
> +aiocb->aio_offset,
> +0);

On a non-Linux host pwritev2(2) will not exist. Please take a look at
how qemu_preadv() is integrated (including the !CONFIG_PREADV case) and
decide on a solution that works on non-Linux hosts.
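
For example, one possible shape, mirroring how qemu_preadv()/qemu_pwritev()
are guarded by CONFIG_PREADV (CONFIG_PWRITEV2 here is a hypothetical build
flag that would need its own meson check; untested sketch only):

static ssize_t
qemu_pwritev_fua(int fd, const struct iovec *iov, int iovcnt, off_t offset)
{
#if defined(CONFIG_PWRITEV2) && defined(RWF_DSYNC)
    /* Linux with pwritev2(): write and persist the data in one syscall. */
    return pwritev2(fd, iov, iovcnt, offset, RWF_DSYNC);
#else
    /* No pwritev2() on this host: plain vectored write; the caller must
     * follow up with an explicit flush to get FUA semantics. */
    return qemu_pwritev(fd, iov, iovcnt, offset);
#endif
}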

> +if (len == 0) {
> +len = handle_aiocb_flush(aiocb);
> +}
> +return len;
>  #endif
>  }
>  
> @@ -1710,10 +1722,7 @@ static ssize_t handle_aiocb_rw_vector(RawPosixAIOData 
> *aiocb)
>  len = RETRY_ON_EINTR(
>  (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) ?
>  (aiocb->flags &  BDRV_REQ_FUA) ?
> -qemu_pwrite_fua(aiocb->aio_fildes,
> -aiocb->io.iov,
> -aiocb->io.niov,
> -aiocb->aio_offset) :
> +qemu_pwritev_fua(aiocb, aiocb->io.iov, aiocb->io.niov) :
>  qemu_pwritev(aiocb->aio_fildes,
>  aiocb->io.iov,
>  aiocb->io.niov,
> @@ -1744,10 +1753,11 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData 
> *aiocb, char *buf)
>  while (offset < aiocb->aio_nbytes) {
>  if (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) {
>  if (aiocb->flags & BDRV_REQ_FUA) {
> -len = qemu_pwrite_fua(aiocb->aio_fildes,
> -aiocb->io.iov,
> -aiocb->io.niov,
> -aiocb->aio_offset);
> +struct iovec iov = {
> +.iov_base = buf,
> +.iov_len = aiocb->aio_nbytes - offset,
> +};
> +len = qemu_pwritev_fua(aiocb, &iov, 1);

The else branch takes offset into account. Here aiocb is passed in
assuming it's the first iteration of the while (offset <
aiocb->aio_nbytes) loop. On subsequent iterations the wrong values will
be used because offset has changed.

Perhaps it's easier to pass in the individual parameters (fd, offset,
etc) instead of passing in aiocb.
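
For example (untested, and assuming the helper takes fd/offset explicitly as
sketched above), so the slice written matches what the loop has already consumed:

struct iovec iov = {
    .iov_base = buf + offset,
    .iov_len = aiocb->aio_nbytes - offset,
};
len = qemu_pwritev_fua(aiocb->aio_fildes, &iov, 1,
                       aiocb->aio_offset + offset);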

>  } else {
>  len = pwrite(aiocb->aio_fildes,
>  (const char *)buf + offset,
> @@ -2567,12 +2577,6 @@ static int coroutine_fn raw_co_prw(BlockDriverState 
> *bs, int64_t *offset_ptr,
>  
>  assert(qiov->size == bytes);
>  ret = raw_thread_pool_submit(handle_aiocb_rw, &acb);
> -#ifndef RWD_DSYNC
> -if (ret == 0 && (flags & BDRV_REQ_FUA)) {
> -/* TODO Use pwritev2() instead if it's available */
> -ret = raw_co_flush_to_disk(bs);
> -}
> -#endif
>  goto out; /* Avoid the compiler err of unused label */
>  
>  out:
> -- 
> 2.43.0
> 




Re: [PATCH 14/15] fuse: Implement multi-threading

2025-04-03 Thread Eric Blake
On Wed, Apr 02, 2025 at 09:20:33AM -0400, Stefan Hajnoczi wrote:
> > > Eric: Are you interested in implementing support for multiple IOThreads
> > > in the NBD export? I remember some time ago we talked about NBD
> > > multi-conn support, although maybe that was for the client rather than
> > > the server.
> > 
> > The NBD server already supports clients that make requests through
> > multiple TCP sockets, but right now, that server is not taking
> > advantage of iothreads to spread the TCP load.
> > 
> > And yes, I am in the middle of working on adding client NBD multi-conn
> > support (reviving Rich Jones' preliminary patches on what it would
> > take to have qemu open parallel TCP sockets to a supporting NBD
> > server), which also will use a round-robin approach (but here, the
> > round-robin is something we would code up in qemu, rather than the
> > behavior handed to us by the FUSE kernel layer).  Pinning specific
> > iothreads to a specific TCP socket may or may not make sense, but I
> > definitely want to have support for handing a pool of iothreads to an
> > NBD client that will be using multi-conn.
> 
> Here I'm asking Hanna to make the "iothreads" export parameter generic
> for all export types (not just for FUSE). Do you think that the NBD
> export will be able to use the generic parameter or would you prefer an
> NBD-specific export parameter?

I would prefer to use a generic parameter. NBD will already need a
specific parameter on whether to attempt to use multiple TCP sockets
if the server advertises multi-conn.  But how to map those TCP sockets
to iothreads feels like a better fit for a generic iothreads parameter; and
perhaps the NBD parameter can also be made smart enough to auto-set
the number of TCP sockets based on the number of available iothreads
if there is no explicit override.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org




Re: [PATCH v3 0/3] Enable QEMU NVMe userspace driver on s390x

2025-04-03 Thread Alex Williamson
On Thu, 3 Apr 2025 11:44:42 -0400
Stefan Hajnoczi  wrote:

> On Thu, Apr 03, 2025 at 09:47:26AM +0200, Niklas Schnelle wrote:
> > On Wed, 2025-04-02 at 11:51 -0400, Stefan Hajnoczi wrote:  
> > > On Tue, Apr 01, 2025 at 10:22:43AM -0700, Farhan Ali wrote:  
> > > > Hi,
> > > > 
> > > > Recently on s390x we have enabled mmap support for vfio-pci devices 
> > > > [1].  
> > > 
> > > Hi Alex,
> > > I wanted to bring this to your attention. Feel free to merge it through
> > > the VFIO tree, otherwise I will merge it once you have taken a look.
> > > 
> > > Thanks,
> > > Stefan
> > >   
> > > > This allows us to take advantage and use userspace drivers on s390x. 
> > > > However,
> > > > on s390x we have special instructions for MMIO access. Starting with 
> > > > z15 
> > > > (and newer platforms) we have new PCI Memory I/O (MIO) instructions 
> > > > which 
> > > > operate on virtually mapped PCI memory spaces, and can be used from 
> > > > userspace.
> > > > On older platforms we would fall back to using existing system calls for 
> > > > MMIO access.
> > > > 
> > > > This patch series introduces support for the PCI MIO instructions, and 
> > > > enables s390x
> > > > support for the userspace NVMe driver on s390x. I would appreciate any 
> > > > review/feedback
> > > > on the patches.
> > > > 
> > > > Thanks
> > > > Farhan  
> > 
> > Hi Stefan,
> > 
> > the kernel patch actually made it into Linus' tree for v6.15 already as
> > commit aa9f168d55dc ("s390/pci: Support mmap() of PCI resources except
> > for ISM devices") plus prerequisites. This went via the PCI tree
> > because they included a change to struct pci_dev and also enabled
> > mmap() on PCI resource files. Alex reviewed an earlier version and was
> > the one who suggested to also enable mmap() on PCI resources.  
> 
> The introduction of a new QEMU API for accessing MMIO BARs in this
> series is something Alex might be interested in as QEMU VFIO maintainer.
> That wouldn't have been part of the kernel patch review.
> 
> If he's aware of the new API he can encourage other VFIO users to use it
> in the future so that you won't need to convert them to work on s390x
> again.

I don't claim any jurisdiction over the vfio-nvme driver.  In general
vfio users should be using either vfio_region_ops, ram_device_mem_ops,
or directly mapping MMIO into the VM address space.  The first uses
pread/write through the region offset, irrespective of the type of
memory, the second provides the type of access used here where we're
dereferencing into an mmap, and the last is of course the preferred
mechanism where available.

It is curious that the proposal here doesn't include any changes to
ram_device_mem_ops for more generically enabling MMIO access on s390x.
Thanks,

Alex




Re: [PATCH v3 0/3] Enable QEMU NVMe userspace driver on s390x

2025-04-03 Thread Farhan Ali



On 4/3/2025 11:05 AM, Alex Williamson wrote:

On Thu, 3 Apr 2025 10:33:52 -0700
Farhan Ali  wrote:


On 4/3/2025 9:27 AM, Alex Williamson wrote:

On Thu, 3 Apr 2025 11:44:42 -0400
Stefan Hajnoczi  wrote:
  

On Thu, Apr 03, 2025 at 09:47:26AM +0200, Niklas Schnelle wrote:

On Wed, 2025-04-02 at 11:51 -0400, Stefan Hajnoczi wrote:

On Tue, Apr 01, 2025 at 10:22:43AM -0700, Farhan Ali wrote:

Hi,

Recently on s390x we have enabled mmap support for vfio-pci devices [1].

Hi Alex,
I wanted to bring this to your attention. Feel free to merge it through
the VFIO tree, otherwise I will merge it once you have taken a look.

Thanks,
Stefan
  

This allows us to take advantage and use userspace drivers on s390x. However,
on s390x we have special instructions for MMIO access. Starting with z15
(and newer platforms) we have new PCI Memory I/O (MIO) instructions which
operate on virtually mapped PCI memory spaces, and can be used from userspace.
On older platforms we would fall back to using existing system calls for MMIO 
access.

This patch series introduces support for the PCI MIO instructions, and enables s390x
support for the userspace NVMe driver on s390x. I would appreciate any 
review/feedback
on the patches.

Thanks
Farhan

Hi Stefan,

the kernel patch actually made it into Linus' tree for v6.15 already as
commit aa9f168d55dc ("s390/pci: Support mmap() of PCI resources except
for ISM devices") plus prerequisites. This went via the PCI tree
because they included a change to struct pci_dev and also enabled
mmap() on PCI resource files. Alex reviewed an earlier version and was
the one who suggested to also enable mmap() on PCI resources.

The introduction of a new QEMU API for accessing MMIO BARs in this
series is something Alex might be interested in as QEMU VFIO maintainer.
That wouldn't have been part of the kernel patch review.

If he's aware of the new API he can encourage other VFIO users to use it
in the future so that you won't need to convert them to work on s390x
again.

I don't claim any jurisdiction over the vfio-nvme driver.  In general
vfio users should be using either vfio_region_ops, ram_device_mem_ops,
or directly mapping MMIO into the VM address space.  The first uses
pread/write through the region offset, irrespective of the type of
memory, the second provides the type of access used here where we're
dereferencing into an mmap, and the last is of course the preferred
mechanism where available.

It is curious that the proposal here doesn't include any changes to
ram_device_mem_ops for more generically enabling MMIO access on s390x.
Thanks,

Alex


Hi Alex,
  From my understanding the ram_device_mem_ops sets up the BAR access for
a guest passthrough device. Unfortunately today an s390x KVM guest
doesn't use or have support for these MIO instructions. We wanted to
use this series as an initial test vehicle of the mmap support.

Right, ram_device_mem_ops is what we'll use to access a BAR that
supports mmap but for whatever reason we're accessing it directly
through the mmap.  For instance if an overlapping quirk prevents the
page from being mapped to the VM or we have some back channel mechanism
where the VMM is interacting with the BAR.

I bring it up here because it's effectively the same kind of access
you're adding with these helpers and would need to be addressed if this
were generically enabling vfio mmap access on s390x.


On s390x the use of the MIO instructions is limited to only PCI access. 
So I am not sure if we should generically apply this to all vfio mmap 
access (for non PCI devices).





Prior to commit 2b8fe81b3c2e ("system/memory: use ldn_he_p/stn_he_p")
the mmio helpers here might have been a drop-in replacement for the
dereferencing of mmap offsets, but something would need to be done
about the explicit PCI assumption introduced here and the possibility
of unaligned accesses that the noted commit tries to resolve.  Thanks,

Alex


AFAICT in qemu today the ram_device_mem_ops is used for non PCI vfio 
mmap cases. For s390x these helpers should be restricted to PCI 
accesses. For the unaligned accesses (thanks for pointing out that 
commit!), are you suggesting we use the ld*_he_p/st*_he_p functions in 
the helpers I defined? Though those functions don't seem to be doing 
volatile accesses.
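
To illustrate the distinction I mean (rough sketch only, not the code from
this series):

#include <stdint.h>
#include <string.h>

/* Volatile dereference: the compiler must issue every MMIO access. */
static inline uint32_t mmio_read32(const void *ioaddr)
{
    return *(volatile const uint32_t *)ioaddr;
}

/* ldl_he_p()-style access: a memcpy-based load that copes with unaligned
 * pointers, but is not volatile, so accesses may be merged or reordered. */
static inline uint32_t plain_read32(const void *ptr)
{
    uint32_t r;
    memcpy(&r, ptr, sizeof(r));
    return r;
}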


Thanks

Farhan





Re: [PATCH v3 0/3] Enable QEMU NVMe userspace driver on s390x

2025-04-03 Thread Alex Williamson
On Thu, 3 Apr 2025 13:33:17 -0700
Farhan Ali  wrote:

> On 4/3/2025 11:05 AM, Alex Williamson wrote:
> > On Thu, 3 Apr 2025 10:33:52 -0700
> > Farhan Ali  wrote:
> >  
> >> On 4/3/2025 9:27 AM, Alex Williamson wrote:  
> >>> On Thu, 3 Apr 2025 11:44:42 -0400
> >>> Stefan Hajnoczi  wrote:
> >>> 
>  On Thu, Apr 03, 2025 at 09:47:26AM +0200, Niklas Schnelle wrote:  
> > On Wed, 2025-04-02 at 11:51 -0400, Stefan Hajnoczi wrote:  
> >> On Tue, Apr 01, 2025 at 10:22:43AM -0700, Farhan Ali wrote:  
> >>> Hi,
> >>>
> >>> Recently on s390x we have enabled mmap support for vfio-pci devices 
> >>> [1].  
> >> Hi Alex,
> >> I wanted to bring this to your attention. Feel free to merge it through
> >> the VFIO tree, otherwise I will merge it once you have taken a look.
> >>
> >> Thanks,
> >> Stefan
> >> 
> >>> This allows us to take advantage and use userspace drivers on s390x. 
> >>> However,
> >>> on s390x we have special instructions for MMIO access. Starting with 
> >>> z15
> >>> (and newer platforms) we have new PCI Memory I/O (MIO) instructions 
> >>> which
> >>> operate on virtually mapped PCI memory spaces, and can be used from 
> >>> userspace.
> >>> On older platforms we would fall back to using existing system calls 
> >>> for MMIO access.
> >>>
> >>> This patch series introduces support for the PCI MIO instructions, and 
> >>> enables s390x
> >>> support for the userspace NVMe driver on s390x. I would appreciate 
> >>> any review/feedback
> >>> on the patches.
> >>>
> >>> Thanks
> >>> Farhan  
> > Hi Stefan,
> >
> > the kernel patch actually made it into Linus' tree for v6.15 already as
> > commit aa9f168d55dc ("s390/pci: Support mmap() of PCI resources except
> > for ISM devices") plus prerequisites. This went via the PCI tree
> > because they included a change to struct pci_dev and also enabled
> > mmap() on PCI resource files. Alex reviewed an earlier version and was
> > the one who suggested to also enable mmap() on PCI resources.  
>  The introduction of a new QEMU API for accessing MMIO BARs in this
>  series is something Alex might be interested in as QEMU VFIO maintainer.
>  That wouldn't have been part of the kernel patch review.
> 
>  If he's aware of the new API he can encourage other VFIO users to use it
>  in the future so that you won't need to convert them to work on s390x
>  again.  
> >>> I don't claim any jurisdiction over the vfio-nvme driver.  In general
> >>> vfio users should be using either vfio_region_ops, ram_device_mem_ops,
> >>> or directly mapping MMIO into the VM address space.  The first uses
> >>> pread/write through the region offset, irrespective of the type of
> >>> memory, the second provides the type of access used here where we're
> >>> dereferencing into an mmap, and the last is of course the preferred
> >>> mechanism where available.
> >>>
> >>> It is curious that the proposal here doesn't include any changes to
> >>> ram_device_mem_ops for more generically enabling MMIO access on s390x.
> >>> Thanks,
> >>>
> >>> Alex  
> >>
> >> Hi Alex,
> >>   From my understanding the ram_device_mem_ops sets up the BAR access for
> >> a guest passthrough device. Unfortunately today an s390x KVM guest
> >> doesn't use or have support for these MIO instructions. We wanted to
> >> use this series as an initial test vehicle of the mmap support.  
> > Right, ram_device_mem_ops is what we'll use to access a BAR that
> > supports mmap but for whatever reason we're accessing it directly
> > through the mmap.  For instance if an overlapping quirk prevents the
> > page from being mapped to the VM or we have some back channel mechanism
> > where the VMM is interacting with the BAR.
> >
> > I bring it up here because it's effectively the same kind of access
> > you're adding with these helpers and would need to be addressed if this
> > were generically enabling vfio mmap access on s390x.  
> 
> On s390x the use of the MIO instructions is limited to only PCI access. 
> So I am not sure if we should generically apply this to all vfio mmap 
> access (for non PCI devices).
> 
> 
> >
> > Prior to commit 2b8fe81b3c2e ("system/memory: use ldn_he_p/stn_he_p")
> > the mmio helpers here might have been a drop-in replacement for the
> > dereferencing of mmap offsets, but something would need to be done
> > about the explicit PCI assumption introduced here and the possibility
> > of unaligned accesses that the noted commit tries to resolve.  Thanks,
> >
> > Alex  
> 
> AFAICT in qemu today the ram_device_mem_ops is used for non PCI vfio 
> mmap cases. For s390x these helpers should be restricted to PCI 
> accesses. For the unaligned accesses (thanks for pointing out that 
> commit!), are you suggesting we use the ld*_he_p/st*_he_p functions in 
> the helpers I defined? Though those functions don't seem to be d

Re: [PATCH v2 0/2] [PATCH] block/file-posix.c: Use pwritev2() with RWF_DSYNC for FUA - update

2025-04-03 Thread Stefan Hajnoczi
On Thu, Apr 03, 2025 at 01:16:31AM -0700, Pinku Deb Nath wrote:
> The testing with "-t writeback" works for turning on enable_write_cache.
> I renamed the function to qemu_pwritev_fua() and fixed any typos.
> 
> I moved the handle_aiocb_flush() call into qemu_pwritev_fua() and
> removed it from the previous TODO section. Initially I thought
> of only passing aiocb, but then I was not sure whether I could
> derive buf from aiocb, so I added arguments for iovec and iovcnt
> to qemu_pwritev_fua().
> 
> For handling buf in handle_aiocb_rw_linear(), I created an iovec
> and passed a reference to it. I assumed that there will be only one
> buffer/iovec, so I passed 1 for iovcnt.
> 
> Signed-off-by: Pinku Deb Nath 
> 
> Pinku Deb Nath (2):
>   block/file-posix.c: Use pwritev2() with RWF_DSYNC for FUA
>   block/file-posix.c: Use pwritev2() with RWF_DSYNC for FUA - update
> 
>  block/file-posix.c | 54 +++---
>  1 file changed, 42 insertions(+), 12 deletions(-)

Thanks for sending this updated patch series. Please squash changes in
the future instead of appending them as separate commits. This means
editing previous commits (e.g. git rebase -i master) so that they
contain changes made after code review.

So if commit 1 is '+ printf("foo\n")', then instead of adding commit 2
to add a semi-colon to the end of the line, just edit the commit so it
is '+ printf("foo\n");' in v2 of your patch.

One reason to squash changes is so that git-bisect(1) works. Without
squashing, there will be intermediate commits that are broken and maybe
don't even compile. git-bisect(1) is only usable when each commit
compiles and passes tests.

Reviewers also tend to prefer to see the final state of commits so they
don't have to review every incremental edit that was made (often
replacing code they already reviewed). It saves them time.

Thanks,
Stefan




Re: [PATCH v3 0/3] Enable QEMU NVMe userspace driver on s390x

2025-04-03 Thread Alex Williamson
On Thu, 3 Apr 2025 10:33:52 -0700
Farhan Ali  wrote:

> On 4/3/2025 9:27 AM, Alex Williamson wrote:
> > On Thu, 3 Apr 2025 11:44:42 -0400
> > Stefan Hajnoczi  wrote:
> >  
> >> On Thu, Apr 03, 2025 at 09:47:26AM +0200, Niklas Schnelle wrote:  
> >>> On Wed, 2025-04-02 at 11:51 -0400, Stefan Hajnoczi wrote:  
>  On Tue, Apr 01, 2025 at 10:22:43AM -0700, Farhan Ali wrote:  
> > Hi,
> >
> > Recently on s390x we have enabled mmap support for vfio-pci devices 
> > [1].  
>  Hi Alex,
>  I wanted to bring this to your attention. Feel free to merge it through
>  the VFIO tree, otherwise I will merge it once you have taken a look.
> 
>  Thanks,
>  Stefan
>   
> > This allows us to take advantage and use userspace drivers on s390x. 
> > However,
> > on s390x we have special instructions for MMIO access. Starting with z15
> > (and newer platforms) we have new PCI Memory I/O (MIO) instructions 
> > which
> > operate on virtually mapped PCI memory spaces, and can be used from 
> > userspace.
> > On older platforms we would fall back to using existing system calls for 
> > MMIO access.
> >
> > This patch series introduces support for the PCI MIO instructions, and 
> > enables s390x
> > support for the userspace NVMe driver on s390x. I would appreciate any 
> > review/feedback
> > on the patches.
> >
> > Thanks
> > Farhan  
> >>> Hi Stefan,
> >>>
> >>> the kernel patch actually made it into Linus' tree for v6.15 already as
> >>> commit aa9f168d55dc ("s390/pci: Support mmap() of PCI resources except
> >>> for ISM devices") plus prerequisites. This went via the PCI tree
> >>> because they included a change to struct pci_dev and also enabled
> >>> mmap() on PCI resource files. Alex reviewed an earlier version and was
> >>> the one who suggested to also enable mmap() on PCI resources.  
> >> The introduction of a new QEMU API for accessing MMIO BARs in this
> >> series is something Alex might be interested in as QEMU VFIO maintainer.
> >> That wouldn't have been part of the kernel patch review.
> >>
> >> If he's aware of the new API he can encourage other VFIO users to use it
> >> in the future so that you won't need to convert them to work on s390x
> >> again.  
> > I don't claim any jurisdiction over the vfio-nvme driver.  In general
> > vfio users should be using either vfio_region_ops, ram_device_mem_ops,
> > or directly mapping MMIO into the VM address space.  The first uses
> > pread/write through the region offset, irrespective of the type of
> > memory, the second provides the type of access used here where we're
> > dereferencing into an mmap, and the last is of course the preferred
> > mechanism where available.
> >
> > It is curious that the proposal here doesn't include any changes to
> > ram_device_mem_ops for more generically enabling MMIO access on s390x.
> > Thanks,
> >
> > Alex  
> 
> 
> Hi Alex,
>  From my understanding the ram_device_mem_ops sets up the BAR access for 
> a guest passthrough device. Unfortunately today an s390x KVM guest 
> doesn't use or have support for these MIO instructions. We wanted to 
> use this series as an initial test vehicle of the mmap support.

Right, ram_device_mem_ops is what we'll use to access a BAR that
supports mmap but for whatever reason we're accessing it directly
through the mmap.  For instance if an overlapping quirk prevents the
page from being mapped to the VM or we have some back channel mechanism
where the VMM is interacting with the BAR.

I bring it up here because it's effectively the same kind of access
you're adding with these helpers and would need to be addressed if this
were generically enabling vfio mmap access on s390x.

Prior to commit 2b8fe81b3c2e ("system/memory: use ldn_he_p/stn_he_p")
the mmio helpers here might have been a drop-in replacement for the
dereferencing of mmap offsets, but something would need to be done
about the explicit PCI assumption introduced here and the possibility
of unaligned accesses that the noted commit tries to resolve.  Thanks,

Alex




[PATCH v2 2/2] [PATCH] block/file-posix.c: Use pwritev2() with RWF_DSYNC for FUA - update

2025-04-03 Thread Pinku Deb Nath
The testing with "-t writeback" works for turning on enable_write_cache.
I renamed the function to qemu_pwritev_fua() and fixed any typos.

I moved the handle_aiocb_flush() call into qemu_pwritev_fua() and
removed it from the previous TODO section. Initially I thought
of only passing aiocb, but then I was not sure whether I could
derive buf from aiocb, so I added arguments for iovec and iovcnt
to qemu_pwritev_fua().

For handling buf in handle_aiocb_rw_linear(), I created an iovec
and passed a reference to it. I assumed that there will be only one
buffer/iovec, so I passed 1 for iovcnt.

Signed-off-by: Pinku Deb Nath 
---
 block/file-posix.c | 38 +-
 1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 34de816eab..4fffd49318 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1676,12 +1676,24 @@ qemu_pwritev(int fd, const struct iovec *iov, int 
nr_iov, off_t offset)
 }
 
 static ssize_t
-qemu_pwrite_fua(int fd, const struct iovec *iov, int nr_iov, off_t offset)
+qemu_pwritev_fua(const RawPosixAIOData *aiocb, struct iovec *iov, int iovcnt)
 {
 #ifdef RWF_DSYNC
-return pwritev2(fd, iov, nr_iov, offset, RWF_DSYNC);
+return pwritev2(aiocb->aio_fildes,
+iov,
+iovcnt,
+aiocb->aio_offset,
+RWF_DSYNC);
 #else
-return pwritev2(fd, iov, nr_iov, offset, 0);
+ssize_t len = pwritev2(aiocb->aio_fildes,
+iov,
+iovcnt,
+aiocb->aio_offset,
+0);
+if (len == 0) {
+len = handle_aiocb_flush(aiocb);
+}
+return len;
 #endif
 }
 
@@ -1710,10 +1722,7 @@ static ssize_t handle_aiocb_rw_vector(RawPosixAIOData 
*aiocb)
 len = RETRY_ON_EINTR(
 (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) ?
 (aiocb->flags &  BDRV_REQ_FUA) ?
-qemu_pwrite_fua(aiocb->aio_fildes,
-aiocb->io.iov,
-aiocb->io.niov,
-aiocb->aio_offset) :
+qemu_pwritev_fua(aiocb, aiocb->io.iov, aiocb->io.niov) :
 qemu_pwritev(aiocb->aio_fildes,
 aiocb->io.iov,
 aiocb->io.niov,
@@ -1744,10 +1753,11 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData 
*aiocb, char *buf)
 while (offset < aiocb->aio_nbytes) {
 if (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) {
 if (aiocb->flags & BDRV_REQ_FUA) {
-len = qemu_pwrite_fua(aiocb->aio_fildes,
-aiocb->io.iov,
-aiocb->io.niov,
-aiocb->aio_offset);
+struct iovec iov = {
+.iov_base = buf,
+.iov_len = aiocb->aio_nbytes - offset,
+};
+len = qemu_pwritev_fua(aiocb, &iov, 1);
 } else {
 len = pwrite(aiocb->aio_fildes,
 (const char *)buf + offset,
@@ -2567,12 +2577,6 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, 
int64_t *offset_ptr,
 
 assert(qiov->size == bytes);
 ret = raw_thread_pool_submit(handle_aiocb_rw, &acb);
-#ifndef RWD_DSYNC
-if (ret == 0 && (flags & BDRV_REQ_FUA)) {
-/* TODO Use pwritev2() instead if it's available */
-ret = raw_co_flush_to_disk(bs);
-}
-#endif
 goto out; /* Avoid the compiler err of unused label */
 
 out:
-- 
2.43.0




Re: [PATCH v3 0/3] Enable QEMU NVMe userspace driver on s390x

2025-04-03 Thread Stefan Hajnoczi
On Thu, Apr 03, 2025 at 09:47:26AM +0200, Niklas Schnelle wrote:
> On Wed, 2025-04-02 at 11:51 -0400, Stefan Hajnoczi wrote:
> > On Tue, Apr 01, 2025 at 10:22:43AM -0700, Farhan Ali wrote:
> > > Hi,
> > > 
> > > Recently on s390x we have enabled mmap support for vfio-pci devices [1].
> > 
> > Hi Alex,
> > I wanted to bring this to your attention. Feel free to merge it through
> > the VFIO tree, otherwise I will merge it once you have taken a look.
> > 
> > Thanks,
> > Stefan
> > 
> > > This allows us to take advantage and use userspace drivers on s390x. 
> > > However,
> > > on s390x we have special instructions for MMIO access. Starting with z15 
> > > (and newer platforms) we have new PCI Memory I/O (MIO) instructions which 
> > > operate on virtually mapped PCI memory spaces, and can be used from 
> > > userspace.
> > > On older platforms we would fall back to using existing system calls for 
> > > MMIO access.
> > > 
> > > This patch series introduces support for the PCI MIO instructions, and 
> > > enables s390x
> > > support for the userspace NVMe driver on s390x. I would appreciate any 
> > > review/feedback
> > > on the patches.
> > > 
> > > Thanks
> > > Farhan
> 
> Hi Stefan,
> 
> the kernel patch actually made it into Linus' tree for v6.15 already as
> commit aa9f168d55dc ("s390/pci: Support mmap() of PCI resources except
> for ISM devices") plus prerequisites. This went via the PCI tree
> because they included a change to struct pci_dev and also enabled
> mmap() on PCI resource files. Alex reviewed an earlier version and was
> the one who suggested to also enable mmap() on PCI resources.

The introduction of a new QEMU API for accessing MMIO BARs in this
series is something Alex might be interested in as QEMU VFIO maintainer.
That wouldn't have been part of the kernel patch review.

If he's aware of the new API he can encourage other VFIO users to use it
in the future so that you won't need to convert them to work on s390x
again.

Stefan

> 
> Thanks,
> Niklas
> 
> > > 
> > > [1] 
> > > https://lore.kernel.org/linux-s390/20250226-vfio_pci_mmap-v7-0-c5c0f1d26...@linux.ibm.com/
> > > 
> > > ChangeLog
> > > -
> > > v2 series 
> > > https://mail.gnu.org/archive/html/qemu-devel/2025-03/msg06847.html
> > > v2 -> v3
> > > - Update the PCI MMIO APIs to reflect that it's PCI MMIO access on the 
> > > host, as suggested by Stefan (patch 2)
> > > - Move s390x ifdef check to s390x_pci_mmio.h as suggested by Philippe 
> > > (patch 1)
> > > - Add R-bs for the respective patches.
> > > 
> > > v1 series 
> > > https://mail.gnu.org/archive/html/qemu-devel/2025-03/msg06596.html
> > > v1 -> v2
> > > - Add 8 and 16 bit reads/writes for completeness (patch 1)
> > > - Introduce new QEMU PCI MMIO read/write API as suggested by Stefan 
> > > (patch 2)
> > > - Update NVMe userspace driver to use QEMU PCI MMIO functions (patch 
> > > 3)
> > > 
> > > Farhan Ali (3):
> > >   util: Add functions for s390x mmio read/write
> > >   include: Add a header to define host PCI MMIO functions
> > >   block/nvme: Use host PCI MMIO API
> > > 
> > >  block/nvme.c  |  37 +
> > >  include/qemu/host-pci-mmio.h  | 116 ++
> > >  include/qemu/s390x_pci_mmio.h |  24 ++
> > >  util/meson.build  |   2 +
> > >  util/s390x_pci_mmio.c | 148 ++
> > >  5 files changed, 311 insertions(+), 16 deletions(-)
> > >  create mode 100644 include/qemu/host-pci-mmio.h
> > >  create mode 100644 include/qemu/s390x_pci_mmio.h
> > >  create mode 100644 util/s390x_pci_mmio.c
> > > 
> > > -- 
> > > 2.43.0
> > > 
> 

