Re: [Xen-devel] [RFC v2 1/4] elf: Add optional function ptr to load_elf() to parse ELF notes
On Fri, Dec 21, 2018 at 08:03:49PM +, Liam Merwick wrote:
> diff --git a/include/hw/elf_ops.h b/include/hw/elf_ops.h
> index 74679ff8da3a..37d20a3800c1 100644
> --- a/include/hw/elf_ops.h
> +++ b/include/hw/elf_ops.h
> @@ -266,6 +266,7 @@ fail:
>  }
>
>  static int glue(load_elf, SZ)(const char *name, int fd,
> +                              uint64_t (*elf_note_fn)(void *, void *, bool),
>                                uint64_t (*translate_fn)(void *, uint64_t),
>                                void *translate_opaque,
>                                int must_swab, uint64_t *pentry,
> @@ -496,8 +497,30 @@ static int glue(load_elf, SZ)(const char *name, int fd,
>                  high = addr + mem_size;
>
>                  data = NULL;
> +
> +            } else if (ph->p_type == PT_NOTE && elf_note_fn) {
> +                struct elf_note *nhdr = NULL;
> +
> +                file_size = ph->p_filesz; /* Size of the range of ELF notes */
> +                data = g_malloc0(file_size);
> +                if (ph->p_filesz > 0) {
> +                    if (lseek(fd, ph->p_offset, SEEK_SET) < 0) {
> +                        goto fail;
> +                    }
> +                    if (read(fd, data, file_size) != file_size) {
> +                        goto fail;
> +                    }
> +                }
> +
> +                if (nhdr != NULL) {
> +                    bool is64 =
> +                        sizeof(struct elf_note) == sizeof(struct elf64_note);
> +                    elf_note_fn((void *)nhdr, (void *)&ph->p_align, is64);

How does data get used?

> +                }
> +                g_free(data);

Missing data = NULL to prevent double free later?

signature.asc
Description: PGP signature

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
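The double-free hazard raised here — `data` is freed in the PT_NOTE branch but the pointer is left dangling for the shared cleanup path — comes down to the free-and-clear idiom. A minimal standalone sketch (plain C with a hypothetical `free_and_clear()` helper, not the QEMU code itself):

```c
#include <assert.h>
#include <stdlib.h>

/* Free *p and reset it to NULL so that a shared error/cleanup path that
 * later calls free() on the same pointer is a harmless no-op rather
 * than a double free (undefined behavior). */
static void free_and_clear(void **p)
{
    free(*p);
    *p = NULL;
}
```

This is why the review suggests adding `data = NULL` right after `g_free(data)`: both `free(NULL)` and `g_free(NULL)` are defined to do nothing, so clearing the pointer makes a later unconditional free on the same variable safe.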
Re: [Xen-devel] [RFC v2 2/4] elf-ops.h: Add get_elf_note_type()
On Fri, Dec 21, 2018 at 08:03:50PM +, Liam Merwick wrote:
> +    while (note_type != elf_note_type) {
> +        nhdr_namesz = nhdr->n_namesz;
> +        nhdr_descsz = nhdr->n_descsz;
> +
> +        elf_note_entry_offset = nhdr_size +
> +            QEMU_ALIGN_UP(nhdr_namesz, phdr_align) +
> +            QEMU_ALIGN_UP(nhdr_descsz, phdr_align);
> +
> +        /* If the offset calculated in this iteration exceeds the
> +         * supplied size, we are done and no matching note was found.
> +         */

Indentation is off here.  QEMU uses 4-space indentation.

> +        if (elf_note_entry_offset > note_size) {
> +            return NULL;
> +        }
> +
> +        /* skip to the next ELF Note entry */
> +        nhdr = (void *)nhdr + elf_note_entry_offset;
> +        note_type = nhdr->n_type;
> +    }
> +
> +    return nhdr;
> +}
> +
>  static int glue(load_elf, SZ)(const char *name, int fd,
>                                uint64_t (*elf_note_fn)(void *, void *, bool),
>                                uint64_t (*translate_fn)(void *, uint64_t),
> @@ -512,6 +555,13 @@ static int glue(load_elf, SZ)(const char *name, int fd,
>              }
>          }
>
> +    /* Search the ELF notes to find one with a type matching the
> +     * value passed in via 'translate_opaque'
> +     */
> +    nhdr = (struct elf_note *)data;

Ah, I see data gets used here!  It would be clearer to move loading of
data into this patch.

> +    assert(translate_opaque != NULL);
> +    nhdr = glue(get_elf_note_type, SZ)(nhdr, file_size, ph->p_align,
> +                                       *(uint64_t *)translate_opaque);

Indentation is off in this hunk.  QEMU uses 4-space indentation.
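The note-walk logic under review boils down to advancing by the note header size plus the name and descriptor sizes, each rounded up to the segment's alignment. A standalone sketch of that offset arithmetic (hypothetical `align_up()`/`next_note_offset()` helpers, not QEMU's actual macros):

```c
#include <assert.h>
#include <stddef.h>

/* Round v up to the next multiple of a (a must be a power of two),
 * mirroring what QEMU_ALIGN_UP does in the patch. */
static size_t align_up(size_t v, size_t a)
{
    return (v + a - 1) & ~(a - 1);
}

/* Offset from one ELF note entry to the next: the fixed note header,
 * then the name and descriptor payloads, each padded to the alignment. */
static size_t next_note_offset(size_t hdr_size, size_t namesz,
                               size_t descsz, size_t align)
{
    return hdr_size + align_up(namesz, align) + align_up(descsz, align);
}
```

With the real macro, the alignment must be a power of two, which `p_align` is for well-formed ELF files.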
Re: [Xen-devel] [RFC v2 4/4] pvh: Boot uncompressed kernel using direct boot ABI
On Fri, Dec 21, 2018 at 08:03:52PM +, Liam Merwick wrote:
> @@ -1336,7 +1470,7 @@ void pc_memory_init(PCMachineState *pcms,
>      int linux_boot, i;
>      MemoryRegion *ram, *option_rom_mr;
>      MemoryRegion *ram_below_4g, *ram_above_4g;
> -    FWCfgState *fw_cfg;
> +    FWCfgState *fw_cfg = NULL;

What is the purpose of this change?
Re: [Xen-devel] [Qemu-devel] [PATCH v3 0/4] QEMU changes to do PVH boot
On Mon, Jan 21, 2019 at 08:19:03AM +, Liam Merwick wrote:
> On 21/01/2019 02:31, no-re...@patchew.org wrote:
> > Patchew URL:
> > https://patchew.org/QEMU/1547554687-12687-1-git-send-email-liam.merw...@oracle.com/
> ...
> >   CC      dma-helpers.o
> >   CC      vl.o
> > /tmp/qemu-test/src/block/sheepdog.c: In function 'find_vdi_name':
> > /tmp/qemu-test/src/block/sheepdog.c:1239:5: error: 'strncpy' specified
> > bound 256 equals destination size [-Werror=stringop-truncation]
> >      strncpy(buf + SD_MAX_VDI_LEN, tag, SD_MAX_VDI_TAG_LEN);
> >      ^~
> > cc1: all warnings being treated as errors
>
> Given the PVH patch series was posted 5 days ago and the following change
> was committed 3 days ago, I'm assuming this is not related to the PVH
> changes (which do not touch this file).
>
> commit 97b583f46c435aaa40942ca73739d79190776b7f
> Author: Philippe Mathieu-Daudé
> Date:   Thu Jan 3 09:56:35 2019 +0100
>
>     block/sheepdog: Use QEMU_NONSTRING for non NUL-terminated arrays

Yes, don't worry, it's a false positive.

Stefan
[Xen-devel] [PULL 0/7] Tracing patches
The following changes since commit d97a39d903fe33c45be83ac6943a2f82a3649a11:

  Merge remote-tracking branch 'remotes/ehabkost/tags/x86-next-pull-request' into staging (2019-03-22 09:37:38 +)

are available in the Git repository at:

  git://github.com/stefanha/qemu.git tags/tracing-pull-request

for you to fetch changes up to dec9776049e32d6c830127b286530c5f53267eff:

  trace-events: Fix attribution of trace points to source (2019-03-22 16:18:07 +)

----------------------------------------------------------------
Pull request

Compilation fixes and cleanups for QEMU 4.0.0.

----------------------------------------------------------------

Markus Armbruster (5):
  trace-events: Consistently point to docs/devel/tracing.txt
  trace-events: Shorten file names in comments
  scripts/cleanup-trace-events: Update for current practice
  trace-events: Delete unused trace points
  trace-events: Fix attribution of trace points to source

Stefan Hajnoczi (2):
  trace: handle tracefs path truncation
  trace: avoid SystemTap dtrace(1) warnings on empty files

 trace/ftrace.c                  | 12 +-
 accel/kvm/trace-events          |  2 +-
 accel/tcg/trace-events          |  2 +-
 audio/trace-events              |  6 +--
 authz/trace-events              | 10 ++---
 block/trace-events              | 49 +++
 chardev/trace-events            |  4 +-
 crypto/trace-events             | 10 ++---
 hw/9pfs/trace-events            |  2 +-
 hw/acpi/trace-events            |  6 +--
 hw/alpha/trace-events           |  2 +-
 hw/arm/trace-events             | 17 +++-
 hw/audio/trace-events           |  6 +--
 hw/block/dataplane/trace-events |  2 +-
 hw/block/trace-events           | 15 ---
 hw/char/trace-events            | 24 +--
 hw/display/trace-events         | 28 ++---
 hw/dma/trace-events             |  6 +--
 hw/gpio/trace-events            |  2 +-
 hw/hppa/trace-events            |  2 +-
 hw/i2c/trace-events             |  2 +-
 hw/i386/trace-events            | 10 ++---
 hw/i386/xen/trace-events        |  6 ++-
 hw/ide/trace-events             | 23 +--
 hw/input/trace-events           | 16
 hw/intc/trace-events            | 35 -
 hw/isa/trace-events             |  4 +-
 hw/mem/trace-events             |  4 +-
 hw/misc/macio/trace-events      |  9 ++---
 hw/misc/trace-events            | 40 +--
 hw/net/trace-events             | 42 ++--
 hw/nvram/trace-events           |  4 +-
 hw/pci-host/trace-events        |  6 +--
 hw/pci/trace-events             |  6 +--
 hw/ppc/trace-events             | 40 +--
 hw/rdma/trace-events            |  6 +--
 hw/rdma/vmw/trace-events        |  8 ++--
 hw/s390x/trace-events           |  4 +-
 hw/scsi/trace-events            | 22 +--
 hw/sd/trace-events              | 13 +++---
 hw/sparc/trace-events           |  6 +--
 hw/sparc64/trace-events         |  6 +--
 hw/timer/trace-events           | 24 +--
 hw/tpm/trace-events             | 12 +++---
 hw/usb/trace-events             | 22 +--
 hw/vfio/trace-events            | 15 ---
 hw/virtio/trace-events          | 10 ++---
 hw/watchdog/trace-events        |  2 +-
 hw/xen/trace-events             |  6 +--
 io/trace-events                 | 12 +++---
 linux-user/trace-events         |  3 +-
 migration/trace-events          | 70 ++---
 nbd/trace-events                | 10 ++---
 net/trace-events                | 10 ++---
 qapi/trace-events               |  4 +-
 qom/trace-events                |  2 +-
 scripts/cleanup-trace-events.pl | 19 ++---
 scripts/tracetool/format/d.py   |  5 +++
 scsi/trace-events               |  4 +-
 target/arm/trace-events         |  4 +-
 target/hppa/trace-events        |  4 +-
 target/i386/trace-events        |  4 +-
 target/mips/trace-events        |  2 +-
 target/ppc/trace-events         |  2 +-
 target/s390x/trace-events       | 10 ++---
 target/sparc/trace-events       |  8 ++--
 trace-events                    | 13 +-
 ui/trace-events                 | 19 +
 util/trace-events               | 28 ++---
 69 files changed, 438 insertions(+), 405 deletions(-)

-- 
2.20.1
[Xen-devel] [PULL 6/7] trace-events: Delete unused trace points
From: Markus Armbruster

Tracked down with cleanup-trace-events.pl.  Funnies requiring manual
post-processing:

* block.c and blockdev.c trace points are in block/trace-events.

* hw/block/nvme.c uses the preprocessor to hide its trace point use
  from cleanup-trace-events.pl.

* include/hw/xen/xen_common.h trace points are in hw/xen/trace-events.

* net/colo-compare and net/filter-rewriter.c use pseudo trace points
  colo_compare_udp_miscompare and colo_filter_rewriter_debug to guard
  debug code.

Signed-off-by: Markus Armbruster
Message-id: 20190314180929.27722-5-arm...@redhat.com
Message-Id: <20190314180929.27722-5-arm...@redhat.com>
Signed-off-by: Stefan Hajnoczi
---
 block/trace-events         | 1 -
 hw/arm/trace-events        | 7 ---
 hw/block/trace-events      | 2 --
 hw/display/trace-events    | 1 -
 hw/i386/trace-events       | 2 --
 hw/ide/trace-events        | 1 -
 hw/intc/trace-events       | 1 -
 hw/misc/macio/trace-events | 1 -
 hw/misc/trace-events       | 2 --
 hw/ppc/trace-events        | 8
 hw/sd/trace-events         | 1 -
 hw/vfio/trace-events       | 1 -
 nbd/trace-events           | 2 --
 util/trace-events          | 2 --
 14 files changed, 32 deletions(-)

diff --git a/block/trace-events b/block/trace-events
index 28b6364f28..e6bb5a8f05 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -57,7 +57,6 @@ qmp_block_stream(void *bs, void *job) "bs %p job %p"
 
 # file-posix.c
 # file-win32.c
-file_paio_submit_co(int64_t offset, int count, int type) "offset %"PRId64" count %d type %d"
 file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "acb %p opaque %p offset %"PRId64" count %d type %d"
 file_copy_file_range(void *bs, int src, int64_t src_off, int dst, int64_t dst_off, int64_t bytes, int flags, int64_t ret) "bs %p src_fd %d offset %"PRIu64" dst_fd %d offset %"PRIu64" bytes %"PRIu64" flags %d ret %"PRId64
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index 3e91ca37a9..441d12df5e 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -5,8 +5,6 @@ virt_acpi_setup(void) "No fw cfg or ACPI disabled. Bailing out."
 
 # smmu-common.c
 smmu_add_mr(const char *name) "%s"
-smmu_page_walk(int stage, uint64_t baseaddr, int first_level, uint64_t start, uint64_t end) "stage=%d, baseaddr=0x%"PRIx64", first level=%d, start=0x%"PRIx64", end=0x%"PRIx64
-smmu_lookup_table(int level, uint64_t baseaddr, int granule_sz, uint64_t start, uint64_t end, int flags, uint64_t subpage_size) "level=%d baseaddr=0x%"PRIx64" granule=%d, start=0x%"PRIx64" end=0x%"PRIx64" flags=%d subpage_size=0x%"PRIx64
 smmu_ptw_level(int level, uint64_t iova, size_t subpage_size, uint64_t baseaddr, uint32_t offset, uint64_t pte) "level=%d iova=0x%"PRIx64" subpage_sz=0x%zx baseaddr=0x%"PRIx64" offset=%d => pte=0x%"PRIx64
 smmu_ptw_invalid_pte(int stage, int level, uint64_t baseaddr, uint64_t pteaddr, uint32_t offset, uint64_t pte) "stage=%d level=%d base@=0x%"PRIx64" pte@=0x%"PRIx64" offset=%d pte=0x%"PRIx64
 smmu_ptw_page_pte(int stage, int level, uint64_t iova, uint64_t baseaddr, uint64_t pteaddr, uint64_t pte, uint64_t address) "stage=%d level=%d iova=0x%"PRIx64" base@=0x%"PRIx64" pte@=0x%"PRIx64" pte=0x%"PRIx64" page address = 0x%"PRIx64
@@ -29,12 +27,7 @@ smmuv3_cmdq_consume(uint32_t prod, uint32_t cons, uint8_t prod_wrap, uint8_t con
 smmuv3_cmdq_opcode(const char *opcode) "<--- %s"
 smmuv3_cmdq_consume_out(uint32_t prod, uint32_t cons, uint8_t prod_wrap, uint8_t cons_wrap) "prod:%d, cons:%d, prod_wrap:%d, cons_wrap:%d "
 smmuv3_cmdq_consume_error(const char *cmd_name, uint8_t cmd_error) "Error on %s command execution: %d"
-smmuv3_update(bool is_empty, uint32_t prod, uint32_t cons, uint8_t prod_wrap, uint8_t cons_wrap) "q empty:%d prod:%d cons:%d p.wrap:%d p.cons:%d"
-smmuv3_update_check_cmd(int error) "cmdq not enabled or error :0x%x"
 smmuv3_write_mmio(uint64_t addr, uint64_t val, unsigned size, uint32_t r) "addr: 0x%"PRIx64" val:0x%"PRIx64" size: 0x%x(%d)"
-smmuv3_write_mmio_idr(uint64_t addr, uint64_t val) "write to RO/Unimpl reg 0x%"PRIx64" val64:0x%"PRIx64
-smmuv3_write_mmio_evtq_cons_bef_clear(uint32_t prod, uint32_t cons, uint8_t prod_wrap, uint8_t cons_wrap) "Before clearing interrupt prod:0x%x cons:0x%x prod.w:%d cons.w:%d"
-smmuv3_write_mmio_evtq_cons_after_clear(uint32_t prod, uint32_t cons, uint8_t prod_wrap, uint8_t cons_wrap) "after clearing interrupt prod:0x%x cons:0x%x prod.w:%d cons.w:%d"
 smmuv3_record_event(const char *type, uint32_t sid) "%s sid=%d"
 smmuv3_find_ste(uint16_t sid, uint32_t f
[Xen-devel] [PULL 2/7] trace: avoid SystemTap dtrace(1) warnings on empty files
target/hppa/trace-events only contains disabled events, resulting in a
trace-dtrace.dtrace file that says "provider qemu {}".  SystemTap's
dtrace(1) tool prints a warning when processing this input file.

This patch avoids the error by emitting an empty file instead of
"provider qemu {}" when there are no enabled trace events.

Fixes: 23c3d569f44284066714ff7c46bc4f19e630583f ("target/hppa: add TLB trace events")
Reported-by: Markus Armbruster
Signed-off-by: Stefan Hajnoczi
Reviewed-by: Markus Armbruster
Reviewed-by: Liam Merwick
Message-id: 20190321170831.6539-3-stefa...@redhat.com
Message-Id: <20190321170831.6539-3-stefa...@redhat.com>
Signed-off-by: Stefan Hajnoczi
---
 scripts/tracetool/format/d.py | 5 +
 1 file changed, 5 insertions(+)

diff --git a/scripts/tracetool/format/d.py b/scripts/tracetool/format/d.py
index 78397c24d2..c7cb2a93a6 100644
--- a/scripts/tracetool/format/d.py
+++ b/scripts/tracetool/format/d.py
@@ -33,6 +33,11 @@ def generate(events, backend, group):
 
     events = [e for e in events if "disable" not in e.properties]
 
+    # SystemTap's dtrace(1) warns about empty "provider qemu {}" but is happy
+    # with an empty file.  Avoid the warning.
+    if not events:
+        return
+
     out('/* This file is autogenerated by tracetool, do not edit. */'
         '',
         'provider qemu {')
-- 
2.20.1
[Xen-devel] [PULL 1/7] trace: handle tracefs path truncation
If the tracefs mountpoint has a very long path we may exceed PATH_MAX.
This is a system misconfiguration and the user must resolve it so that
applications can perform path-based system calls successfully.

This issue does not occur on real-world systems since tracefs is mounted
on /sys/kernel/debug/tracing/, but the compiler is smart enough to
foresee the possibility and warn about the unchecked snprintf(3) return
value.

This patch fixes the compiler warning.

Reported-by: Markus Armbruster
Signed-off-by: Stefan Hajnoczi
Reviewed-by: Markus Armbruster
Reviewed-by: Liam Merwick
Message-id: 20190321170831.6539-2-stefa...@redhat.com
Message-Id: <20190321170831.6539-2-stefa...@redhat.com>
Signed-off-by: Stefan Hajnoczi
---
 trace/ftrace.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/trace/ftrace.c b/trace/ftrace.c
index 61692a8682..9749543d9b 100644
--- a/trace/ftrace.c
+++ b/trace/ftrace.c
@@ -53,7 +53,11 @@ bool ftrace_init(void)
     }
 
     if (tracefs_found) {
-        snprintf(path, PATH_MAX, "%s%s/tracing_on", mount_point, subdir);
+        if (snprintf(path, PATH_MAX, "%s%s/tracing_on", mount_point, subdir)
+                >= sizeof(path)) {
+            fprintf(stderr, "Using tracefs mountpoint would exceed PATH_MAX\n");
+            return false;
+        }
         trace_fd = open(path, O_WRONLY);
         if (trace_fd < 0) {
             if (errno == EACCES) {
@@ -72,7 +76,11 @@ bool ftrace_init(void)
             }
             close(trace_fd);
         }
-        snprintf(path, PATH_MAX, "%s%s/trace_marker", mount_point, subdir);
+        if (snprintf(path, PATH_MAX, "%s%s/trace_marker", mount_point, subdir)
+                >= sizeof(path)) {
+            fprintf(stderr, "Using tracefs mountpoint would exceed PATH_MAX\n");
+            return false;
+        }
         trace_marker_fd = open(path, O_WRONLY);
         if (trace_marker_fd < 0) {
             perror("Could not open ftrace 'trace_marker' file");
-- 
2.20.1
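The pattern this patch applies — checking the snprintf(3) return value against the buffer size to detect truncation — can be shown in isolation. A standalone sketch with a hypothetical `build_trace_path()` helper (not the actual ftrace.c code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "<mount_point><subdir>/<name>" into buf, failing cleanly on
 * truncation instead of silently opening a wrong, clipped path.
 * Returns 0 on success, -1 if the result would not fit: snprintf
 * returns the length it *wanted* to write, so a return value >= buflen
 * means the output was truncated. */
static int build_trace_path(char *buf, size_t buflen,
                            const char *mount_point, const char *subdir,
                            const char *name)
{
    int n = snprintf(buf, buflen, "%s%s/%s", mount_point, subdir, name);
    if (n < 0 || (size_t)n >= buflen) {
        return -1;
    }
    return 0;
}
```

This mirrors the `>= sizeof(path)` comparison in the patch: the cast guards the signed/unsigned comparison, and a negative return (an encoding error) is also treated as failure.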
[Xen-devel] [PULL 3/7] trace-events: Consistently point to docs/devel/tracing.txt
From: Markus Armbruster

Almost all trace-events point to docs/devel/tracing.txt in a comment
right at the beginning.  Touch up the ones that don't.

[Updated with Markus' new commit description wording.
--Stefan]

Signed-off-by: Markus Armbruster
Reviewed-by: Philippe Mathieu-Daudé
Message-id: 20190314180929.27722-2-arm...@redhat.com
Message-Id: <20190314180929.27722-2-arm...@redhat.com>
Signed-off-by: Stefan Hajnoczi
---
 accel/kvm/trace-events   | 2 +-
 accel/tcg/trace-events   | 2 +-
 hw/i386/xen/trace-events | 2 ++
 nbd/trace-events         | 2 ++
 qapi/trace-events        | 2 ++
 scsi/trace-events        | 2 ++
 trace-events             | 2 +-
 7 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index 8841025d68..33c5b1b3af 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -1,4 +1,4 @@
-# Trace events for debugging and performance instrumentation
+# See docs/devel/tracing.txt for syntax documentation.
 
 # kvm-all.c
 kvm_ioctl(int type, void *arg) "type 0x%x, arg %p"
diff --git a/accel/tcg/trace-events b/accel/tcg/trace-events
index c22ad60af7..01852217a6 100644
--- a/accel/tcg/trace-events
+++ b/accel/tcg/trace-events
@@ -1,4 +1,4 @@
-# Trace events for debugging and performance instrumentation
+# See docs/devel/tracing.txt for syntax documentation.
 
 # TCG related tracing (mostly disabled by default)
 # cpu-exec.c
diff --git a/hw/i386/xen/trace-events b/hw/i386/xen/trace-events
index 8a9077cd4e..8732741541 100644
--- a/hw/i386/xen/trace-events
+++ b/hw/i386/xen/trace-events
@@ -1,3 +1,5 @@
+# See docs/devel/tracing.txt for syntax documentation.
+
 # hw/i386/xen/xen_platform.c
 xen_platform_log(char *s) "xen platform: %s"
diff --git a/nbd/trace-events b/nbd/trace-events
index 7f10ebd4e0..6db8375c3e 100644
--- a/nbd/trace-events
+++ b/nbd/trace-events
@@ -1,3 +1,5 @@
+# See docs/devel/tracing.txt for syntax documentation.
+
 # nbd/client.c
 nbd_send_option_request(uint32_t opt, const char *name, uint32_t len) "Sending option request %" PRIu32" (%s), len %" PRIu32
 nbd_receive_option_reply(uint32_t option, const char *optname, uint32_t type, const char *typename, uint32_t length) "Received option reply %" PRIu32" (%s), type %" PRIu32" (%s), len %" PRIu32
diff --git a/qapi/trace-events b/qapi/trace-events
index 70e049ea80..b123c5e302 100644
--- a/qapi/trace-events
+++ b/qapi/trace-events
@@ -1,3 +1,5 @@
+# See docs/devel/tracing.txt for syntax documentation.
+
 # qapi/qapi-visit-core.c
 visit_free(void *v) "v=%p"
 visit_complete(void *v, void *opaque) "v=%p opaque=%p"
diff --git a/scsi/trace-events b/scsi/trace-events
index f8a68b11eb..499098e50b 100644
--- a/scsi/trace-events
+++ b/scsi/trace-events
@@ -1,3 +1,5 @@
+# See docs/devel/tracing.txt for syntax documentation.
+
 # scsi/pr-manager.c
 pr_manager_execute(int fd, int cmd, int sa) "fd=%d cmd=0x%02x service action=0x%02x"
 pr_manager_run(int fd, int cmd, int sa) "fd=%d cmd=0x%02x service action=0x%02x"
diff --git a/trace-events b/trace-events
index e66afc59e9..b48f417225 100644
--- a/trace-events
+++ b/trace-events
@@ -1,4 +1,4 @@
-# Trace events for debugging and performance instrumentation
+# See docs/devel/tracing.txt for syntax documentation.
 #
 # This file is processed by the tracetool script during the build.
 #
-- 
2.20.1
[Xen-devel] [PULL 5/7] scripts/cleanup-trace-events: Update for current practice
From: Markus Armbruster

Emit comments with shortened file names (previous commit).

Limit search to the input file's directory.

Cope with properties tcg (commit b2b36c22bd8) and vcpu (commit
3d211d9f4db).

Cope with capital letters in function names.

Signed-off-by: Markus Armbruster
Message-id: 20190314180929.27722-4-arm...@redhat.com
Message-Id: <20190314180929.27722-4-arm...@redhat.com>
Signed-off-by: Stefan Hajnoczi
---
 scripts/cleanup-trace-events.pl | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/scripts/cleanup-trace-events.pl b/scripts/cleanup-trace-events.pl
index e93abc00da..d4f0e4cab5 100755
--- a/scripts/cleanup-trace-events.pl
+++ b/scripts/cleanup-trace-events.pl
@@ -13,6 +13,7 @@
 
 use warnings;
 use strict;
+use File::Basename;
 
 my $buf = '';
 my %seen = ();
@@ -23,12 +24,19 @@ sub out {
     %seen = ();
 }
 
-while (<>) {
-    if (/^(disable )?([a-z_0-9]+)\(/) {
-        open GREP, '-|', 'git', 'grep', '-lw', "trace_$2"
+$#ARGV == 0 or die "usage: $0 FILE";
+my $in = $ARGV[0];
+my $dir = dirname($in);
+open(IN, $in) or die "open $in: $!";
+chdir($dir) or die "chdir $dir: $!";
+
+while (<IN>) {
+    if (/^(disable |(tcg) |vcpu )*([a-z_0-9]+)\(/i) {
+        my $pat = "trace_$3";
+        $pat .= '_tcg' if (defined $2);
+        open GREP, '-|', 'git', 'grep', '-lw', '--max-depth', '1', $pat
             or die "run git grep: $!";
-        my $fname;
-        while ($fname = <GREP>) {
+        while (my $fname = <GREP>) {
             chomp $fname;
             next if $seen{$fname} || $fname eq 'trace-events';
             $seen{$fname} = 1;
@@ -49,3 +57,4 @@ while (<>) {
 }
 
 out;
+close(IN) or die "close $in: $!";
-- 
2.20.1
[Xen-devel] [PULL 7/7] trace-events: Fix attribution of trace points to source
From: Markus Armbruster

Some trace points are attributed to the wrong source file.  Happens
when we neglect to update trace-events for code motion, or add events
in the wrong place, or misspell the file name.

Clean up with help of cleanup-trace-events.pl.  Same funnies as in the
previous commit, of course.  Manually shorten its change to
linux-user/trace-events to */signal.c.

Signed-off-by: Markus Armbruster
Message-id: 20190314180929.27722-6-arm...@redhat.com
Message-Id: <20190314180929.27722-6-arm...@redhat.com>
Signed-off-by: Stefan Hajnoczi
---
 authz/trace-events       |  2 +-
 hw/9pfs/trace-events     |  2 +-
 hw/arm/trace-events      |  4 ++--
 hw/block/trace-events    |  3 ++-
 hw/char/trace-events     |  2 +-
 hw/display/trace-events  |  3 ++-
 hw/ide/trace-events      |  6 --
 hw/input/trace-events    |  2 +-
 hw/misc/trace-events     |  8 +---
 hw/net/trace-events      |  8
 hw/ppc/trace-events      |  6 +-
 hw/timer/trace-events    |  4 ++--
 hw/vfio/trace-events     |  2 +-
 hw/watchdog/trace-events |  2 +-
 linux-user/trace-events  |  1 +
 migration/trace-events   | 44
 trace-events             | 11 ++
 ui/trace-events          |  5 +
 util/trace-events        |  6 +++---
 19 files changed, 78 insertions(+), 43 deletions(-)

diff --git a/authz/trace-events b/authz/trace-events
index 5cb577061c..e62ebb36b7 100644
--- a/authz/trace-events
+++ b/authz/trace-events
@@ -14,5 +14,5 @@ qauthz_list_default_policy(void *authz, const char *identity, int policy) "AuthZ
 qauthz_list_file_load(void *authz, const char *filename) "AuthZ file %p load filename=%s"
 qauthz_list_file_refresh(void *authz, const char *filename, int success) "AuthZ file %p load filename=%s success=%d"
 
-# pam.c
+# pamacct.c
 qauthz_pam_check(void *authz, const char *identity, const char *service) "AuthZ PAM %p identity=%s service=%s"
diff --git a/hw/9pfs/trace-events b/hw/9pfs/trace-events
index 0c14bda178..c0a0a4ab5d 100644
--- a/hw/9pfs/trace-events
+++ b/hw/9pfs/trace-events
@@ -1,6 +1,6 @@
 # See docs/devel/tracing.txt for syntax documentation.
 
-# virtio-9p.c
+# 9p.c
 v9fs_rcancel(uint16_t tag, uint8_t id) "tag %d id %d"
 v9fs_rerror(uint16_t tag, uint8_t id, int err) "tag %d id %d err %d"
 v9fs_version(uint16_t tag, uint8_t id, int32_t msize, char* version) "tag %d id %d msize %d version %s"
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index 441d12df5e..0acedcedc6 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -10,8 +10,6 @@ smmu_ptw_invalid_pte(int stage, int level, uint64_t baseaddr, uint64_t pteaddr,
 smmu_ptw_page_pte(int stage, int level, uint64_t iova, uint64_t baseaddr, uint64_t pteaddr, uint64_t pte, uint64_t address) "stage=%d level=%d iova=0x%"PRIx64" base@=0x%"PRIx64" pte@=0x%"PRIx64" pte=0x%"PRIx64" page address = 0x%"PRIx64
 smmu_ptw_block_pte(int stage, int level, uint64_t baseaddr, uint64_t pteaddr, uint64_t pte, uint64_t iova, uint64_t gpa, int bsize_mb) "stage=%d level=%d base@=0x%"PRIx64" pte@=0x%"PRIx64" pte=0x%"PRIx64" iova=0x%"PRIx64" block address = 0x%"PRIx64" block size = %d MiB"
 smmu_get_pte(uint64_t baseaddr, int index, uint64_t pteaddr, uint64_t pte) "baseaddr=0x%"PRIx64" index=0x%x, pteaddr=0x%"PRIx64", pte=0x%"PRIx64
-smmu_iotlb_cache_hit(uint16_t asid, uint64_t addr, uint32_t hit, uint32_t miss, uint32_t p) "IOTLB cache HIT asid=%d addr=0x%"PRIx64" hit=%d miss=%d hit rate=%d"
-smmu_iotlb_cache_miss(uint16_t asid, uint64_t addr, uint32_t hit, uint32_t miss, uint32_t p) "IOTLB cache MISS asid=%d addr=0x%"PRIx64" hit=%d miss=%d hit rate=%d"
 smmu_iotlb_inv_all(void) "IOTLB invalidate all"
 smmu_iotlb_inv_asid(uint16_t asid) "IOTLB invalidate asid=%d"
 smmu_iotlb_inv_iova(uint16_t asid, uint64_t addr) "IOTLB invalidate asid=%d addr=0x%"PRIx64
@@ -48,6 +46,8 @@ smmuv3_cmdq_tlbi_nh_va(int vmid, int asid, uint64_t addr, bool leaf) "vmid =%d a
 smmuv3_cmdq_tlbi_nh_vaa(int vmid, uint64_t addr) "vmid =%d addr=0x%"PRIx64
 smmuv3_cmdq_tlbi_nh(void) ""
 smmuv3_cmdq_tlbi_nh_asid(uint16_t asid) "asid=%d"
+smmu_iotlb_cache_hit(uint16_t asid, uint64_t addr, uint32_t hit, uint32_t miss, uint32_t p) "IOTLB cache HIT asid=%d addr=0x%"PRIx64" hit=%d miss=%d hit rate=%d"
+smmu_iotlb_cache_miss(uint16_t asid, uint64_t addr, uint32_t hit, uint32_t miss, uint32_t p) "IOTLB cache MISS asid=%d addr=0x%"PRIx64" hit=%d miss=%d hit rate=%d"
 smmuv3_config_cache_inv(uint32_t sid) "Config cache INV for sid %d"
 smmuv3_notify_flag_add(const char *iommu) "ADD SMMUNotifier node for iommu mr=%s"
 smmuv3_notify_flag_del
Re: [Xen-devel] [Qemu-block] [PATCH 2/3] xen-bus: allow AioContext to be specified for each event channel
On Wed, Apr 10, 2019 at 03:20:05PM +, Paul Durrant wrote:
> > -----Original Message-----
> > From: Anthony PERARD [mailto:anthony.per...@citrix.com]
> > Sent: 10 April 2019 13:57
> > To: Paul Durrant
> > Cc: qemu-de...@nongnu.org; qemu-bl...@nongnu.org;
> > xen-devel@lists.xenproject.org; Stefano Stabellini;
> > Stefan Hajnoczi; Kevin Wolf; Max Reitz
> > Subject: Re: [PATCH 2/3] xen-bus: allow AioContext to be specified for
> > each event channel
> >
> > On Mon, Apr 08, 2019 at 04:16:16PM +0100, Paul Durrant wrote:
> > > This patch adds an AioContext parameter to xen_device_bind_event_channel()
> > > and then uses aio_set_fd_handler() to set the callback rather than
> > > qemu_set_fd_handler().
> > >
> > > Signed-off-by: Paul Durrant
> > > ---
> > > @@ -943,6 +944,7 @@ static void xen_device_event(void *opaque)
> > >  }
> > >
> > >  XenEventChannel *xen_device_bind_event_channel(XenDevice *xendev,
> > > +                                               AioContext *ctx,
> > >                                                 unsigned int port,
> > >                                                 XenEventHandler handler,
> > >                                                 void *opaque, Error **errp)
> > > @@ -968,8 +970,9 @@ XenEventChannel *xen_device_bind_event_channel(XenDevice *xendev,
> > >      channel->handler = handler;
> > >      channel->opaque = opaque;
> > >
> > > -    qemu_set_fd_handler(xenevtchn_fd(channel->xeh), xen_device_event, NULL,
> > > -                        channel);
> > > +    channel->ctx = ctx;
> > > +    aio_set_fd_handler(channel->ctx, xenevtchn_fd(channel->xeh), false,
> > > +                       xen_device_event, NULL, NULL, channel);
> >
> > I wonder if the 'is_external' parameter of aio_set_fd_handler should be
> > 'true' here, instead.  That flag seems to be used when making a snapshot
> > of a blockdev, for example.
> >
> > That was introduced by:
> > dca21ef23ba48f6f1428c59f295a857e5dc203c8^..c07bc2c1658fffeee08eb46402b2f66d55b07586
> >
> > What do you think?
>
> Interesting.  I admit I was merely transcribing what qemu_set_fd_handler()
> passes without really looking into the values.  Looking at the arguments
> that virtio-blk passes to aio_set_event_notifier() though, and what
> 'is_external' means, it would appear that setting it to true is probably
> the right thing to do.  Do you want me to send a v2 of the series or can
> you fix it up?

Hi,

Handlers are invoked by the aio_poll() event loop.  Some handlers are
considered "external" in the sense that they submit new I/O requests
from the guest or outside world.  Others are considered "internal" in
the sense that they are part of the block layer and not an entry point
into the block layer.

There are points where the block layer wants to run the event loop but
new requests must not be submitted.  In this case aio_disable_external()
will be called so that "external" handlers are not processed.

For example, see virtio's virtio_queue_aio_set_host_notifier_handler().
This is the virtqueue kick ioeventfd and it shouldn't be processed when
aio_disable_external() has been called.

Stefan
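The external/internal split described above can be illustrated with a toy dispatcher. Everything here (`ToyContext`, `toy_poll()`, the handler table) is a made-up miniature, not QEMU's actual AioContext, but it shows the behavior: while external processing is disabled, ready "external" handlers are skipped and stay pending.

```c
#include <assert.h>
#include <stdbool.h>

typedef void (*Handler)(void *opaque);

typedef struct {
    Handler fn;
    void *opaque;
    bool is_external;   /* invoking it may submit new guest I/O */
    bool ready;
} ToyHandler;

typedef struct {
    ToyHandler handlers[8];
    int nhandlers;
    int external_disable_cnt;  /* bumped by the analogue of aio_disable_external() */
} ToyContext;

/* Test helper: count how often a handler ran. */
static void count_calls(void *opaque)
{
    (*(int *)opaque)++;
}

/* Dispatch all ready handlers, skipping "external" ones while external
 * processing is disabled (e.g. during a blockdev snapshot in QEMU). */
static int toy_poll(ToyContext *ctx)
{
    int dispatched = 0;
    for (int i = 0; i < ctx->nhandlers; i++) {
        ToyHandler *h = &ctx->handlers[i];
        if (!h->ready) {
            continue;
        }
        if (h->is_external && ctx->external_disable_cnt > 0) {
            continue;   /* stays pending until external is re-enabled */
        }
        h->fn(h->opaque);
        h->ready = false;
        dispatched++;
    }
    return dispatched;
}
```

The deferred external handler fires on the first poll after the disable count drops back to zero, which is the property the snapshot code relies on.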
Re: [Xen-devel] [RFC 1/3] pvh: Add x86/HVM direct boot ABI header file
On Wed, Dec 05, 2018 at 10:37:24PM +, Liam Merwick wrote:
> From: Liam Merwick
>
> The x86/HVM direct boot ABI permits Qemu to be able to boot directly
> into the uncompressed Linux kernel binary without the need to run firmware.
>
>     https://xenbits.xen.org/docs/unstable/misc/pvh.html
>
> This commit adds the header file that defines the start_info struct
> that needs to be populated in order to use this ABI.
>
> Signed-off-by: Maran Wilson
> Signed-off-by: Liam Merwick
> Reviewed-by: Konrad Rzeszutek Wilk
> ---
>  include/hw/xen/start_info.h | 146
>  1 file changed, 146 insertions(+)
>  create mode 100644 include/hw/xen/start_info.h

Does it make sense to bring in Linux include/xen/interface/hvm/start_info.h
via QEMU's include/standard-headers/?

QEMU has a script in scripts/update-linux-header.sh for syncing Linux
headers into include/standard-headers/.  This makes it easy to keep
Linux header files up-to-date.  We basically treat files in
include/standard-headers/ as auto-generated.

If you define start_info.h yourself without using
include/standard-headers/, then it won't be synced with Linux.
Re: [Xen-devel] [RFC 2/3] pc: Read PVH entry point from ELF note in kernel binary
On Wed, Dec 05, 2018 at 10:37:25PM +, Liam Merwick wrote: > From: Liam Merwick > > Add support to read the PVH Entry address from an ELF note in the > uncompressed kernel binary (as defined by the x86/HVM direct boot ABI). > This 32-bit entry point will be used by QEMU to load the kernel in the > guest and jump into the kernel entry point. > > For now, a call to this function is added in pc_memory_init() to read the > address - a future patch will use the entry point. > > Signed-off-by: Liam Merwick > --- > hw/i386/pc.c | 272 > +- > include/elf.h | 10 +++ > 2 files changed, 281 insertions(+), 1 deletion(-) > > diff --git a/hw/i386/pc.c b/hw/i386/pc.c > index f095725dbab2..056aa46d99b9 100644 > --- a/hw/i386/pc.c > +++ b/hw/i386/pc.c > @@ -109,6 +109,9 @@ static struct e820_entry *e820_table; > static unsigned e820_entries; > struct hpet_fw_config hpet_cfg = {.count = UINT8_MAX}; > > +/* Physical Address of PVH entry point read from kernel ELF NOTE */ > +static size_t pvh_start_addr; > + > void gsi_handler(void *opaque, int n, int level) > { > GSIState *s = opaque; > @@ -834,6 +837,267 @@ struct setup_data { > uint8_t data[0]; > } __attribute__((packed)); > > +/* > + * Search through the ELF Notes for an entry with the given > + * ELF Note type > + */ > +static void *get_elf_note_type(void *ehdr, void *phdr, bool elf_is64, > +size_t elf_note_type) Generic ELF code. Can you put it in hw/core/loader.c? > +{ > +void *nhdr = NULL; > +size_t nhdr_size = elf_is64 ? sizeof(Elf64_Nhdr) : sizeof(Elf32_Nhdr); > +size_t elf_note_entry_sz = 0; > +size_t phdr_off; > +size_t phdr_align; > +size_t phdr_memsz; > +size_t nhdr_namesz; > +size_t nhdr_descsz; > +size_t note_type; The macro tricks used by hw/core/loader.c are nasty, but I think they get the types right. Here the Elf64 on 32-bit host case is definitely broken due to using size_t. Perhaps 64-on-32 isn't supported, but getting the types right is worth discussing. > + > +phdr_off = elf_is64 ? 
> +((Elf64_Phdr *)phdr)->p_offset : ((Elf32_Phdr *)phdr)->p_offset; > +phdr_align = elf_is64 ? > +((Elf64_Phdr *)phdr)->p_align : ((Elf32_Phdr *)phdr)->p_align; > +phdr_memsz = elf_is64 ? > +((Elf64_Phdr *)phdr)->p_memsz : ((Elf32_Phdr *)phdr)->p_memsz; > + > +nhdr = ehdr + phdr_off; The ELF file is untrusted. All inputs must be validated. phdr_off could be an bogus/malicious value. > +note_type = elf_is64 ? > +((Elf64_Nhdr *)nhdr)->n_type : ((Elf32_Nhdr *)nhdr)->n_type; > +nhdr_namesz = elf_is64 ? > +((Elf64_Nhdr *)nhdr)->n_namesz : ((Elf32_Nhdr *)nhdr)->n_namesz; > +nhdr_descsz = elf_is64 ? > +((Elf64_Nhdr *)nhdr)->n_descsz : ((Elf32_Nhdr *)nhdr)->n_descsz; > + > +while (note_type != elf_note_type) { > +elf_note_entry_sz = nhdr_size + > +QEMU_ALIGN_UP(nhdr_namesz, phdr_align) + > +QEMU_ALIGN_UP(nhdr_descsz, phdr_align); > + > +/* > + * Verify that we haven't exceeded the end of the ELF Note section. > + * If we have, then there is no note of the given type present > + * in the ELF Notes. > + */ > +if (phdr_off + phdr_memsz < ((nhdr - ehdr) + elf_note_entry_sz)) { > +error_report("Note type (0x%lx) not found in ELF Note section", > +elf_note_type); > +return NULL; > +} > + > +/* skip to the next ELF Note entry */ > +nhdr += elf_note_entry_sz; > +note_type = elf_is64 ? > +((Elf64_Nhdr *)nhdr)->n_type : ((Elf32_Nhdr *)nhdr)->n_type; > +nhdr_namesz = elf_is64 ? > +((Elf64_Nhdr *)nhdr)->n_namesz : ((Elf32_Nhdr *)nhdr)->n_namesz; > +nhdr_descsz = elf_is64 ? > +((Elf64_Nhdr *)nhdr)->n_descsz : ((Elf32_Nhdr *)nhdr)->n_descsz; > +} > + > +return nhdr; > +} > + > +/* > + * The entry point into the kernel for PVH boot is different from > + * the native entry point. The PVH entry is defined by the x86/HVM > + * direct boot ABI and is available in an ELFNOTE in the kernel binary. 
> + * This function reads the ELF headers of the binary specified on the > + * command line by -kernel (path contained in 'filename') and discovers > + * the PVH entry address from the appropriate ELF Note. > + * > + * The address of the PVH entry point is saved to the 'pvh_start_addr' > + * global variable. The ELF class of the binary is returned via 'elfclass' > + * (although the entry point is 32-bit, the kernel binary can be either > + * 32-bit or 64-bit). > + */ > +static bool read_pvh_start_addr_elf_note(const char *filename, > +unsigned char *elfclass) > +{ Can this be integrated into ELF loading? For example, could the elf loader take a function pointer to perform additional logic (e.g. extracting the PVH entry point)? That avoids reparsing the input file. > +void *
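Stefan's point about validating untrusted ELF input is the crux of this review. As a rough illustration only (this is not QEMU's loader code; the struct, macro, and function names below are invented for the sketch, and it hard-codes the ELF note format's 4-byte alignment where the patch aligns on the segment's p_align), a note walk can bounds-check every header and entry size against the PT_NOTE segment before dereferencing anything:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ELF note header: n_namesz/n_descsz/n_type are 4-byte words in both
 * the ELF32 and ELF64 note formats. */
typedef struct {
    uint32_t n_namesz;
    uint32_t n_descsz;
    uint32_t n_type;
} ElfNhdr;

/* Round up to the 4-byte alignment used by the ELF note format. */
#define NOTE_ALIGN(x) (((size_t)(x) + 3) & ~(size_t)3)

/* Return the byte offset of the first note of the given type within a
 * PT_NOTE buffer, or -1 if none is found.  Every access is checked
 * against `size` so a truncated or malicious buffer cannot walk the
 * cursor out of bounds. */
static long find_elf_note(const uint8_t *buf, size_t size, uint32_t type)
{
    size_t off = 0;

    while (size - off >= sizeof(ElfNhdr)) {
        ElfNhdr nhdr;
        size_t entry_sz;

        memcpy(&nhdr, buf + off, sizeof(nhdr));
        entry_sz = sizeof(nhdr) + NOTE_ALIGN(nhdr.n_namesz) +
                   NOTE_ALIGN(nhdr.n_descsz);
        if (entry_sz > size - off) {
            return -1;  /* note claims to extend past the buffer */
        }
        if (nhdr.n_type == type) {
            return (long)off;
        }
        off += entry_sz;  /* skip to the next note entry */
    }
    return -1;  /* no matching note */
}
```

Because the size fields come straight from the file, the key property is that `entry_sz` is checked against the remaining buffer before it is either trusted for a match or used to advance the cursor.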
Re: [Xen-devel] [PATCH 2/2] avoid TABs in files that only contain a few
| 2 +-
> tests/tcg/cris/check_openpf1.c | 2 +-
> tests/tcg/cris/check_settls1.c | 2 +-
> tests/tcg/i386/hello-i386.c | 14 ++--
> tests/tcg/mips/hello-mips.c | 10 +--
> tests/tcg/multiarch/sha1.c | 12 +--
> tests/vhost-user-test.c | 4 +-
> ui/keymaps.h | 4 +-
> ui/qemu-pixman.c | 2 +-
> ui/vnc-enc-zywrle-template.c | 4 +-
> ui/vnc.c | 4 +-
> util/bitops.c | 4 +-
> util/osdep.c | 4 +-
> util/qemu-sockets.c | 4 +-
> 94 files changed, 388 insertions(+), 388 deletions(-)

Block parts:

Reviewed-by: Stefan Hajnoczi
Re: [Xen-devel] [Qemu-block] [PATCH] qemu: include generated files with <> and not ""
On Tue, Mar 20, 2018 at 03:54:36AM +0200, Michael S. Tsirkin wrote: > QEMU coding style at the moment asks for all non-system > include files to be used with #include "foo.h". > However this rule actually does not make sense and > creates issues for when the included file is generated. > > In C, include "file" means look in current directory, > then on include search path. Current directory here > means the source file directory. > By comparison include means look on include search path. > > As generated files are not in the search directory (unless the build > directory happens to match the source directory), it does not make sense > to include them with "" - doing so is merely more work for preprocessor > and a source or errors if a stale file happens to exist in the source > directory. > > This changes include directives for all generated files, across the > tree. The idea is to avoid sending a huge amount of email. But when > merging, the changes will be split with one commit per file, e.g. for > ease of bisect in case of build failures, and to ease merging. > > Note that should some generated files be missed by this tree-wide > refactoring, it isn't a big deal - this merely maintains the status quo, > and this can be addressed by a separate patch on top. Stale header files are a pain. I often do make distclean before checking out a radically different QEMU version to avoid the problem. This patch trades off the stale header file issue for a new approach to using <> vs "", which will be hard to use consistently in the future since it is unconventional. Is the build time improvement worth it (please post numbers)? Stefan signature.asc Description: PGP signature ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
[Xen-devel] [PATCH] compiler: add a sizeof_field() macro
Determining the size of a field is useful when you don't have a struct variable handy. Open-coding this is ugly. This patch adds the sizeof_field() macro, which is similar to typeof_field(). Existing instances are updated to use the macro. Signed-off-by: Stefan Hajnoczi --- include/hw/xen/io/ring.h | 2 +- include/qemu/compiler.h | 2 ++ accel/tcg/translate-all.c | 2 +- hw/display/xenfb.c| 4 ++-- hw/net/rocker/rocker_of_dpa.c | 2 +- hw/net/virtio-net.c | 2 +- target/i386/kvm.c | 2 +- target/ppc/arch_dump.c| 10 +- target/s390x/arch_dump.c | 20 ++-- 9 files changed, 24 insertions(+), 22 deletions(-) diff --git a/include/hw/xen/io/ring.h b/include/hw/xen/io/ring.h index abbca47687..ffa3ebadc8 100644 --- a/include/hw/xen/io/ring.h +++ b/include/hw/xen/io/ring.h @@ -65,7 +65,7 @@ typedef unsigned int RING_IDX; */ #define __CONST_RING_SIZE(_s, _sz) \ (__RD32(((_sz) - offsetof(struct _s##_sring, ring)) / \ - sizeof(((struct _s##_sring *)0)->ring[0]))) +sizeof_field(struct _s##_sring, ring[0]))) /* * The same for passing in an actual pointer instead of a name tag. */ diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h index 9f762695d1..5843812710 100644 --- a/include/qemu/compiler.h +++ b/include/qemu/compiler.h @@ -64,6 +64,8 @@ (type *) ((char *) __mptr - offsetof(type, member));}) #endif +#define sizeof_field(type, field) sizeof(((type *)0)->field) + /* Convert from a base type to a parent type, with compile time checking. 
*/ #ifdef __GNUC__ #define DO_UPCAST(type, field, dev) ( __extension__ ( { \ diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c index d48b56ca38..767066ecd6 100644 --- a/accel/tcg/translate-all.c +++ b/accel/tcg/translate-all.c @@ -132,7 +132,7 @@ typedef struct PageDesc { /* Make sure all possible CPU event bits fit in tb->trace_vcpu_dstate */ QEMU_BUILD_BUG_ON(CPU_TRACE_DSTATE_MAX_EVENTS > - sizeof(((TranslationBlock *)0)->trace_vcpu_dstate) + sizeof_field(TranslationBlock, trace_vcpu_dstate) * BITS_PER_BYTE); /* diff --git a/hw/display/xenfb.c b/hw/display/xenfb.c index f5afcc0358..911291c5c3 100644 --- a/hw/display/xenfb.c +++ b/hw/display/xenfb.c @@ -525,8 +525,8 @@ static int xenfb_configure_fb(struct XenFB *xenfb, size_t fb_len_lim, int width, int height, int depth, size_t fb_len, int offset, int row_stride) { -size_t mfn_sz = sizeof(*((struct xenfb_page *)0)->pd); -size_t pd_len = sizeof(((struct xenfb_page *)0)->pd) / mfn_sz; +size_t mfn_sz = sizeof_field(struct xenfb_page, pd[0]); +size_t pd_len = sizeof_field(struct xenfb_page, pd) / mfn_sz; size_t fb_pages = pd_len * XC_PAGE_SIZE / mfn_sz; size_t fb_len_max = fb_pages * XC_PAGE_SIZE; int max_width, max_height; diff --git a/hw/net/rocker/rocker_of_dpa.c b/hw/net/rocker/rocker_of_dpa.c index 60046720a5..8e347d1ee4 100644 --- a/hw/net/rocker/rocker_of_dpa.c +++ b/hw/net/rocker/rocker_of_dpa.c @@ -104,7 +104,7 @@ typedef struct of_dpa_flow_key { /* Width of key which includes field 'f' in u64s, rounded up */ #define FLOW_KEY_WIDTH(f) \ -DIV_ROUND_UP(offsetof(OfDpaFlowKey, f) + sizeof(((OfDpaFlowKey *)0)->f), \ +DIV_ROUND_UP(offsetof(OfDpaFlowKey, f) + sizeof_field(OfDpaFlowKey, f), \ sizeof(uint64_t)) typedef struct of_dpa_flow_action { diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index 90502fca7c..f154756e85 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -46,7 +46,7 @@ * 'container'. 
*/ #define endof(container, field) \ -(offsetof(container, field) + sizeof(((container *)0)->field)) +(offsetof(container, field) + sizeof_field(container, field)) typedef struct VirtIOFeature { uint64_t flags; diff --git a/target/i386/kvm.c b/target/i386/kvm.c index 445e0e0b11..ad0e904b2c 100644 --- a/target/i386/kvm.c +++ b/target/i386/kvm.c @@ -1526,7 +1526,7 @@ static int kvm_put_fpu(X86CPU *cpu) #define XSAVE_PKRU672 #define XSAVE_BYTE_OFFSET(word_offset) \ -((word_offset) * sizeof(((struct kvm_xsave *)0)->region[0])) +((word_offset) * sizeof_field(struct kvm_xsave, region[0])) #define ASSERT_OFFSET(word_offset, field) \ QEMU_BUILD_BUG_ON(XSAVE_BYTE_OFFSET(word_offset) != \ diff --git a/target/ppc/arch_dump.c b/target/ppc/arch_dump.c index 351a65b22f..cc1460e4e3 100644 --- a/target/ppc/arch_dump.c +++ b/target/ppc/arch_dump.c @@ -210,11 +210,11 @@ static const struct NoteFuncDescStruct { int contents_size; void (*note_contents_func)(NoteFuncArg *arg, PowerPCCPU *cpu); } note_func[] = { -{sizeof(((Note *)0)->contents.prstatus), ppc_write_elf_prstatus}, -{sizeof(((Note *)0)->contents.fpregset), ppc_write_elf_fpregset}, -{sizeof(((Note *)0)
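The macro's effect is easy to check in isolation. Below is a tiny, self-contained illustration — only the two macros are taken from the patch; the struct is made up here, standing in for something like struct xenfb_page:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Both macros as they appear in the patch. */
#define sizeof_field(type, field) sizeof(((type *)0)->field)
#define endof(container, field) \
    (offsetof(container, field) + sizeof_field(container, field))

/* Hypothetical struct for demonstration purposes only. */
struct demo_page {
    uint32_t flags;
    uint64_t pd[8];
};
```

Note that `sizeof_field(struct demo_page, pd)` gives the whole array's size while `sizeof_field(struct demo_page, pd[0])` gives one element's size — the same distinction the xenfb.c hunk relies on.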
Re: [Xen-devel] [Qemu-devel] [PATCH] compiler: add a sizeof_field() macro
On Thu, Jun 14, 2018 at 9:33 PM, Philippe Mathieu-Daudé wrote: > On 06/14/2018 04:17 PM, John Snow wrote: >> On 06/14/2018 12:44 PM, Stefan Hajnoczi wrote: >>> Determining the size of a field is useful when you don't have a struct >>> variable handy. Open-coding this is ugly. >>> >>> This patch adds the sizeof_field() macro, which is similar to >>> typeof_field(). Existing instances are updated to use the macro. >>> >>> Signed-off-by: Stefan Hajnoczi >> >> How'd you find all the existing instances? > > This works: > > $ git grep -E 'sizeof.*)0)->' Yes, I used a similar grep command-line. I also checked for "sizeof.*)NULL" but nothing uses that syntax. Stefan ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [Qemu-devel] [PATCH] compiler: add a sizeof_field() macro
On Thu, Jun 14, 2018 at 05:44:31PM +0100, Stefan Hajnoczi wrote:
> Determining the size of a field is useful when you don't have a struct
> variable handy. Open-coding this is ugly.
>
> This patch adds the sizeof_field() macro, which is similar to
> typeof_field(). Existing instances are updated to use the macro.
>
> Signed-off-by: Stefan Hajnoczi
> ---
> include/hw/xen/io/ring.h | 2 +-
> include/qemu/compiler.h | 2 ++
> accel/tcg/translate-all.c | 2 +-
> hw/display/xenfb.c | 4 ++--
> hw/net/rocker/rocker_of_dpa.c | 2 +-
> hw/net/virtio-net.c | 2 +-
> target/i386/kvm.c | 2 +-
> target/ppc/arch_dump.c | 10 +-
> target/s390x/arch_dump.c | 20 ++--
> 9 files changed, 24 insertions(+), 22 deletions(-)

Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan
Re: [PATCH] qemu/atomic.h: prefix qemu_ to solve collisions
On Mon, Sep 21, 2020 at 01:56:08PM -0700, no-re...@patchew.org wrote: > ERROR: Macros with multiple statements should be enclosed in a do - while loop > #2968: FILE: include/qemu/atomic.h:152: > +#define qemu_atomic_rcu_read__nocheck(ptr, valptr) \ > __atomic_load(ptr, valptr, __ATOMIC_RELAXED); \ > smp_read_barrier_depends(); > > ERROR: space required before that '*' (ctx:VxB) > #3123: FILE: include/qemu/atomic.h:347: > +#define qemu_atomic_read__nocheck(p) (*(__typeof__(*(p)) volatile*) (p)) > ^ > > ERROR: Use of volatile is usually wrong, please add a comment > #3123: FILE: include/qemu/atomic.h:347: > +#define qemu_atomic_read__nocheck(p) (*(__typeof__(*(p)) volatile*) (p)) > > ERROR: space required before that '*' (ctx:VxB) > #3125: FILE: include/qemu/atomic.h:349: > +((*(__typeof__(*(p)) volatile*) (p)) = (i)) > ^ > > ERROR: Use of volatile is usually wrong, please add a comment > #3125: FILE: include/qemu/atomic.h:349: > +((*(__typeof__(*(p)) volatile*) (p)) = (i)) > > ERROR: space required after that ',' (ctx:VxV) > #3130: FILE: include/qemu/atomic.h:352: > +#define qemu_atomic_set(ptr, i) qemu_atomic_set__nocheck(ptr,i) > ^ > > ERROR: memory barrier without comment > #3205: FILE: include/qemu/atomic.h:410: > +#define qemu_atomic_xchg(ptr, i) (smp_mb(), __sync_lock_test_and_set(ptr, i)) > > WARNING: Block comments use a leading /* on a separate line > #3280: FILE: include/qemu/atomic.h:462: > +/* qemu_atomic_mb_read/set semantics map Java volatile variables. 
They are > > WARNING: Block comments use a leading /* on a separate line > #6394: FILE: util/bitmap.c:214: > +/* If we avoided the full barrier in qemu_atomic_or(), issue a > > WARNING: Block comments use a leading /* on a separate line > #7430: FILE: util/rcu.c:85: > +/* Instead of using qemu_atomic_mb_set for index->waiting, and > > WARNING: Block comments use a leading /* on a separate line > #7456: FILE: util/rcu.c:154: > +/* In either case, the qemu_atomic_mb_set below blocks stores that > free > > total: 7 errors, 4 warnings, 6507 lines checked These are pre-existing coding style issues. This is a big patch that tries to make as few actual changes as possible so I would rather not try to fix them. Stefan signature.asc Description: PGP signature
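The first checkpatch error above ("Macros with multiple statements should be enclosed in a do - while loop") refers to a classic C hazard: a macro that expands to two statements silently escapes the scope of a surrounding `if`. A minimal demonstration — the macros here are hypothetical and have nothing to do with atomic.h itself:

```c
#include <assert.h>

/* Unsafe: expands to two statements, so only the first one is
 * controlled by a surrounding `if`. */
#define BUMP_TWICE_UNSAFE(x) (x)++; (x)++

/* Safe: do { ... } while (0) packages both statements into a single
 * statement that behaves correctly after `if` and before `else`. */
#define BUMP_TWICE_SAFE(x) do { (x)++; (x)++; } while (0)
```

With `if (0) BUMP_TWICE_UNSAFE(a);` the second increment still runs unconditionally; the `do { } while (0)` form does not have that problem, which is exactly what the checkpatch rule enforces.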
Re: [PATCH] qemu/atomic.h: prefix qemu_ to solve collisions
On Mon, Sep 21, 2020 at 04:29:10PM -0500, Eric Blake wrote: > On 9/21/20 11:23 AM, Stefan Hajnoczi wrote: Thanks for the review! Your feedback prompted me to do this more systematically. I fixed the command-lines and published a diff of just the manual changes I made on top of the mechanical changes (see v2). > > clang's C11 atomic_fetch_*() functions only take a C11 atomic type > > pointer argument. QEMU uses direct types (int, etc) and this causes a > > compiler error when a QEMU code calls these functions in a source file > > that also included via a system header file: > > > >$ CC=clang CXX=clang++ ./configure ... && make > >../util/async.c:79:17: error: address argument to atomic operation must > > be a pointer to _Atomic type ('unsigned int *' invalid) > > > > Avoid using atomic_*() names in QEMU's atomic.h since that namespace is > > used by . Prefix QEMU's APIs with qemu_ so that atomic.h > > and can co-exist. > > > > This patch was generated using: > > > >$ git diff | grep -o '\ > >/tmp/changed_identifiers > > Missing a step in the recipe: namely, you probably modified > include/qemu/atomic*.h prior to running 'git diff' (so that you actually had > input to feed to grep -o). But spelling it 'git diff HEAD^ > include/qemu/atomic*.h | ...' does indeed give me a sane list of identifiers > that looks like what you touched in the rest of the patch. Yes, I edited the file first and then used this command-line. The one you posted it better :). > > >$ for identifier in $( > Also not quite the right recipe, based on the file name used in the line > above. Yes, "64" is when I realized the original grep expression hadn't matched the atomic64 APIs. These commands only show the gist of it. It involved a few manual steps. 
> > > sed -i "s%\<$identifier\>%qemu_$identifier%" $(git grep -l > > "\<$identifier\>") \ > > done > > > > Fortunately, running "git grep -c '\ state of the tree gives me a list that is somewhat close to yours, where the > obvious difference in line counts is explained by: > > > I manually fixed line-wrap issues and misaligned rST tables. > > > > Signed-off-by: Stefan Hajnoczi > > --- > > First, focusing on the change summary: > > > docs/devel/lockcnt.txt| 14 +- > > docs/devel/rcu.txt| 40 +-- > > accel/tcg/atomic_template.h | 20 +- > > include/block/aio-wait.h | 4 +- > > include/block/aio.h | 8 +- > > include/exec/cpu_ldst.h | 2 +- > > include/exec/exec-all.h | 6 +- > > include/exec/log.h| 6 +- > > include/exec/memory.h | 2 +- > > include/exec/ram_addr.h | 27 +- > > include/exec/ramlist.h| 2 +- > > include/exec/tb-lookup.h | 4 +- > > include/hw/core/cpu.h | 2 +- > > include/qemu/atomic.h | 258 +++--- > > include/qemu/atomic128.h | 6 +- > > These two are the most important for the sake of this patch; perhaps it's > worth a temporary override of your git orderfile if you have to respin, to > list them first? Will do in v2. > > > include/qemu/bitops.h | 2 +- > > include/qemu/coroutine.h | 2 +- > > include/qemu/log.h| 6 +- > > include/qemu/queue.h | 8 +- > > include/qemu/rcu.h| 10 +- > > include/qemu/rcu_queue.h | 103 +++--- > > Presumably, this and any other file with an odd number of changes was due to > a difference in lines after reformatting long lines. Yes, line-wrapping required many changes in this file. > > > include/qemu/seqlock.h| 8 +- > ... > > > util/stats64.c| 34 +- > > docs/devel/atomics.rst| 326 +- > > .../opensbi-riscv32-generic-fw_dynamic.elf| Bin 558668 -> 558698 bytes > > .../opensbi-riscv64-generic-fw_dynamic.elf| Bin 620424 -> 620454 bytes > > Why are we regenerating .elf files in this patch? Is your change even > correct for those two files? Thanks for noticing this! The git-grep(1) man page docu
Re: [PATCH] qemu/atomic.h: prefix qemu_ to solve collisions
On Tue, Sep 22, 2020 at 09:18:49AM +0100, Daniel P. Berrangé wrote:
> On Tue, Sep 22, 2020 at 08:56:06AM +0200, Paolo Bonzini wrote:
> > On 22/09/20 08:45, David Hildenbrand wrote:
> > >> It's certainly a good idea but it's quite verbose.
> > >>
> > >> What about using atomic__* as the prefix? It is not very common in QEMU
> > >> but there are some cases (and I cannot think of anything better).
> > >
> > > aqomic_*, lol :)
> >
> > Actually qatomic_ would be a good one, wouldn't it?
>
> Yes, I think just adding a 'q' on the front of methods is more than
> sufficient (see also all the qcrypto_*, qio_* APIs I wrote). The
> only thing a plain 'q' prefix is likely to clash with is the Qt
> library and that isn't something we're likely to link with (famous
> last words...).

This is why I didn't use "qatomic". "atomic" feels too common to prefix with
just a single letter. But I grepped /usr/include and searched code on GitHub,
and I can't find any uses of "qatomic_", so it looks safe. FWIW Qt does have
qatomic.h but doesn't use the name for identifiers in the code.

Let's do it!

Stefan
Re: [PATCH v2] qemu/atomic.h: prefix qemu_ to solve collisions
On Tue, Sep 22, 2020 at 01:35:37PM +0200, Paolo Bonzini wrote:
> On 22/09/20 10:58, Stefan Hajnoczi wrote:
> I think the reviews crossed, are you going to respin using a qatomic_
> prefix?

Yes, let's do qatomic_. I'll send a v3.

Stefan
Re: [PATCH v3] qemu/atomic.h: rename atomic_ to qatomic_
On Wed, Sep 23, 2020 at 11:56:46AM +0100, Stefan Hajnoczi wrote: > clang's C11 atomic_fetch_*() functions only take a C11 atomic type > pointer argument. QEMU uses direct types (int, etc) and this causes a > compiler error when a QEMU code calls these functions in a source file > that also included via a system header file: > > $ CC=clang CXX=clang++ ./configure ... && make > ../util/async.c:79:17: error: address argument to atomic operation must be > a pointer to _Atomic type ('unsigned int *' invalid) > > Avoid using atomic_*() names in QEMU's atomic.h since that namespace is > used by . Prefix QEMU's APIs with 'q' so that atomic.h > and can co-exist. I checked /usr/include on my machine and > searched GitHub for existing "qatomic_" users but there seem to be none. > > This patch was generated using: > > $ git grep -h -o '\ sort -u >/tmp/changed_identifiers > $ for identifier in $( sed -i "s%\<$identifier\>%q$identifier%g" \ > $(git grep -I -l "\<$identifier\>") > done > > I manually fixed line-wrap issues and misaligned rST tables. > > Signed-off-by: Stefan Hajnoczi > --- > v3: > * Use qatomic_ instead of atomic_ [Paolo] > * The diff of my manual fixups is available here: >https://vmsplice.net/~stefan/atomic-namespace-pre-fixups-v3.diff >- Dropping #ifndef qatomic_fetch_add in atomic.h >- atomic_##X(haddr, val) glue macros not caught by grep >- Keep atomic_add-bench name >- C preprocessor backslash-newline ('\') column alignment >- Line wrapping Thanks, applied quickly due to high risk of conflicts: https://github.com/stefanha/qemu/commits/block Stefan signature.asc Description: PGP signature
Re: [PATCH 23/24] virtio-blk: remove a spurious call to revalidate_disk_size
On Fri, Nov 06, 2020 at 08:03:35PM +0100, Christoph Hellwig wrote:
> revalidate_disk_size just updates the block device size from the disk
> size. Thus calling it from revalidate_disk_size doesn't actually do
> anything.
>
> Signed-off-by: Christoph Hellwig
> ---
> drivers/block/virtio_blk.c | 1 -
> 1 file changed, 1 deletion(-)

Modulo Paolo's comment:

Acked-by: Stefan Hajnoczi
Re: [PATCH v4 3/3] hw: replace most qemu_bh_new calls with qemu_bh_new_guarded
On Thu, Jan 19, 2023 at 02:03:08AM -0500, Alexander Bulekov wrote: > This protects devices from bh->mmio reentrancy issues. > > Signed-off-by: Alexander Bulekov > --- > hw/9pfs/xen-9p-backend.c| 4 +++- > hw/block/dataplane/virtio-blk.c | 3 ++- > hw/block/dataplane/xen-block.c | 5 +++-- > hw/block/virtio-blk.c | 5 +++-- > hw/char/virtio-serial-bus.c | 3 ++- > hw/display/qxl.c| 9 ++--- > hw/display/virtio-gpu.c | 6 -- > hw/ide/ahci.c | 3 ++- > hw/ide/core.c | 3 ++- > hw/misc/imx_rngc.c | 6 -- > hw/misc/macio/mac_dbdma.c | 2 +- > hw/net/virtio-net.c | 3 ++- > hw/nvme/ctrl.c | 6 -- > hw/scsi/mptsas.c| 3 ++- > hw/scsi/scsi-bus.c | 3 ++- > hw/scsi/vmw_pvscsi.c| 3 ++- > hw/usb/dev-uas.c| 3 ++- > hw/usb/hcd-dwc2.c | 3 ++- > hw/usb/hcd-ehci.c | 3 ++- > hw/usb/hcd-uhci.c | 2 +- > hw/usb/host-libusb.c| 6 -- > hw/usb/redirect.c | 6 -- > hw/usb/xen-usb.c| 3 ++- > hw/virtio/virtio-balloon.c | 5 +++-- > hw/virtio/virtio-crypto.c | 3 ++- > 25 files changed, 66 insertions(+), 35 deletions(-) Should scripts/checkpatch.pl complain when qemu_bh_new() or aio_bh_new() are called from hw/? Adding a check is important so new instances cannot be added accidentally in the future. Stefan signature.asc Description: PGP signature
Re: [PATCH v4 3/3] hw: replace most qemu_bh_new calls with qemu_bh_new_guarded
On Thu, Jan 19, 2023 at 02:03:08AM -0500, Alexander Bulekov wrote: > This protects devices from bh->mmio reentrancy issues. > > Signed-off-by: Alexander Bulekov > --- > hw/9pfs/xen-9p-backend.c| 4 +++- > hw/block/dataplane/virtio-blk.c | 3 ++- > hw/block/dataplane/xen-block.c | 5 +++-- > hw/block/virtio-blk.c | 5 +++-- > hw/char/virtio-serial-bus.c | 3 ++- > hw/display/qxl.c| 9 ++--- > hw/display/virtio-gpu.c | 6 -- > hw/ide/ahci.c | 3 ++- > hw/ide/core.c | 3 ++- > hw/misc/imx_rngc.c | 6 -- > hw/misc/macio/mac_dbdma.c | 2 +- > hw/net/virtio-net.c | 3 ++- > hw/nvme/ctrl.c | 6 -- > hw/scsi/mptsas.c| 3 ++- > hw/scsi/scsi-bus.c | 3 ++- > hw/scsi/vmw_pvscsi.c| 3 ++- > hw/usb/dev-uas.c| 3 ++- > hw/usb/hcd-dwc2.c | 3 ++- > hw/usb/hcd-ehci.c | 3 ++- > hw/usb/hcd-uhci.c | 2 +- > hw/usb/host-libusb.c| 6 -- > hw/usb/redirect.c | 6 -- > hw/usb/xen-usb.c| 3 ++- > hw/virtio/virtio-balloon.c | 5 +++-- > hw/virtio/virtio-crypto.c | 3 ++- > 25 files changed, 66 insertions(+), 35 deletions(-) Reviewed-by: Stefan Hajnoczi signature.asc Description: PGP signature
Re: [virtio-dev] [RFC QEMU] docs: vhost-user: Add custom memory mapping support
Resend - for some reason my email didn't make it out. - Forwarded message from Stefan Hajnoczi - Date: Tue, 21 Feb 2023 10:17:01 -0500 From: Stefan Hajnoczi To: Viresh Kumar Cc: qemu-de...@nongnu.org, virtio-...@lists.oasis-open.org, "Michael S. Tsirkin" , Vincent Guittot , Alex Bennée , stratos-...@op-lists.linaro.org, Oleksandr Tyshchenko , xen-de...@lists.xen.org, Andrew Cooper , Juergen Gross , Sebastien Boeuf , Liu Jiang , Mathieu Poirier Subject: Re: [virtio-dev] [RFC QEMU] docs: vhost-user: Add custom memory mapping support Message-ID: On Tue, Feb 21, 2023 at 03:20:41PM +0530, Viresh Kumar wrote: > The current model of memory mapping at the back-end works fine with > Qemu, where a standard call to mmap() for the respective file > descriptor, passed from front-end, is generally all we need to do before > the front-end can start accessing the guest memory. > > There are other complex cases though, where we need more information at > the backend and need to do more than just an mmap() call. For example, > Xen, a type-1 hypervisor, currently supports memory mapping via two > different methods, foreign-mapping (via /dev/privcmd) and grant-dev (via > /dev/gntdev). In both these cases, the back-end needs to call mmap() > followed by an ioctl() (or vice-versa), and need to pass extra > information via the ioctl(), like the Xen domain-id of the guest whose > memory we are trying to map. > > Add a new protocol feature, 'VHOST_USER_PROTOCOL_F_CUSTOM_MMAP', which > lets the back-end know about the additional memory mapping requirements. > When this feature is negotiated, the front-end can send the > 'VHOST_USER_CUSTOM_MMAP' message type to provide the additional > information to the back-end. > > Signed-off-by: Viresh Kumar > --- > docs/interop/vhost-user.rst | 32 > 1 file changed, 32 insertions(+) The alternative to an in-band approach is to configure these details out-of-band. 
For example, via command-line options to the vhost-user back-end: $ my-xen-device --mapping-type=foreign-mapping --domain-id=123 I was thinking about both approaches and don't see an obvious reason to choose one or the other. What do you think? > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst > index 3f18ab424eb0..f2b1d705593a 100644 > --- a/docs/interop/vhost-user.rst > +++ b/docs/interop/vhost-user.rst > @@ -258,6 +258,23 @@ Inflight description > > :queue size: a 16-bit size of virtqueues > > +Custom mmap description > +^^^ > + > ++---+---+ > +| flags | value | > ++---+---+ > + > +:flags: 64-bit bit field > + > +- Bit 0 is Xen foreign memory access flag - needs Xen foreign memory mapping. > +- Bit 1 is Xen grant memory access flag - needs Xen grant memory mapping. > + > +:value: a 64-bit hypervisor specific value. > + > +- For Xen foreign or grant memory access, this is set with guest's xen domain > + id. This is highly Xen-specific. How about naming the feature XEN_MMAP instead of CUSTOM_MMAP? If someone needs to add other mmap data later, they should define their own struct instead of trying to squeeze into the same fields as Xen. There is an assumption in this design that a single VHOST_USER_CUSTOM_MMAP message provides the information necessary for all mmaps. Are you sure the limitation that every mmap belongs to the same domain will be workable in the future? > + > C structure > --- > > @@ -867,6 +884,7 @@ Protocol features >#define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14 >#define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS 15 >#define VHOST_USER_PROTOCOL_F_STATUS 16 > + #define VHOST_USER_PROTOCOL_F_CUSTOM_MMAP 17 > > Front-end message types > --- > @@ -1422,6 +1440,20 @@ Front-end message types >query the back-end for its device status as defined in the Virtio >specification. > > +``VHOST_USER_CUSTOM_MMAP`` Most vhost-user protocol messages have a verb like get/set/close/add/listen/etc. 
I suggest renaming this to VHOST_USER_SET_XEN_MMAP_INFO. > + :id: 41 > + :equivalent ioctl: N/A > + :request payload: Custom mmap description > + :reply payload: N/A > + > + When the ``VHOST_USER_PROTOCOL_F_CUSTOM_MMAP`` protocol feature has been > + successfully negotiated, this message is submitted by the front-end to > + notify the back-end of the special memory mapping requirements, that the > + back-end needs to take care of, while mapping any memory regions sent > + over by the front-end. The front-end must send this message before > + any memory-regions are sent to the back-end via > ``VHOST_USER_SET_MEM_TABLE`` >
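The payload described in the RFC is just two 64-bit words, which makes the message easy to sketch as a C struct. Note the struct and helper names below are invented for illustration — the RFC text defines only the wire layout (a flags bit field and a hypervisor-specific value):

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits from the proposed "Custom mmap description". */
#define VHOST_USER_MMAP_FLAG_XEN_FOREIGN (1ULL << 0)
#define VHOST_USER_MMAP_FLAG_XEN_GRANT   (1ULL << 1)

/* Wire layout from the RFC: a 64-bit flags field followed by a 64-bit
 * hypervisor-specific value (the guest's Xen domain id for foreign or
 * grant mappings). */
typedef struct {
    uint64_t flags;
    uint64_t value;
} VhostUserCustomMmap;

/* Back-end side: pick the mapping mechanism advertised by the
 * front-end.  Returns 0 for a plain mmap(), 1 for Xen foreign
 * mapping, 2 for Xen grant mapping. */
static int mmap_mechanism(const VhostUserCustomMmap *msg)
{
    if (msg->flags & VHOST_USER_MMAP_FLAG_XEN_GRANT) {
        return 2;
    }
    if (msg->flags & VHOST_USER_MMAP_FLAG_XEN_FOREIGN) {
        return 1;
    }
    return 0;
}
```

Sketching it this way also makes Stefan's typing concern concrete: a future mechanism with different parameters would have to overload `value`, whereas a separate message could carry its own struct.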
Re: [virtio-dev] [RFC QEMU] docs: vhost-user: Add custom memory mapping support
On Wed, Mar 01, 2023 at 04:31:41PM +, Alex Bennée wrote: > > Stefan Hajnoczi writes: > > > [[PGP Signed Part:Undecided]] > > Resend - for some reason my email didn't make it out. > > > > From: Stefan Hajnoczi > > Subject: Re: [virtio-dev] [RFC QEMU] docs: vhost-user: Add custom memory > > mapping support > > To: Viresh Kumar > > Cc: qemu-de...@nongnu.org, virtio-...@lists.oasis-open.org, "Michael S. > > Tsirkin" , Vincent Guittot , > > Alex Bennée , > > stratos-...@op-lists.linaro.org, Oleksandr Tyshchenko > > , xen-de...@lists.xen.org, Andrew Cooper > > , Juergen Gross , Sebastien > > Boeuf > > , Liu Jiang , > > Mathieu > > Poirier > > Date: Tue, 21 Feb 2023 10:17:01 -0500 (1 week, 1 day, 1 hour ago) > > Flags: seen, signed, personal > > > > On Tue, Feb 21, 2023 at 03:20:41PM +0530, Viresh Kumar wrote: > >> The current model of memory mapping at the back-end works fine with > >> Qemu, where a standard call to mmap() for the respective file > >> descriptor, passed from front-end, is generally all we need to do before > >> the front-end can start accessing the guest memory. > >> > >> There are other complex cases though, where we need more information at > >> the backend and need to do more than just an mmap() call. For example, > >> Xen, a type-1 hypervisor, currently supports memory mapping via two > >> different methods, foreign-mapping (via /dev/privcmd) and grant-dev (via > >> /dev/gntdev). In both these cases, the back-end needs to call mmap() > >> followed by an ioctl() (or vice-versa), and need to pass extra > >> information via the ioctl(), like the Xen domain-id of the guest whose > >> memory we are trying to map. > >> > >> Add a new protocol feature, 'VHOST_USER_PROTOCOL_F_CUSTOM_MMAP', which > >> lets the back-end know about the additional memory mapping requirements. > >> When this feature is negotiated, the front-end can send the > >> 'VHOST_USER_CUSTOM_MMAP' message type to provide the additional > >> information to the back-end. 
> >> > >> Signed-off-by: Viresh Kumar > >> --- > >> docs/interop/vhost-user.rst | 32 > >> 1 file changed, 32 insertions(+) > > > > The alternative to an in-band approach is to configure these details > > out-of-band. For example, via command-line options to the vhost-user > > back-end: > > > > $ my-xen-device --mapping-type=foreign-mapping --domain-id=123 > > > > I was thinking about both approaches and don't see an obvious reason to > > choose one or the other. What do you think? > > In-band has the nice property of being dynamic and not having to have > some other thing construct command lines. We are also trying to keep the > daemons from being Xen specific and keep the type of mmap as an > implementation detail that is mostly elided by the rust-vmm memory > traits. Okay. > > > >> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst > >> index 3f18ab424eb0..f2b1d705593a 100644 > >> --- a/docs/interop/vhost-user.rst > >> +++ b/docs/interop/vhost-user.rst > >> @@ -258,6 +258,23 @@ Inflight description > >> > >> :queue size: a 16-bit size of virtqueues > >> > >> +Custom mmap description > >> +^^^ > >> + > >> ++---+---+ > >> +| flags | value | > >> ++---+---+ > >> + > >> +:flags: 64-bit bit field > >> + > >> +- Bit 0 is Xen foreign memory access flag - needs Xen foreign memory > >> mapping. > >> +- Bit 1 is Xen grant memory access flag - needs Xen grant memory mapping. > >> + > >> +:value: a 64-bit hypervisor specific value. > >> + > >> +- For Xen foreign or grant memory access, this is set with guest's xen > >> domain > >> + id. > > > > This is highly Xen-specific. How about naming the feature XEN_MMAP > > instead of CUSTOM_MMAP? If someone needs to add other mmap data later, > > they should define their own struct instead of trying to squeeze into > > the same fields as Xen. 
> We hope to support additional mmap mechanisms in the future - one > proposal is to use the hypervisor specific interface to support an > ioctl() that creates a domain specific device which can then be treated > more generically. > > Could we not declare the message as flag + n bytes of domain specific > message?
Re: [virtio-dev] [RFC QEMU] docs: vhost-user: Add custom memory mapping support
On Thu, Mar 02, 2023 at 01:49:07PM +0530, Viresh Kumar wrote: > On 01-03-23, 12:29, Stefan Hajnoczi wrote: > > What is the advantage over defining separate messages? Separate messages > > are cleaner and more typesafe. > > I thought we wanted to keep single message for one kind of functionality, > which > is mmap related quirks here. And so it would be better if we can reuse the > same > for next hypervisor which may need this. > > The value parameter is not fixed and is hypervisor specific, for Xen this is > the > domain id, for others it may mean something else. mmap-related quirks have no parameters or behavior in common so there's no advantage in sharing a single vhost-user protocol message. Sharing the same message just makes it awkward to build and parse the message. > > I don't have a concrete example, but was thinking of a guest that shares > > memory with other guests (like the experimental virtio-vhost-user > > device). Maybe there would be a scenario where some memory belongs to > > one domain and some belongs to another (but has been mapped into the > > first domain), and the vhost-user back-end needs to access both. > > These look tricky (and real) and I am not sure how we would want to handle > these. Maybe wait until we have a real use-case ? A way to deal with that is to include mmap information every time fds are passed with a message instead of sending one global message at the start of the vhost-user connection. This would allow each mmap to associate extra information instead of forcing them all to use the same information. > > The other thing that comes to mind is that the spec must clearly state > > which mmaps are affected by the Xen domain information. For example, > > just mem table memory regions and not the > > VHOST_USER_PROTOCOL_F_LOG_SHMFD feature? > > Maybe we can mention that only the mmap's performed via /dev/xen/privcmd and > /dev/xen/gntdev files are affected by this ? 
No, this doesn't explain when mmap must be performed via /dev/xen/privcmd and /dev/xen/gntdev. The spec should be explicit about this instead of assuming that the device implementer already knows this. Stefan
Re: [PATCH V2] docs: vhost-user: Add Xen specific memory mapping support
On Mon, Mar 06, 2023 at 04:40:24PM +0530, Viresh Kumar wrote: > The current model of memory mapping at the back-end works fine where a > standard call to mmap() (for the respective file descriptor) is enough > before the front-end can start accessing the guest memory. > > There are other complex cases though where the back-end needs more > information and simple mmap() isn't enough. For example Xen, a type-1 > hypervisor, currently supports memory mapping via two different methods, > foreign-mapping (via /dev/privcmd) and grant-dev (via /dev/gntdev). In > both these cases, the back-end needs to call mmap() and ioctl(), and > need to pass extra information via the ioctl(), like the Xen domain-id > of the guest whose memory we are trying to map. > > Add a new protocol feature, 'VHOST_USER_PROTOCOL_F_XEN_MMAP', which lets > the back-end know about the additional memory mapping requirements. > When this feature is negotiated, the front-end can send the > 'VHOST_USER_SET_XEN_MMAP' message type to provide the additional > information to the back-end. > > Signed-off-by: Viresh Kumar > --- > V1->V2: > - Make the custom mmap feature Xen specific, instead of being generic. > - Clearly define which memory regions are impacted by this change. > - Allow VHOST_USER_SET_XEN_MMAP to be called multiple times. > - Additional Bit(2) property in flags. > > docs/interop/vhost-user.rst | 36 > 1 file changed, 36 insertions(+) > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst > index 3f18ab424eb0..8be5f5eae941 100644 > --- a/docs/interop/vhost-user.rst > +++ b/docs/interop/vhost-user.rst > @@ -258,6 +258,24 @@ Inflight description > > :queue size: a 16-bit size of virtqueues > > +Xen mmap description > + > + > ++---+---+ > +| flags | domid | > ++---+---+ > + > +:flags: 64-bit bit field > + > +- Bit 0 is set for Xen foreign memory memory mapping. > +- Bit 1 is set for Xen grant memory memory mapping. 
> +- Bit 2 is set if the back-end can directly map additional memory (like > + descriptor buffers or indirect descriptors, which aren't part of already > + shared memory regions) without the need of front-end sending an additional > + memory region first. I don't understand what Bit 2 does. Can you rephrase this? It's unclear to me how additional memory can be mapped without a memory region (especially the fd) is sent? > + > +:domid: a 64-bit Xen hypervisor specific domain id. > + > C structure > --- > > @@ -867,6 +885,7 @@ Protocol features >#define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14 >#define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS 15 >#define VHOST_USER_PROTOCOL_F_STATUS 16 > + #define VHOST_USER_PROTOCOL_F_XEN_MMAP 17 > > Front-end message types > --- > @@ -1422,6 +1441,23 @@ Front-end message types >query the back-end for its device status as defined in the Virtio >specification. > > +``VHOST_USER_SET_XEN_MMAP`` > + :id: 41 > + :equivalent ioctl: N/A > + :request payload: Xen mmap description > + :reply payload: N/A > + > + When the ``VHOST_USER_PROTOCOL_F_XEN_MMAP`` protocol feature has been > + successfully negotiated, this message is submitted by the front-end to set > the > + Xen hypervisor specific memory mapping configurations at the back-end. > These > + configurations should be used to mmap memory regions, virtqueues, > descriptors > + and descriptor buffers. The front-end must send this message before any > + memory-regions are sent to the back-end via ``VHOST_USER_SET_MEM_TABLE`` or > + ``VHOST_USER_ADD_MEM_REG`` message types. The front-end can send this > message > + multiple times, if different mmap configurations are required for different > + memory regions, where the most recent ``VHOST_USER_SET_XEN_MMAP`` must be > used > + by the back-end to map any newly shared memory regions. This message modifies the behavior of subsequent VHOST_USER_SET_MEM_TABLE and VHOST_USER_ADD_MEM_REG messages. 
The memory region structs can be extended and then VHOST_USER_SET_XEN_MMAP isn't needed. In other words:

When VHOST_USER_PROTOCOL_F_XEN_MMAP is negotiated, each "Memory regions description" and "Single memory region description" has the following additional fields appended:

+----------------+-------+
| xen_mmap_flags | domid |
+----------------+-------+

:xen_mmap_flags: 64-bit bit field

:domid: a 64-bit Xen hypervisor specific domain id.

Stefan
Re: [PATCH V2] docs: vhost-user: Add Xen specific memory mapping support
On Tue, Mar 07, 2023 at 11:13:36AM +0530, Viresh Kumar wrote: > On 06-03-23, 10:34, Stefan Hajnoczi wrote: > > On Mon, Mar 06, 2023 at 04:40:24PM +0530, Viresh Kumar wrote: > > > +Xen mmap description > > > + > > > + > > > ++---+---+ > > > +| flags | domid | > > > ++---+---+ > > > + > > > +:flags: 64-bit bit field > > > + > > > +- Bit 0 is set for Xen foreign memory memory mapping. > > > +- Bit 1 is set for Xen grant memory memory mapping. > > > +- Bit 2 is set if the back-end can directly map additional memory (like > > > + descriptor buffers or indirect descriptors, which aren't part of > > > already > > > + shared memory regions) without the need of front-end sending an > > > additional > > > + memory region first. > > > > I don't understand what Bit 2 does. Can you rephrase this? It's unclear > > to me how additional memory can be mapped without a memory region > > (especially the fd) is sent? > > I (somehow) assumed we will be able to use the same file descriptor > that was shared for the virtqueues memory regions and yes I can see > now why it wouldn't work or create problems. > > And I need suggestion now on how to make this work. > > With Xen grants, the front end receives grant address from the from > guest kernel, they aren't physical addresses, kind of IOMMU stuff. > > The back-end gets access for memory regions of the virtqueues alone > initially. When the back-end gets a request, it reads the descriptor > and finds the buffer address, which isn't part of already shared > regions. The same happens for descriptor addresses in case indirect > descriptor feature is negotiated. > > At this point I was thinking maybe the back-end can simply call the > mmap/ioctl to map the memory, using the file descriptor used for the > virtqueues. > > How else can we make this work ? We also need to unmap/remove the > memory region, as soon as the buffer is processed as the grant address > won't be relevant for any subsequent request. 
> > Should I use VHOST_USER_IOTLB_MSG for this ? I did look at it and I > wasn't convinced if it was an exact fit. For example it says that a > memory address reported with miss/access fail should be part of an > already sent memory region, which isn't the case here. VHOST_USER_IOTLB_MSG probably isn't necessary because address translation is not required. It will also reduce performance by adding extra communication. Instead, you could change the 1 memory region : 1 mmap relationship that existing non-Xen vhost-user back-end implementations have. In Xen vhost-user back-ends, the memory region details (including the file descriptor and Xen domain id) would be stashed away in back-end when the front-end adds memory regions. No mmap would be performed upon VHOST_USER_ADD_MEM_REG or VHOST_USER_SET_MEM_TABLE. Whenever the back-end needs to do DMA, it looks up the memory region and performs the mmap + Xen-specific calls: - A long-lived mmap of the vring is set up when VHOST_USER_SET_VRING_ENABLE is received. - Short-lived mmaps of the indirect descriptors and memory pointed to by the descriptors is set up by the virtqueue processing code. Does this sound workable to you? Stefan signature.asc Description: PGP signature
Re: [PATCH V3 1/2] docs: vhost-user: Define memory region separately
On Thu, Mar 09, 2023 at 02:21:00PM +0530, Viresh Kumar wrote: > The same layout is defined twice, once in "single memory region > description" and then in "memory regions description". > > Separate out details of memory region from these two and reuse the same > definition later on. > > While at it, also rename "memory regions description" to "multiple > memory regions description", to avoid potential confusion around similar > names. And define single region before multiple ones. > > This is just a documentation optimization, the protocol remains the same. > > Signed-off-by: Viresh Kumar > --- > docs/interop/vhost-user.rst | 39 +-------- > 1 file changed, 18 insertions(+), 21 deletions(-) Reviewed-by: Stefan Hajnoczi
Re: [PATCH V3 2/2] docs: vhost-user: Add Xen specific memory mapping support
On Thu, Mar 09, 2023 at 02:21:01PM +0530, Viresh Kumar wrote: > The current model of memory mapping at the back-end works fine where a > standard call to mmap() (for the respective file descriptor) is enough > before the front-end can start accessing the guest memory. > > There are other complex cases though where the back-end needs more > information and simple mmap() isn't enough. For example Xen, a type-1 > hypervisor, currently supports memory mapping via two different methods, > foreign-mapping (via /dev/privcmd) and grant-dev (via /dev/gntdev). In > both these cases, the back-end needs to call mmap() and ioctl(), with > extra information like the Xen domain-id of the guest whose memory we > are trying to map. > > Add a new protocol feature, 'VHOST_USER_PROTOCOL_F_XEN_MMAP', which lets > the back-end know about the additional memory mapping requirements. > When this feature is negotiated, the front-end will send the additional > information within the memory regions themselves. > > Signed-off-by: Viresh Kumar > --- > docs/interop/vhost-user.rst | 21 +++++ > 1 file changed, 21 insertions(+) Reviewed-by: Stefan Hajnoczi
Re: [PATCH V3 0/2] qemu: vhost-user: Support Xen memory mapping quirks
On Thu, Mar 09, 2023 at 02:20:59PM +0530, Viresh Kumar wrote: > Hello, > > This patchset tries to update the vhost-user protocol to make it support > special > memory mapping required in case of Xen hypervisor. > > The first patch is mostly cleanup and second one introduces a new xen specific > feature. > > V2->V3: > - Remove the extra message and instead update the memory regions to carry > additional data. > > - Drop the one region one mmap relationship and allow back-end to map only > parts > of a region at once, required for Xen grant mappings. > > - Additional cleanup patch 1/2. > > V1->V2: > - Make the custom mmap feature Xen specific, instead of being generic. > - Clearly define which memory regions are impacted by this change. > - Allow VHOST_USER_SET_XEN_MMAP to be called multiple times. > - Additional Bit(2) property in flags. Looks good, thanks! Michael is the maintainer and this patch series will go through his tree. Stefan
[PATCH 05/13] block/export: wait for vhost-user-blk requests when draining
Each vhost-user-blk request runs in a coroutine. When the BlockBackend enters a drained section we need to enter a quiescent state. Currently any in-flight requests race with bdrv_drained_begin() because it is unaware of vhost-user-blk requests. When blk_co_preadv/pwritev()/etc returns it wakes the bdrv_drained_begin() thread but vhost-user-blk request processing has not yet finished. The request coroutine continues executing while the main loop thread thinks it is in a drained section. One example where this is unsafe is for blk_set_aio_context() where bdrv_drained_begin() is called before .aio_context_detached() and .aio_context_attach(). If request coroutines are still running after bdrv_drained_begin(), then the AioContext could change underneath them and they race with new requests processed in the new AioContext. This could lead to virtqueue corruption, for example. (This example is theoretical, I came across this while reading the code and have not tried to reproduce it.) It's easy to make bdrv_drained_begin() wait for in-flight requests: add a .drained_poll() callback that checks the VuServer's in-flight counter. VuServer just needs an API that returns true when there are requests in flight. The in-flight counter needs to be atomic. 
Signed-off-by: Stefan Hajnoczi --- include/qemu/vhost-user-server.h | 4 +++- block/export/vhost-user-blk-server.c | 19 +++ util/vhost-user-server.c | 14 ++ 3 files changed, 32 insertions(+), 5 deletions(-) diff --git a/include/qemu/vhost-user-server.h b/include/qemu/vhost-user-server.h index bc0ac9ddb6..b1c1cda886 100644 --- a/include/qemu/vhost-user-server.h +++ b/include/qemu/vhost-user-server.h @@ -40,8 +40,9 @@ typedef struct { int max_queues; const VuDevIface *vu_iface; +unsigned int in_flight; /* atomic */ + /* Protected by ctx lock */ -unsigned int in_flight; bool wait_idle; VuDev vu_dev; QIOChannel *ioc; /* The I/O channel with the client */ @@ -62,6 +63,7 @@ void vhost_user_server_stop(VuServer *server); void vhost_user_server_inc_in_flight(VuServer *server); void vhost_user_server_dec_in_flight(VuServer *server); +bool vhost_user_server_has_in_flight(VuServer *server); void vhost_user_server_attach_aio_context(VuServer *server, AioContext *ctx); void vhost_user_server_detach_aio_context(VuServer *server); diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c index e93f2ed6b4..dbf5207162 100644 --- a/block/export/vhost-user-blk-server.c +++ b/block/export/vhost-user-blk-server.c @@ -254,6 +254,22 @@ static void vu_blk_exp_request_shutdown(BlockExport *exp) vhost_user_server_stop(&vexp->vu_server); } +/* + * Ensures that bdrv_drained_begin() waits until in-flight requests complete. + * + * Called with vexp->export.ctx acquired. 
+ */ +static bool vu_blk_drained_poll(void *opaque) +{ +VuBlkExport *vexp = opaque; + +return vhost_user_server_has_in_flight(&vexp->vu_server); +} + +static const BlockDevOps vu_blk_dev_ops = { +.drained_poll = vu_blk_drained_poll, +}; + static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts, Error **errp) { @@ -292,6 +308,7 @@ static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts, vu_blk_initialize_config(blk_bs(exp->blk), &vexp->blkcfg, logical_block_size, num_queues); +blk_set_dev_ops(exp->blk, &vu_blk_dev_ops, vexp); blk_add_aio_context_notifier(exp->blk, blk_aio_attached, blk_aio_detach, vexp); @@ -299,6 +316,7 @@ static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts, num_queues, &vu_blk_iface, errp)) { blk_remove_aio_context_notifier(exp->blk, blk_aio_attached, blk_aio_detach, vexp); +blk_set_dev_ops(exp->blk, NULL, NULL); g_free(vexp->handler.serial); return -EADDRNOTAVAIL; } @@ -312,6 +330,7 @@ static void vu_blk_exp_delete(BlockExport *exp) blk_remove_aio_context_notifier(exp->blk, blk_aio_attached, blk_aio_detach, vexp); +blk_set_dev_ops(exp->blk, NULL, NULL); g_free(vexp->handler.serial); } diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c index 1622f8cfb3..2e6b640050 100644 --- a/util/vhost-user-server.c +++ b/util/vhost-user-server.c @@ -78,17 +78,23 @@ static void panic_cb(VuDev *vu_dev, const char *buf) void vhost_user_server_inc_in_flight(VuServer *server) { assert(!server->wait_idle); -server->in_flight++; +qatomic_inc(&server->in_flight); } void vhost_user_server_dec_in_flight(VuServer *server) { -server->in_flight--; -if (server->wait_idle && !server->in_flight) { -aio_c
[PATCH 06/13] block/export: stop using is_external in vhost-user-blk server
vhost-user activity must be suspended during bdrv_drained_begin/end(). This prevents new requests from interfering with whatever is happening in the drained section. Previously this was done using aio_set_fd_handler()'s is_external argument. In a multi-queue block layer world the aio_disable_external() API cannot be used since multiple AioContext may be processing I/O, not just one. Switch to BlockDevOps->drained_begin/end() callbacks. Signed-off-by: Stefan Hajnoczi --- block/export/vhost-user-blk-server.c | 43 ++-- util/vhost-user-server.c | 10 +++ 2 files changed, 26 insertions(+), 27 deletions(-) diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c index dbf5207162..6e1bc196fb 100644 --- a/block/export/vhost-user-blk-server.c +++ b/block/export/vhost-user-blk-server.c @@ -207,22 +207,6 @@ static const VuDevIface vu_blk_iface = { .process_msg = vu_blk_process_msg, }; -static void blk_aio_attached(AioContext *ctx, void *opaque) -{ -VuBlkExport *vexp = opaque; - -vexp->export.ctx = ctx; -vhost_user_server_attach_aio_context(&vexp->vu_server, ctx); -} - -static void blk_aio_detach(void *opaque) -{ -VuBlkExport *vexp = opaque; - -vhost_user_server_detach_aio_context(&vexp->vu_server); -vexp->export.ctx = NULL; -} - static void vu_blk_initialize_config(BlockDriverState *bs, struct virtio_blk_config *config, @@ -254,6 +238,25 @@ static void vu_blk_exp_request_shutdown(BlockExport *exp) vhost_user_server_stop(&vexp->vu_server); } +/* Called with vexp->export.ctx acquired */ +static void vu_blk_drained_begin(void *opaque) +{ +VuBlkExport *vexp = opaque; + +vhost_user_server_detach_aio_context(&vexp->vu_server); +} + +/* Called with vexp->export.blk AioContext acquired */ +static void vu_blk_drained_end(void *opaque) +{ +VuBlkExport *vexp = opaque; + +/* Refresh AioContext in case it changed */ +vexp->export.ctx = blk_get_aio_context(vexp->export.blk); + +vhost_user_server_attach_aio_context(&vexp->vu_server, vexp->export.ctx); +} + /* 
* Ensures that bdrv_drained_begin() waits until in-flight requests complete. * @@ -267,6 +270,8 @@ static bool vu_blk_drained_poll(void *opaque) } static const BlockDevOps vu_blk_dev_ops = { +.drained_begin = vu_blk_drained_begin, +.drained_end = vu_blk_drained_end, .drained_poll = vu_blk_drained_poll, }; @@ -309,13 +314,9 @@ static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts, logical_block_size, num_queues); blk_set_dev_ops(exp->blk, &vu_blk_dev_ops, vexp); -blk_add_aio_context_notifier(exp->blk, blk_aio_attached, blk_aio_detach, - vexp); if (!vhost_user_server_start(&vexp->vu_server, vu_opts->addr, exp->ctx, num_queues, &vu_blk_iface, errp)) { -blk_remove_aio_context_notifier(exp->blk, blk_aio_attached, -blk_aio_detach, vexp); blk_set_dev_ops(exp->blk, NULL, NULL); g_free(vexp->handler.serial); return -EADDRNOTAVAIL; @@ -328,8 +329,6 @@ static void vu_blk_exp_delete(BlockExport *exp) { VuBlkExport *vexp = container_of(exp, VuBlkExport, export); -blk_remove_aio_context_notifier(exp->blk, blk_aio_attached, blk_aio_detach, -vexp); blk_set_dev_ops(exp->blk, NULL, NULL); g_free(vexp->handler.serial); } diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c index 2e6b640050..332aea9306 100644 --- a/util/vhost-user-server.c +++ b/util/vhost-user-server.c @@ -278,7 +278,7 @@ set_watch(VuDev *vu_dev, int fd, int vu_evt, vu_fd_watch->fd = fd; vu_fd_watch->cb = cb; qemu_socket_set_nonblock(fd); -aio_set_fd_handler(server->ioc->ctx, fd, true, kick_handler, +aio_set_fd_handler(server->ioc->ctx, fd, false, kick_handler, NULL, NULL, NULL, vu_fd_watch); vu_fd_watch->vu_dev = vu_dev; vu_fd_watch->pvt = pvt; @@ -299,7 +299,7 @@ static void remove_watch(VuDev *vu_dev, int fd) if (!vu_fd_watch) { return; } -aio_set_fd_handler(server->ioc->ctx, fd, true, +aio_set_fd_handler(server->ioc->ctx, fd, false, NULL, NULL, NULL, NULL, NULL); QTAILQ_REMOVE(&server->vu_fd_watches, vu_fd_watch, next); @@ -362,7 +362,7 @@ void vhost_user_server_stop(VuServer 
*server) VuFdWatch *vu_fd_watch; QTAILQ_FOREACH(vu_fd_watch, &server->vu_fd_watches, next) { -aio_set_fd_handler(server->ctx, vu_fd_watch->fd, true, +aio_set_fd_handler(server->ctx, vu_fd_watch->fd, false,
[PATCH 03/13] block/export: only acquire AioContext once for vhost_user_server_stop()
vhost_user_server_stop() uses AIO_WAIT_WHILE(). AIO_WAIT_WHILE() requires that AioContext is only acquired once. Since blk_exp_request_shutdown() already acquires the AioContext it shouldn't be acquired again in vhost_user_server_stop(). Signed-off-by: Stefan Hajnoczi --- util/vhost-user-server.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c index 40f36ea214..5b6216069c 100644 --- a/util/vhost-user-server.c +++ b/util/vhost-user-server.c @@ -346,10 +346,9 @@ static void vu_accept(QIONetListener *listener, QIOChannelSocket *sioc, aio_context_release(server->ctx); } +/* server->ctx acquired by caller */ void vhost_user_server_stop(VuServer *server) { -aio_context_acquire(server->ctx); - qemu_bh_delete(server->restart_listener_bh); server->restart_listener_bh = NULL; @@ -366,8 +365,6 @@ void vhost_user_server_stop(VuServer *server) AIO_WAIT_WHILE(server->ctx, server->co_trip); } -aio_context_release(server->ctx); - if (server->listener) { qio_net_listener_disconnect(server->listener); object_unref(OBJECT(server->listener)); -- 2.39.2
[PATCH 02/13] virtio-scsi: stop using aio_disable_external() during unplug
This patch is part of an effort to remove the aio_disable_external() API because it does not fit in a multi-queue block layer world where many AioContexts may be submitting requests to the same disk. The SCSI emulation code is already in good shape to stop using aio_disable_external(). It was only used by commit 9c5aad84da1c ("virtio-scsi: fixed virtio_scsi_ctx_check failed when detaching scsi disk") to ensure that virtio_scsi_hotunplug() works while the guest driver is submitting I/O. Ensure virtio_scsi_hotunplug() is safe as follows: 1. qdev_simple_device_unplug_cb() -> qdev_unrealize() -> device_set_realized() calls qatomic_set(&dev->realized, false) so that future scsi_device_get() calls return NULL because they exclude SCSIDevices with realized=false. That means virtio-scsi will reject new I/O requests to this SCSIDevice with VIRTIO_SCSI_S_BAD_TARGET even while virtio_scsi_hotunplug() is still executing. We are protected against new requests! 2. Add a call to scsi_device_purge_requests() from scsi_unrealize() so that in-flight requests are cancelled synchronously. This ensures that no in-flight requests remain once qdev_simple_device_unplug_cb() returns. Thanks to these two conditions we don't need aio_disable_external() anymore. 
Cc: Zhengui Li Signed-off-by: Stefan Hajnoczi --- hw/scsi/scsi-disk.c | 1 + hw/scsi/virtio-scsi.c | 3 --- 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c index 97c9b1c8cd..e01bd84541 100644 --- a/hw/scsi/scsi-disk.c +++ b/hw/scsi/scsi-disk.c @@ -2522,6 +2522,7 @@ static void scsi_realize(SCSIDevice *dev, Error **errp) static void scsi_unrealize(SCSIDevice *dev) { +scsi_device_purge_requests(dev, SENSE_CODE(RESET)); del_boot_device_lchs(&dev->qdev, NULL); } diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c index 000961446c..a02f9233ec 100644 --- a/hw/scsi/virtio-scsi.c +++ b/hw/scsi/virtio-scsi.c @@ -1061,11 +1061,8 @@ static void virtio_scsi_hotunplug(HotplugHandler *hotplug_dev, DeviceState *dev, VirtIODevice *vdev = VIRTIO_DEVICE(hotplug_dev); VirtIOSCSI *s = VIRTIO_SCSI(vdev); SCSIDevice *sd = SCSI_DEVICE(dev); -AioContext *ctx = s->ctx ?: qemu_get_aio_context(); -aio_disable_external(ctx); qdev_simple_device_unplug_cb(hotplug_dev, dev, errp); -aio_enable_external(ctx); if (s->ctx) { virtio_scsi_acquire(s); -- 2.39.2
[PATCH 01/13] virtio-scsi: avoid race between unplug and transport event
Only report a transport reset event to the guest after the SCSIDevice has been unrealized by qdev_simple_device_unplug_cb(). qdev_simple_device_unplug_cb() sets the SCSIDevice's qdev.realized field to false so that scsi_device_find/get() no longer see it. scsi_target_emulate_report_luns() also needs to be updated to filter out SCSIDevices that are unrealized. These changes ensure that the guest driver does not see the SCSIDevice that's being unplugged if it responds very quickly to the transport reset event. Signed-off-by: Stefan Hajnoczi --- hw/scsi/scsi-bus.c| 3 ++- hw/scsi/virtio-scsi.c | 18 +- 2 files changed, 11 insertions(+), 10 deletions(-) diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c index c97176110c..f9bd064833 100644 --- a/hw/scsi/scsi-bus.c +++ b/hw/scsi/scsi-bus.c @@ -487,7 +487,8 @@ static bool scsi_target_emulate_report_luns(SCSITargetReq *r) DeviceState *qdev = kid->child; SCSIDevice *dev = SCSI_DEVICE(qdev); -if (dev->channel == channel && dev->id == id && dev->lun != 0) { +if (dev->channel == channel && dev->id == id && dev->lun != 0 && +qatomic_load_acquire(&dev->qdev.realized)) { store_lun(tmp, dev->lun); g_byte_array_append(buf, tmp, 8); len += 8; diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c index 612c525d9d..000961446c 100644 --- a/hw/scsi/virtio-scsi.c +++ b/hw/scsi/virtio-scsi.c @@ -1063,15 +1063,6 @@ static void virtio_scsi_hotunplug(HotplugHandler *hotplug_dev, DeviceState *dev, SCSIDevice *sd = SCSI_DEVICE(dev); AioContext *ctx = s->ctx ?: qemu_get_aio_context(); -if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_HOTPLUG)) { -virtio_scsi_acquire(s); -virtio_scsi_push_event(s, sd, - VIRTIO_SCSI_T_TRANSPORT_RESET, - VIRTIO_SCSI_EVT_RESET_REMOVED); -scsi_bus_set_ua(&s->bus, SENSE_CODE(REPORTED_LUNS_CHANGED)); -virtio_scsi_release(s); -} - aio_disable_external(ctx); qdev_simple_device_unplug_cb(hotplug_dev, dev, errp); aio_enable_external(ctx); @@ -1082,6 +1073,15 @@ static void virtio_scsi_hotunplug(HotplugHandler 
*hotplug_dev, DeviceState *dev, blk_set_aio_context(sd->conf.blk, qemu_get_aio_context(), NULL); virtio_scsi_release(s); } + +if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_HOTPLUG)) { +virtio_scsi_acquire(s); +virtio_scsi_push_event(s, sd, + VIRTIO_SCSI_T_TRANSPORT_RESET, + VIRTIO_SCSI_EVT_RESET_REMOVED); +scsi_bus_set_ua(&s->bus, SENSE_CODE(REPORTED_LUNS_CHANGED)); +virtio_scsi_release(s); +} } static struct SCSIBusInfo virtio_scsi_scsi_info = { -- 2.39.2
[PATCH 09/13] hw/xen: do not set is_external=true on evtchn fds
is_external=true suspends fd handlers between aio_disable_external() and aio_enable_external(). The block layer's drain operation uses this mechanism to prevent new I/O from sneaking in between bdrv_drained_begin() and bdrv_drained_end(). The xen-block device actually works fine with is_external=false because BlockBackend requests are already queued between bdrv_drained_begin() and bdrv_drained_end(). Since the Xen ring size is finite, request queuing will stop once the ring is full and memory usage is bounded. After bdrv_drained_end() the BlockBackend requests will resume and xen-block's processing will continue. This is part of ongoing work to remove the aio_disable_external() API. Signed-off-by: Stefan Hajnoczi --- hw/xen/xen-bus.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/hw/xen/xen-bus.c b/hw/xen/xen-bus.c index c59850b1de..c4fd26abe1 100644 --- a/hw/xen/xen-bus.c +++ b/hw/xen/xen-bus.c @@ -842,11 +842,11 @@ void xen_device_set_event_channel_context(XenDevice *xendev, } if (channel->ctx) -aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), true, +aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), false, NULL, NULL, NULL, NULL, NULL); channel->ctx = ctx; -aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), true, +aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), false, xen_device_event, NULL, xen_device_poll, NULL, channel); } @@ -920,7 +920,7 @@ void xen_device_unbind_event_channel(XenDevice *xendev, QLIST_REMOVE(channel, list); -aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), true, +aio_set_fd_handler(channel->ctx, qemu_xen_evtchn_fd(channel->xeh), false, NULL, NULL, NULL, NULL, NULL); if (qemu_xen_evtchn_unbind(channel->xeh, channel->local_port) < 0) { -- 2.39.2
[PATCH 04/13] util/vhost-user-server: rename refcount to in_flight counter
The VuServer object has a refcount field and ref/unref APIs. The name is
confusing because it's actually an in-flight request counter instead of
a refcount. Normally a refcount destroys the object upon reaching zero.
The VuServer counter is used to wake up the vhost-user coroutine when
there are no more requests.

Avoid confusion by renaming refcount and ref/unref to in_flight and
inc/dec.

Signed-off-by: Stefan Hajnoczi
---
 include/qemu/vhost-user-server.h     |  6 +++---
 block/export/vhost-user-blk-server.c | 11 +++++++----
 util/vhost-user-server.c             | 14 +++++++-------
 3 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/include/qemu/vhost-user-server.h b/include/qemu/vhost-user-server.h
index 25c72433ca..bc0ac9ddb6 100644
--- a/include/qemu/vhost-user-server.h
+++ b/include/qemu/vhost-user-server.h
@@ -41,7 +41,7 @@ typedef struct {
     const VuDevIface *vu_iface;
 
     /* Protected by ctx lock */
-    unsigned int refcount;
+    unsigned int in_flight;
     bool wait_idle;
     VuDev vu_dev;
     QIOChannel *ioc; /* The I/O channel with the client */
@@ -60,8 +60,8 @@ bool vhost_user_server_start(VuServer *server,
 
 void vhost_user_server_stop(VuServer *server);
 
-void vhost_user_server_ref(VuServer *server);
-void vhost_user_server_unref(VuServer *server);
+void vhost_user_server_inc_in_flight(VuServer *server);
+void vhost_user_server_dec_in_flight(VuServer *server);
 
 void vhost_user_server_attach_aio_context(VuServer *server, AioContext *ctx);
 void vhost_user_server_detach_aio_context(VuServer *server);
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index 3409d9e02e..e93f2ed6b4 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -49,7 +49,10 @@ static void vu_blk_req_complete(VuBlkReq *req, size_t in_len)
     free(req);
 }
 
-/* Called with server refcount increased, must decrease before returning */
+/*
+ * Called with server in_flight counter increased, must decrease before
+ * returning.
+ */
 static void coroutine_fn vu_blk_virtio_process_req(void *opaque)
 {
     VuBlkReq *req = opaque;
@@ -67,12 +70,12 @@ static void coroutine_fn vu_blk_virtio_process_req(void *opaque)
                                  in_num, out_num);
     if (in_len < 0) {
         free(req);
-        vhost_user_server_unref(server);
+        vhost_user_server_dec_in_flight(server);
         return;
     }
 
     vu_blk_req_complete(req, in_len);
-    vhost_user_server_unref(server);
+    vhost_user_server_dec_in_flight(server);
 }
 
 static void vu_blk_process_vq(VuDev *vu_dev, int idx)
@@ -94,7 +97,7 @@ static void vu_blk_process_vq(VuDev *vu_dev, int idx)
         Coroutine *co =
             qemu_coroutine_create(vu_blk_virtio_process_req, req);
 
-        vhost_user_server_ref(server);
+        vhost_user_server_inc_in_flight(server);
         qemu_coroutine_enter(co);
     }
 }
diff --git a/util/vhost-user-server.c b/util/vhost-user-server.c
index 5b6216069c..1622f8cfb3 100644
--- a/util/vhost-user-server.c
+++ b/util/vhost-user-server.c
@@ -75,16 +75,16 @@ static void panic_cb(VuDev *vu_dev, const char *buf)
     error_report("vu_panic: %s", buf);
 }
 
-void vhost_user_server_ref(VuServer *server)
+void vhost_user_server_inc_in_flight(VuServer *server)
 {
     assert(!server->wait_idle);
-    server->refcount++;
+    server->in_flight++;
 }
 
-void vhost_user_server_unref(VuServer *server)
+void vhost_user_server_dec_in_flight(VuServer *server)
 {
-    server->refcount--;
-    if (server->wait_idle && !server->refcount) {
+    server->in_flight--;
+    if (server->wait_idle && !server->in_flight) {
         aio_co_wake(server->co_trip);
     }
 }
@@ -192,13 +192,13 @@ static coroutine_fn void vu_client_trip(void *opaque)
         /* Keep running */
     }
 
-    if (server->refcount) {
+    if (server->in_flight) {
         /* Wait for requests to complete before we can unmap the memory */
         server->wait_idle = true;
         qemu_coroutine_yield();
         server->wait_idle = false;
     }
-    assert(server->refcount == 0);
+    assert(server->in_flight == 0);
 
     vu_deinit(vu_dev);
-- 
2.39.2
[PATCH 00/13] block: remove aio_disable_external() API
The aio_disable_external() API temporarily suspends file descriptor
monitoring in the event loop. The block layer uses this to prevent new
I/O requests from being submitted from the guest and elsewhere between
bdrv_drained_begin() and bdrv_drained_end().

While the block layer still needs to prevent new I/O requests in drained
sections, the aio_disable_external() API can be replaced with
.drained_begin/end/poll() callbacks that have been added to
BdrvChildClass and BlockDevOps.

This newer .drained_begin/end/poll() approach is attractive because it
works without specifying a specific AioContext. The block layer is
moving towards multi-queue and that means multiple AioContexts may be
processing I/O simultaneously.

The aio_disable_external() API was always somewhat hacky. It suspends
all file descriptors that were registered with is_external=true, even if
they have nothing to do with the BlockDriverState graph nodes that are
being drained. It's better to solve a block layer problem in the block
layer than to have an odd event loop API solution.

That covers the motivation for this change, now on to the specifics of
this series:

While it would be nice if a single conceptual approach could be applied
to all is_external=true file descriptors, I ended up looking at callers
on a case-by-case basis. There are two general ways I migrated code away
from is_external=true:

1. Block exports are typically best off unregistering fds in
   .drained_begin() and registering them again in .drained_end(). The
   .drained_poll() function waits for in-flight requests to finish using
   a reference counter.

2. Emulated storage controllers like virtio-blk and virtio-scsi are a
   little simpler. They can rely on BlockBackend's request queuing
   during drain feature. Guest I/O request coroutines are suspended in a
   drained section and resume upon the end of the drained section.

The first two virtio-scsi patches were already sent as a separate
series.
I included them because they are necessary in order to fully remove aio_disable_external(). Based-on: 087bc644b7634436ca9d52fe58ba9234e2bef026 (kevin/block-next) Stefan Hajnoczi (13): virtio-scsi: avoid race between unplug and transport event virtio-scsi: stop using aio_disable_external() during unplug block/export: only acquire AioContext once for vhost_user_server_stop() util/vhost-user-server: rename refcount to in_flight counter block/export: wait for vhost-user-blk requests when draining block/export: stop using is_external in vhost-user-blk server virtio: do not set is_external=true on host notifiers hw/xen: do not use aio_set_fd_handler(is_external=true) in xen_xenstore hw/xen: do not set is_external=true on evtchn fds block/export: rewrite vduse-blk drain code block/fuse: take AioContext lock around blk_exp_ref/unref() block/fuse: do not set is_external=true on FUSE fd aio: remove aio_disable_external() API include/block/aio.h | 55 --- include/qemu/vhost-user-server.h | 8 +- util/aio-posix.h | 1 - block.c | 7 -- block/blkio.c| 15 +-- block/curl.c | 10 +- block/export/fuse.c | 62 - block/export/vduse-blk.c | 132 +++ block/export/vhost-user-blk-server.c | 73 +-- block/io.c | 2 - block/io_uring.c | 4 +- block/iscsi.c| 3 +- block/linux-aio.c| 4 +- block/nfs.c | 5 +- block/nvme.c | 8 +- block/ssh.c | 4 +- block/win32-aio.c| 6 +- hw/i386/kvm/xen_xenstore.c | 2 +- hw/scsi/scsi-bus.c | 3 +- hw/scsi/scsi-disk.c | 1 + hw/scsi/virtio-scsi.c| 21 ++--- hw/virtio/virtio.c | 6 +- hw/xen/xen-bus.c | 6 +- io/channel-command.c | 6 +- io/channel-file.c| 3 +- io/channel-socket.c | 3 +- migration/rdma.c | 16 ++-- tests/unit/test-aio.c| 27 +- tests/unit/test-fdmon-epoll.c| 73 --- util/aio-posix.c | 20 +--- util/aio-win32.c | 8 +- util/async.c | 3 +- util/fdmon-epoll.c | 10 -- util/fdmon-io_uring.c| 8 +- util/fdmon-poll.c| 3 +- util/main-loop.c | 7 +- util/qemu-coroutine-io.c | 7 +- util/vhost-user-server.c | 38 tests/unit/meson.build | 3 - 39 files changed, 298 insertions(+), 
375 deletions(-) delete mode 100644 tests/unit/test-fdmon-epoll.c -- 2.39.2
[PATCH 07/13] virtio: do not set is_external=true on host notifiers
Host notifiers trigger virtqueue processing. There are critical sections
when new I/O requests must not be submitted because they would cause
interference. In the past this was solved using aio_set_event_notifier()'s
is_external=true argument, which disables fd monitoring between
aio_disable/enable_external() calls. This API is not multi-queue block
layer friendly because it requires knowledge of the specific AioContext.
In a multi-queue block layer world any thread can submit I/O and we
don't know which AioContexts are currently involved.

virtio-blk and virtio-scsi are the only users that depend on
is_external=true. Both rely on the block layer, where we can take
advantage of the existing request queuing behavior that happens during
drained sections. The block layer's drained sections are the only user
of aio_disable_external().

After this patch the virtqueues will be processed during drained
sections, but submitted I/O requests will be queued in the BlockBackend.
Queued requests are resumed when the drained section ends. Therefore,
the BlockBackend is still quiesced during drained sections but we no
longer rely on is_external=true to achieve this.

Note that virtqueues have a finite size, so queuing requests does not
lead to unbounded memory usage.
Signed-off-by: Stefan Hajnoczi
---
 hw/virtio/virtio.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 98c4819fcc..dcd7aabb4e 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3491,7 +3491,7 @@ static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n)
 
 void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
-    aio_set_event_notifier(ctx, &vq->host_notifier, true,
+    aio_set_event_notifier(ctx, &vq->host_notifier, false,
                            virtio_queue_host_notifier_read,
                            virtio_queue_host_notifier_aio_poll,
                            virtio_queue_host_notifier_aio_poll_ready);
@@ -3508,14 +3508,14 @@ void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx)
  */
 void virtio_queue_aio_attach_host_notifier_no_poll(VirtQueue *vq, AioContext *ctx)
 {
-    aio_set_event_notifier(ctx, &vq->host_notifier, true,
+    aio_set_event_notifier(ctx, &vq->host_notifier, false,
                            virtio_queue_host_notifier_read,
                            NULL, NULL);
 }
 
 void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx)
 {
-    aio_set_event_notifier(ctx, &vq->host_notifier, true, NULL, NULL, NULL);
+    aio_set_event_notifier(ctx, &vq->host_notifier, false, NULL, NULL, NULL);
 
     /* Test and clear notifier before after disabling event,
      * in case poll callback didn't have time to run. */
     virtio_queue_host_notifier_read(&vq->host_notifier);
-- 
2.39.2
[PATCH 08/13] hw/xen: do not use aio_set_fd_handler(is_external=true) in xen_xenstore
There is no need to suspend activity between aio_disable_external() and
aio_enable_external(), which is mainly used for the block layer's drain
operation.

This is part of ongoing work to remove the aio_disable_external() API.

Signed-off-by: Stefan Hajnoczi
---
 hw/i386/kvm/xen_xenstore.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/kvm/xen_xenstore.c b/hw/i386/kvm/xen_xenstore.c
index 900679af8a..6e81bc8791 100644
--- a/hw/i386/kvm/xen_xenstore.c
+++ b/hw/i386/kvm/xen_xenstore.c
@@ -133,7 +133,7 @@ static void xen_xenstore_realize(DeviceState *dev, Error **errp)
         error_setg(errp, "Xenstore evtchn port init failed");
         return;
     }
-    aio_set_fd_handler(qemu_get_aio_context(), xen_be_evtchn_fd(s->eh), true,
+    aio_set_fd_handler(qemu_get_aio_context(), xen_be_evtchn_fd(s->eh), false,
                        xen_xenstore_event, NULL, NULL, NULL, s);
 
     s->impl = xs_impl_create(xen_domid);
-- 
2.39.2
[PATCH 12/13] block/fuse: do not set is_external=true on FUSE fd
This is part of ongoing work to remove the aio_disable_external() API. Use BlockDevOps .drained_begin/end/poll() instead of aio_set_fd_handler(is_external=true). As a side-effect the FUSE export now follows AioContext changes like the other export types. Signed-off-by: Stefan Hajnoczi --- block/export/fuse.c | 58 +++-- 1 file changed, 56 insertions(+), 2 deletions(-) diff --git a/block/export/fuse.c b/block/export/fuse.c index 18394f9e07..83bccf046b 100644 --- a/block/export/fuse.c +++ b/block/export/fuse.c @@ -50,6 +50,7 @@ typedef struct FuseExport { struct fuse_session *fuse_session; struct fuse_buf fuse_buf; +unsigned int in_flight; /* atomic */ bool mounted, fd_handler_set_up; char *mountpoint; @@ -78,6 +79,42 @@ static void read_from_fuse_export(void *opaque); static bool is_regular_file(const char *path, Error **errp); +static void fuse_export_drained_begin(void *opaque) +{ +FuseExport *exp = opaque; + +aio_set_fd_handler(exp->common.ctx, + fuse_session_fd(exp->fuse_session), false, + NULL, NULL, NULL, NULL, NULL); +exp->fd_handler_set_up = false; +} + +static void fuse_export_drained_end(void *opaque) +{ +FuseExport *exp = opaque; + +/* Refresh AioContext in case it changed */ +exp->common.ctx = blk_get_aio_context(exp->common.blk); + +aio_set_fd_handler(exp->common.ctx, + fuse_session_fd(exp->fuse_session), false, + read_from_fuse_export, NULL, NULL, NULL, exp); +exp->fd_handler_set_up = true; +} + +static bool fuse_export_drained_poll(void *opaque) +{ +FuseExport *exp = opaque; + +return qatomic_read(&exp->in_flight) > 0; +} + +static const BlockDevOps fuse_export_blk_dev_ops = { +.drained_begin = fuse_export_drained_begin, +.drained_end = fuse_export_drained_end, +.drained_poll = fuse_export_drained_poll, +}; + static int fuse_export_create(BlockExport *blk_exp, BlockExportOptions *blk_exp_args, Error **errp) @@ -101,6 +138,15 @@ static int fuse_export_create(BlockExport *blk_exp, } } +blk_set_dev_ops(exp->common.blk, &fuse_export_blk_dev_ops, exp); + 
+/* + * We handle draining ourselves using an in-flight counter and by disabling + * the FUSE fd handler. Do not queue BlockBackend requests, they need to + * complete so the in-flight counter reaches zero. + */ +blk_set_disable_request_queuing(exp->common.blk, true); + init_exports_table(); /* @@ -224,7 +270,7 @@ static int setup_fuse_export(FuseExport *exp, const char *mountpoint, g_hash_table_insert(exports, g_strdup(mountpoint), NULL); aio_set_fd_handler(exp->common.ctx, - fuse_session_fd(exp->fuse_session), true, + fuse_session_fd(exp->fuse_session), false, read_from_fuse_export, NULL, NULL, NULL, exp); exp->fd_handler_set_up = true; @@ -248,6 +294,8 @@ static void read_from_fuse_export(void *opaque) blk_exp_ref(&exp->common); aio_context_release(exp->common.ctx); +qatomic_inc(&exp->in_flight); + do { ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf); } while (ret == -EINTR); @@ -258,6 +306,10 @@ static void read_from_fuse_export(void *opaque) fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf); out: +if (qatomic_fetch_dec(&exp->in_flight) == 1) { +aio_wait_kick(); /* wake AIO_WAIT_WHILE() */ +} + aio_context_acquire(exp->common.ctx); blk_exp_unref(&exp->common); aio_context_release(exp->common.ctx); @@ -272,7 +324,7 @@ static void fuse_export_shutdown(BlockExport *blk_exp) if (exp->fd_handler_set_up) { aio_set_fd_handler(exp->common.ctx, - fuse_session_fd(exp->fuse_session), true, + fuse_session_fd(exp->fuse_session), false, NULL, NULL, NULL, NULL, NULL); exp->fd_handler_set_up = false; } @@ -291,6 +343,8 @@ static void fuse_export_delete(BlockExport *blk_exp) { FuseExport *exp = container_of(blk_exp, FuseExport, common); +blk_set_dev_ops(exp->common.blk, NULL, NULL); + if (exp->fuse_session) { if (exp->mounted) { fuse_session_unmount(exp->fuse_session); -- 2.39.2
[PATCH 10/13] block/export: rewrite vduse-blk drain code
vduse_blk_detach_ctx() waits for in-flight requests using AIO_WAIT_WHILE(). This is not allowed according to a comment in bdrv_set_aio_context_commit(): /* * Take the old AioContex when detaching it from bs. * At this point, new_context lock is already acquired, and we are now * also taking old_context. This is safe as long as bdrv_detach_aio_context * does not call AIO_POLL_WHILE(). */ Use this opportunity to rewrite the drain code in vduse-blk: - Use the BlockExport refcount so that vduse_blk_exp_delete() is only called when there are no more requests in flight. - Implement .drained_poll() so in-flight request coroutines are stopped by the time .bdrv_detach_aio_context() is called. - Remove AIO_WAIT_WHILE() from vduse_blk_detach_ctx() to solve the .bdrv_detach_aio_context() constraint violation. It's no longer needed due to the previous changes. - Always handle the VDUSE file descriptor, even in drained sections. The VDUSE file descriptor doesn't submit I/O, so it's safe to handle it in drained sections. This ensures that the VDUSE kernel code gets a fast response. - Suspend virtqueue fd handlers in .drained_begin() and resume them in .drained_end(). This eliminates the need for the aio_set_fd_handler(is_external=true) flag, which is being removed from QEMU. This is a long list but splitting it into individual commits would probably lead to git bisect failures - the changes are all related. 
Signed-off-by: Stefan Hajnoczi --- block/export/vduse-blk.c | 132 +++ 1 file changed, 93 insertions(+), 39 deletions(-) diff --git a/block/export/vduse-blk.c b/block/export/vduse-blk.c index f7ae44e3ce..35dc8fcf45 100644 --- a/block/export/vduse-blk.c +++ b/block/export/vduse-blk.c @@ -31,7 +31,8 @@ typedef struct VduseBlkExport { VduseDev *dev; uint16_t num_queues; char *recon_file; -unsigned int inflight; +unsigned int inflight; /* atomic */ +bool vqs_started; } VduseBlkExport; typedef struct VduseBlkReq { @@ -41,13 +42,24 @@ typedef struct VduseBlkReq { static void vduse_blk_inflight_inc(VduseBlkExport *vblk_exp) { -vblk_exp->inflight++; +if (qatomic_fetch_inc(&vblk_exp->inflight) == 0) { +/* Prevent export from being deleted */ +aio_context_acquire(vblk_exp->export.ctx); +blk_exp_ref(&vblk_exp->export); +aio_context_release(vblk_exp->export.ctx); +} } static void vduse_blk_inflight_dec(VduseBlkExport *vblk_exp) { -if (--vblk_exp->inflight == 0) { +if (qatomic_fetch_dec(&vblk_exp->inflight) == 1) { +/* Wake AIO_WAIT_WHILE() */ aio_wait_kick(); + +/* Now the export can be deleted */ +aio_context_acquire(vblk_exp->export.ctx); +blk_exp_unref(&vblk_exp->export); +aio_context_release(vblk_exp->export.ctx); } } @@ -124,8 +136,12 @@ static void vduse_blk_enable_queue(VduseDev *dev, VduseVirtq *vq) { VduseBlkExport *vblk_exp = vduse_dev_get_priv(dev); +if (!vblk_exp->vqs_started) { +return; /* vduse_blk_drained_end() will start vqs later */ +} + aio_set_fd_handler(vblk_exp->export.ctx, vduse_queue_get_fd(vq), - true, on_vduse_vq_kick, NULL, NULL, NULL, vq); + false, on_vduse_vq_kick, NULL, NULL, NULL, vq); /* Make sure we don't miss any kick afer reconnecting */ eventfd_write(vduse_queue_get_fd(vq), 1); } @@ -133,9 +149,14 @@ static void vduse_blk_enable_queue(VduseDev *dev, VduseVirtq *vq) static void vduse_blk_disable_queue(VduseDev *dev, VduseVirtq *vq) { VduseBlkExport *vblk_exp = vduse_dev_get_priv(dev); +int fd = vduse_queue_get_fd(vq); 
-aio_set_fd_handler(vblk_exp->export.ctx, vduse_queue_get_fd(vq), - true, NULL, NULL, NULL, NULL, NULL); +if (fd < 0) { +return; +} + +aio_set_fd_handler(vblk_exp->export.ctx, fd, false, + NULL, NULL, NULL, NULL, NULL); } static const VduseOps vduse_blk_ops = { @@ -152,42 +173,19 @@ static void on_vduse_dev_kick(void *opaque) static void vduse_blk_attach_ctx(VduseBlkExport *vblk_exp, AioContext *ctx) { -int i; - aio_set_fd_handler(vblk_exp->export.ctx, vduse_dev_get_fd(vblk_exp->dev), - true, on_vduse_dev_kick, NULL, NULL, NULL, + false, on_vduse_dev_kick, NULL, NULL, NULL, vblk_exp->dev); -for (i = 0; i < vblk_exp->num_queues; i++) { -VduseVirtq *vq = vduse_dev_get_queue(vblk_exp->dev, i); -int fd = vduse_queue_get_fd(vq); - -if (fd < 0) { -continue; -} -aio_set_fd_handler(vblk_exp->export.ctx, fd, true, - on_vduse_vq_kick, NULL, NULL, NULL, vq); -} +/* Virtqueues are handled by vduse_blk_drained_end(
[PATCH 13/13] aio: remove aio_disable_external() API
All callers now pass is_external=false to aio_set_fd_handler() and aio_set_event_notifier(). The aio_disable_external() API that temporarily disables fd handlers that were registered is_external=true is therefore dead code. Remove aio_disable_external(), aio_enable_external(), and the is_external arguments to aio_set_fd_handler() and aio_set_event_notifier(). The entire test-fdmon-epoll test is removed because its sole purpose was testing aio_disable_external(). Parts of this patch were generated using the following coccinelle (https://coccinelle.lip6.fr/) semantic patch: @@ expression ctx, fd, is_external, io_read, io_write, io_poll, io_poll_ready, opaque; @@ - aio_set_fd_handler(ctx, fd, is_external, io_read, io_write, io_poll, io_poll_ready, opaque) + aio_set_fd_handler(ctx, fd, io_read, io_write, io_poll, io_poll_ready, opaque) @@ expression ctx, notifier, is_external, io_read, io_poll, io_poll_ready; @@ - aio_set_event_notifier(ctx, notifier, is_external, io_read, io_poll, io_poll_ready) + aio_set_event_notifier(ctx, notifier, io_read, io_poll, io_poll_ready) Signed-off-by: Stefan Hajnoczi --- include/block/aio.h | 55 -- util/aio-posix.h | 1 - block.c | 7 block/blkio.c | 15 +++ block/curl.c | 10 ++--- block/export/fuse.c | 8 ++-- block/export/vduse-blk.c | 10 ++--- block/io.c| 2 - block/io_uring.c | 4 +- block/iscsi.c | 3 +- block/linux-aio.c | 4 +- block/nfs.c | 5 +-- block/nvme.c | 8 ++-- block/ssh.c | 4 +- block/win32-aio.c | 6 +-- hw/i386/kvm/xen_xenstore.c| 2 +- hw/virtio/virtio.c| 6 +-- hw/xen/xen-bus.c | 6 +-- io/channel-command.c | 6 +-- io/channel-file.c | 3 +- io/channel-socket.c | 3 +- migration/rdma.c | 16 tests/unit/test-aio.c | 27 + tests/unit/test-fdmon-epoll.c | 73 --- util/aio-posix.c | 20 +++--- util/aio-win32.c | 8 +--- util/async.c | 3 +- util/fdmon-epoll.c| 10 - util/fdmon-io_uring.c | 8 +--- util/fdmon-poll.c | 3 +- util/main-loop.c | 7 ++-- util/qemu-coroutine-io.c | 7 ++-- util/vhost-user-server.c | 11 +++--- tests/unit/meson.build| 3 
-- 34 files changed, 75 insertions(+), 289 deletions(-) delete mode 100644 tests/unit/test-fdmon-epoll.c diff --git a/include/block/aio.h b/include/block/aio.h index e267d918fd..d4ce01ea08 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -467,7 +467,6 @@ bool aio_poll(AioContext *ctx, bool blocking); */ void aio_set_fd_handler(AioContext *ctx, int fd, -bool is_external, IOHandler *io_read, IOHandler *io_write, AioPollFn *io_poll, @@ -483,7 +482,6 @@ void aio_set_fd_handler(AioContext *ctx, */ void aio_set_event_notifier(AioContext *ctx, EventNotifier *notifier, -bool is_external, EventNotifierHandler *io_read, AioPollFn *io_poll, EventNotifierHandler *io_poll_ready); @@ -612,59 +610,6 @@ static inline void aio_timer_init(AioContext *ctx, */ int64_t aio_compute_timeout(AioContext *ctx); -/** - * aio_disable_external: - * @ctx: the aio context - * - * Disable the further processing of external clients. - */ -static inline void aio_disable_external(AioContext *ctx) -{ -qatomic_inc(&ctx->external_disable_cnt); -} - -/** - * aio_enable_external: - * @ctx: the aio context - * - * Enable the processing of external clients. - */ -static inline void aio_enable_external(AioContext *ctx) -{ -int old; - -old = qatomic_fetch_dec(&ctx->external_disable_cnt); -assert(old > 0); -if (old == 1) { -/* Kick event loop so it re-arms file descriptors */ -aio_notify(ctx); -} -} - -/** - * aio_external_disabled: - * @ctx: the aio context - * - * Return true if the external clients are disabled. - */ -static inline bool aio_external_disabled(AioContext *ctx) -{ -return qatomic_read(&ctx->external_disable_cnt); -} - -/** - * aio_node_check: - * @ctx: the aio context - * @is_external: Whether or not the checked node is an external event source. - * - * Check if the node's is_external flag is okay to be polled by the ctx at this - * moment. True means green light. 
- */ -static inline bool aio_node_check(AioContext *ctx, bool is_external) -{ -return !is_external || !qatomic_read(&ctx->external_disable_cnt); -} - /**
[PATCH 11/13] block/fuse: take AioContext lock around blk_exp_ref/unref()
These functions must be called with the AioContext acquired:

  /* Callers must hold exp->ctx lock */
  void blk_exp_ref(BlockExport *exp)
  ...
  /* Callers must hold exp->ctx lock */
  void blk_exp_unref(BlockExport *exp)

Signed-off-by: Stefan Hajnoczi
---
 block/export/fuse.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 06fa41079e..18394f9e07 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -244,7 +244,9 @@ static void read_from_fuse_export(void *opaque)
     FuseExport *exp = opaque;
     int ret;
 
+    aio_context_acquire(exp->common.ctx);
     blk_exp_ref(&exp->common);
+    aio_context_release(exp->common.ctx);
 
     do {
         ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf);
@@ -256,7 +258,9 @@ static void read_from_fuse_export(void *opaque)
     fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
 
 out:
+    aio_context_acquire(exp->common.ctx);
     blk_exp_unref(&exp->common);
+    aio_context_release(exp->common.ctx);
 }
 
 static void fuse_export_shutdown(BlockExport *blk_exp)
-- 
2.39.2
Re: [PATCH 01/13] virtio-scsi: avoid race between unplug and transport event
On Mon, Apr 03, 2023 at 10:47:11PM +0200, Philippe Mathieu-Daudé wrote: > On 3/4/23 20:29, Stefan Hajnoczi wrote: > > Only report a transport reset event to the guest after the SCSIDevice > > has been unrealized by qdev_simple_device_unplug_cb(). > > > > qdev_simple_device_unplug_cb() sets the SCSIDevice's qdev.realized field > > to false so that scsi_device_find/get() no longer see it. > > > > scsi_target_emulate_report_luns() also needs to be updated to filter out > > SCSIDevices that are unrealized. > > > > These changes ensure that the guest driver does not see the SCSIDevice > > that's being unplugged if it responds very quickly to the transport > > reset event. > > > > Signed-off-by: Stefan Hajnoczi > > --- > > hw/scsi/scsi-bus.c| 3 ++- > > hw/scsi/virtio-scsi.c | 18 +- > > 2 files changed, 11 insertions(+), 10 deletions(-) > > > > diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c > > index c97176110c..f9bd064833 100644 > > --- a/hw/scsi/scsi-bus.c > > +++ b/hw/scsi/scsi-bus.c > > @@ -487,7 +487,8 @@ static bool > > scsi_target_emulate_report_luns(SCSITargetReq *r) > > DeviceState *qdev = kid->child; > > SCSIDevice *dev = SCSI_DEVICE(qdev); > > -if (dev->channel == channel && dev->id == id && dev->lun != 0) > > { > > +if (dev->channel == channel && dev->id == id && dev->lun != 0 > > && > > +qatomic_load_acquire(&dev->qdev.realized)) { > > Would this be more useful as a qdev_is_realized() helper? Yes. There are no other users, but I think a helper makes sense. Stefan signature.asc Description: PGP signature
Re: [PATCH 11/13] block/fuse: take AioContext lock around blk_exp_ref/unref()
On Tue, Apr 04, 2023 at 03:46:34PM +0200, Paolo Bonzini wrote: > On 4/3/23 20:30, Stefan Hajnoczi wrote: > > These functions must be called with the AioContext acquired: > > > >/* Callers must hold exp->ctx lock */ > >void blk_exp_ref(BlockExport *exp) > >... > >/* Callers must hold exp->ctx lock */ > >void blk_exp_unref(BlockExport *exp) > > > > Signed-off-by: Stefan Hajnoczi > > --- > > block/export/fuse.c | 4 > > 1 file changed, 4 insertions(+) > > > > diff --git a/block/export/fuse.c b/block/export/fuse.c > > index 06fa41079e..18394f9e07 100644 > > --- a/block/export/fuse.c > > +++ b/block/export/fuse.c > > @@ -244,7 +244,9 @@ static void read_from_fuse_export(void *opaque) > > FuseExport *exp = opaque; > > int ret; > > +aio_context_acquire(exp->common.ctx); > > blk_exp_ref(&exp->common); > > +aio_context_release(exp->common.ctx); > > do { > > ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf); > > @@ -256,7 +258,9 @@ static void read_from_fuse_export(void *opaque) > > fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf); > > out: > > +aio_context_acquire(exp->common.ctx); > > blk_exp_unref(&exp->common); > > +aio_context_release(exp->common.ctx); > > } > > Since the actual thread-unsafe work is done in a bottom half, perhaps > instead you can use qatomic_inc and qatomic_fetch_dec in > blk_exp_{ref,unref}? Sure, I'll give that a try in the next revision. Stefan signature.asc Description: PGP signature
Re: [PATCH 00/13] block: remove aio_disable_external() API
On Tue, Apr 04, 2023 at 03:43:20PM +0200, Paolo Bonzini wrote: > On 4/3/23 20:29, Stefan Hajnoczi wrote: > > The aio_disable_external() API temporarily suspends file descriptor > > monitoring > > in the event loop. The block layer uses this to prevent new I/O requests > > being > > submitted from the guest and elsewhere between bdrv_drained_begin() and > > bdrv_drained_end(). > > > > While the block layer still needs to prevent new I/O requests in drained > > sections, the aio_disable_external() API can be replaced with > > .drained_begin/end/poll() callbacks that have been added to BdrvChildClass > > and > > BlockDevOps. > > > > This newer .bdrained_begin/end/poll() approach is attractive because it > > works > > without specifying a specific AioContext. The block layer is moving towards > > multi-queue and that means multiple AioContexts may be processing I/O > > simultaneously. > > > > The aio_disable_external() was always somewhat hacky. It suspends all file > > descriptors that were registered with is_external=true, even if they have > > nothing to do with the BlockDriverState graph nodes that are being drained. > > It's better to solve a block layer problem in the block layer than to have > > an > > odd event loop API solution. > > > > That covers the motivation for this change, now on to the specifics of this > > series: > > > > While it would be nice if a single conceptual approach could be applied to > > all > > is_external=true file descriptors, I ended up looking at callers on a > > case-by-case basis. There are two general ways I migrated code away from > > is_external=true: > > > > 1. Block exports are typically best off unregistering fds in > > .drained_begin() > > and registering them again in .drained_end(). The .drained_poll() > > function > > waits for in-flight requests to finish using a reference counter. > > > > 2. Emulated storage controllers like virtio-blk and virtio-scsi are a little > > simpler. 
They can rely on BlockBackend's request queuing during drain > > feature. Guest I/O request coroutines are suspended in a drained > > section and > > resume upon the end of the drained section. > > Sorry, I disagree with this. > > Request queuing was shown to cause deadlocks; Hanna's latest patch is piling > another hack upon it, instead in my opinion we should go in the direction of > relying _less_ (or not at all) on request queuing. > > I am strongly convinced that request queuing must apply only after > bdrv_drained_begin has returned, which would also fix the IDE TRIM bug > reported by Fiona Ebner. The possible livelock scenario is generally not a > problem because 1) outside an iothread you have anyway the BQL that prevents > a vCPU from issuing more I/O operations during bdrv_drained_begin 2) in > iothreads you have aio_disable_external() instead of .drained_begin(). > > It is also less tidy to start a request during the drained_begin phase, > because a request that has been submitted has to be completed (cancel > doesn't really work). > > So in an ideal world, request queuing would not only apply only after > bdrv_drained_begin has returned, it would log a warning and .drained_begin() > should set up things so that there are no such warnings. That's fine, I will give .drained_begin/end/poll() a try with virtio-blk and virtio-scsi in the next revision. Stefan signature.asc Description: PGP signature
Re: Enabling hypervisor agnosticism for VirtIO backends
On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote: > > Could we consider the kernel internally converting IOREQ messages from > > the Xen hypervisor to eventfd events? Would this scale with other kernel > > hypercall interfaces? > > > > So any thoughts on what directions are worth experimenting with? > > One option we should consider is for each backend to connect to Xen via > the IOREQ interface. We could generalize the IOREQ interface and make it > hypervisor agnostic. The interface is really trivial and easy to add. > The only Xen-specific part is the notification mechanism, which is an > event channel. If we replaced the event channel with something else the > interface would be generic. See: > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52 There have been experiments with something kind of similar in KVM recently (see struct ioregionfd_cmd): https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanas...@gmail.com/ > There is also another problem. IOREQ is probably not be the only > interface needed. Have a look at > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need > an interface for the backend to inject interrupts into the frontend? And > if the backend requires dynamic memory mappings of frontend pages, then > we would also need an interface to map/unmap domU pages. > > These interfaces are a lot more problematic than IOREQ: IOREQ is tiny > and self-contained. It is easy to add anywhere. A new interface to > inject interrupts or map pages is more difficult to manage because it > would require changes scattered across the various emulators. Something like ioreq is indeed necessary to implement arbitrary devices, but if you are willing to restrict yourself to VIRTIO then other interfaces are possible too because the VIRTIO device model is different from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support. 
Stefan signature.asc Description: PGP signature
Re: Enabling hypervisor agnosticism for VirtIO backends
On Mon, Aug 23, 2021 at 03:25:00PM +0900, AKASHI Takahiro wrote: > Hi Stefan, > > On Tue, Aug 17, 2021 at 11:41:01AM +0100, Stefan Hajnoczi wrote: > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote: > > > > Could we consider the kernel internally converting IOREQ messages from > > > > the Xen hypervisor to eventfd events? Would this scale with other kernel > > > > hypercall interfaces? > > > > > > > > So any thoughts on what directions are worth experimenting with? > > > > > > One option we should consider is for each backend to connect to Xen via > > > the IOREQ interface. We could generalize the IOREQ interface and make it > > > hypervisor agnostic. The interface is really trivial and easy to add. > > > The only Xen-specific part is the notification mechanism, which is an > > > event channel. If we replaced the event channel with something else the > > > interface would be generic. See: > > > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52 > > > > There have been experiments with something kind of similar in KVM > > recently (see struct ioregionfd_cmd): > > https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanas...@gmail.com/ > > Do you know the current status of Elena's work? > It was last February that she posted her latest patch > and it has not been merged upstream yet. Elena worked on this during her Outreachy internship. At the moment no one is actively working on the patches. > > > There is also another problem. IOREQ is probably not be the only > > > interface needed. Have a look at > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also need > > > an interface for the backend to inject interrupts into the frontend? And > > > if the backend requires dynamic memory mappings of frontend pages, then > > > we would also need an interface to map/unmap domU pages. 
> > > > > > These interfaces are a lot more problematic than IOREQ: IOREQ is tiny > > > and self-contained. It is easy to add anywhere. A new interface to > > > inject interrupts or map pages is more difficult to manage because it > > > would require changes scattered across the various emulators. > > > > Something like ioreq is indeed necessary to implement arbitrary devices, > > but if you are willing to restrict yourself to VIRTIO then other > > interfaces are possible too because the VIRTIO device model is different > > from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support. > > Can you please elaborate your thoughts a bit more here? > > It seems to me that trapping MMIOs to configuration space and > forwarding those events to BE (or device emulation) is a quite > straight-forward way to emulate device MMIOs. > Or do you think of something of protocols used in vhost-user? > > # On the contrary, virtio-ivshmem only requires a driver to explicitly > # forward a "write" request of MMIO accesses to BE. But I don't think > # it's your point. See my first reply to this email thread about alternative interfaces for VIRTIO device emulation. The main thing to note was that although the shared memory vring is used by VIRTIO transports today, the device model actually allows transports to implement virtqueues differently (e.g. making it possible to create a VIRTIO over TCP transport without shared memory in the future). It's possible to define a hypercall interface as a new VIRTIO transport that provides higher-level virtqueue operations. Doing this is more work than using vrings though since existing guest driver and device emulation code already supports vrings. I don't know the requirements of Stratos so I can't say if creating a new hypervisor-independent interface (VIRTIO transport) that doesn't rely on shared memory vrings makes sense. I just wanted to raise the idea in case you find that VIRTIO's vrings don't meet your requirements. 
Stefan
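Stefan's idea above — that a VIRTIO transport is free to implement virtqueues with higher-level operations instead of shared-memory vrings — can be pictured as an ops table that guest driver code programs against. The sketch below is purely illustrative: every type and function name is invented for this example and corresponds to nothing in the VIRTIO spec or any hypervisor ABI. The "loopback" transport completes requests in memory; a real transport could issue a hypercall, write to a vring, or send over TCP behind the same interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical transport-neutral virtqueue interface (names invented for
 * illustration only).
 */
typedef struct VirtqueueTransportOps {
    /* Hand a request buffer to the device side of virtqueue @vq */
    int (*submit)(void *opaque, uint16_t vq, const void *buf, size_t len);
    /* Fetch the next completed buffer from virtqueue @vq, if any */
    int (*poll_completion)(void *opaque, uint16_t vq, void *buf, size_t len);
} VirtqueueTransportOps;

/* Toy "loopback" transport that completes each request immediately */
typedef struct {
    uint8_t last[64];
    size_t last_len;
} LoopbackTransport;

static int loopback_submit(void *opaque, uint16_t vq, const void *buf,
                           size_t len)
{
    LoopbackTransport *t = opaque;
    (void)vq;
    if (len > sizeof(t->last)) {
        return -1;
    }
    memcpy(t->last, buf, len);
    t->last_len = len;
    return 0;
}

static int loopback_poll_completion(void *opaque, uint16_t vq, void *buf,
                                    size_t len)
{
    LoopbackTransport *t = opaque;
    (void)vq;
    if (len < t->last_len) {
        return -1;
    }
    memcpy(buf, t->last, t->last_len);
    return (int)t->last_len;
}

static const VirtqueueTransportOps loopback_ops = {
    .submit = loopback_submit,
    .poll_completion = loopback_poll_completion,
};

/* Driver code written against the ops table never sees the transport */
static int driver_echo(const VirtqueueTransportOps *ops, void *opaque,
                       void *out, size_t out_len)
{
    if (ops->submit(opaque, 0, "ping", 4) < 0) {
        return -1;
    }
    return ops->poll_completion(opaque, 0, out, out_len);
}
```

The point of the sketch is only that the submission/completion contract, not the vring memory layout, is what a hypervisor-agnostic backend would need to standardize.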
Re: Enabling hypervisor agnosticism for VirtIO backends
On Wed, Aug 25, 2021 at 07:29:45PM +0900, AKASHI Takahiro wrote: > On Mon, Aug 23, 2021 at 10:58:46AM +0100, Stefan Hajnoczi wrote: > > On Mon, Aug 23, 2021 at 03:25:00PM +0900, AKASHI Takahiro wrote: > > > Hi Stefan, > > > > > > On Tue, Aug 17, 2021 at 11:41:01AM +0100, Stefan Hajnoczi wrote: > > > > On Wed, Aug 04, 2021 at 12:20:01PM -0700, Stefano Stabellini wrote: > > > > > > Could we consider the kernel internally converting IOREQ messages > > > > > > from > > > > > > the Xen hypervisor to eventfd events? Would this scale with other > > > > > > kernel > > > > > > hypercall interfaces? > > > > > > > > > > > > So any thoughts on what directions are worth experimenting with? > > > > > > > > > > One option we should consider is for each backend to connect to Xen > > > > > via > > > > > the IOREQ interface. We could generalize the IOREQ interface and make > > > > > it > > > > > hypervisor agnostic. The interface is really trivial and easy to add. > > > > > The only Xen-specific part is the notification mechanism, which is an > > > > > event channel. If we replaced the event channel with something else > > > > > the > > > > > interface would be generic. See: > > > > > https://gitlab.com/xen-project/xen/-/blob/staging/xen/include/public/hvm/ioreq.h#L52 > > > > > > > > There have been experiments with something kind of similar in KVM > > > > recently (see struct ioregionfd_cmd): > > > > https://lore.kernel.org/kvm/dad3d025bcf15ece11d9df0ff685e8ab0a4f2edd.1613828727.git.eafanas...@gmail.com/ > > > > > > Do you know the current status of Elena's work? > > > It was last February that she posted her latest patch > > > and it has not been merged upstream yet. > > > > Elena worked on this during her Outreachy internship. At the moment no > > one is actively working on the patches. > > Does RedHat plan to take over or follow up her work hereafter? > # I'm simply asking from my curiosity. At the moment I'm not aware of anyone from Red Hat working on it. 
If someone decides they need this KVM API then that could change. > > > > > There is also another problem. IOREQ is probably not be the only > > > > > interface needed. Have a look at > > > > > https://marc.info/?l=xen-devel&m=162373754705233&w=2. Don't we also > > > > > need > > > > > an interface for the backend to inject interrupts into the frontend? > > > > > And > > > > > if the backend requires dynamic memory mappings of frontend pages, > > > > > then > > > > > we would also need an interface to map/unmap domU pages. > > > > > > > > > > These interfaces are a lot more problematic than IOREQ: IOREQ is tiny > > > > > and self-contained. It is easy to add anywhere. A new interface to > > > > > inject interrupts or map pages is more difficult to manage because it > > > > > would require changes scattered across the various emulators. > > > > > > > > Something like ioreq is indeed necessary to implement arbitrary devices, > > > > but if you are willing to restrict yourself to VIRTIO then other > > > > interfaces are possible too because the VIRTIO device model is different > > > > from the general purpose x86 PIO/MMIO that Xen's ioreq seems to support. > > > > > > Can you please elaborate your thoughts a bit more here? > > > > > > It seems to me that trapping MMIOs to configuration space and > > > forwarding those events to BE (or device emulation) is a quite > > > straight-forward way to emulate device MMIOs. > > > Or do you think of something of protocols used in vhost-user? > > > > > > # On the contrary, virtio-ivshmem only requires a driver to explicitly > > > # forward a "write" request of MMIO accesses to BE. But I don't think > > > # it's your point. > > > > See my first reply to this email thread about alternative interfaces for > > VIRTIO device emulation. The main thing to note was that although the > > shared memory vring is used by VIRTIO transports today, the device model > > actually allows transports to implement virtqueues differently (e.g. 
> > making it possible to create a VIRTIO over TCP transport without shared > > memory i
Re: [PATCH 08/26] virtio_blk: remove virtblk_update_cache_mode
On Tue, Jun 11, 2024 at 07:19:08AM +0200, Christoph Hellwig wrote: > virtblk_update_cache_mode boils down to a single call to > blk_queue_write_cache. Remove it in preparation for moving the cache > control flags into the queue_limits. > > Signed-off-by: Christoph Hellwig > --- > drivers/block/virtio_blk.c | 13 +++-- > 1 file changed, 3 insertions(+), 10 deletions(-) Reviewed-by: Stefan Hajnoczi
Re: [PATCH v3 0/6] block: add blk_io_plug_call() API
Hi Kevin, Do you want to review the thread-local blk_io_plug() patch series or should I merge it? Thanks, Stefan
Re: [PATCH v3 0/6] block: add blk_io_plug_call() API
On Tue, May 30, 2023 at 02:09:53PM -0400, Stefan Hajnoczi wrote: > v3 > - Patch 5: Mention why dev_max_batch condition was dropped [Stefano] > v2 > - Patch 1: "is not be freed" -> "is not freed" [Eric] > - Patch 2: Remove unused nvme_process_completion_queue_plugged trace event > [Stefano] > - Patch 3: Add missing #include and fix blkio_unplug_fn() prototype [Stefano] > - Patch 4: Removed whitespace hunk [Eric] > > The existing blk_io_plug() API is not block layer multi-queue friendly because > the plug state is per-BlockDriverState. > > Change blk_io_plug()'s implementation so it is thread-local. This is done by > introducing the blk_io_plug_call() function that block drivers use to batch > calls while plugged. It is relatively easy to convert block drivers from > .bdrv_co_io_plug() to blk_io_plug_call(). > > Random read 4KB performance with virtio-blk on a host NVMe block device: > > iodepth iops change vs today > 1 45612 -4% > 2 87967 +2% > 4 129872 +0% > 8 171096 -3% > 16 194508 -4% > 32 208947 -1% > 64 217647 +0% > 128 229629 +0% > > The results are within the noise for these benchmarks. This is to be expected > because the plugging behavior for a single thread hasn't changed in this patch > series, only that the state is thread-local now. > > The following graph compares several approaches: > https://vmsplice.net/~stefan/blk_io_plug-thread-local.png > - v7.2.0: before most of the multi-queue block layer changes landed. > - with-blk_io_plug: today's post-8.0.0 QEMU. > - blk_io_plug-thread-local: this patch series. > - no-blk_io_plug: what happens when we simply remove plugging? > - call-after-dispatch: what if we integrate plugging into the event loop? I > decided against this approach in the end because it's more likely to > introduce performance regressions since I/O submission is deferred until the > end of the event loop iteration.
> > Aside from the no-blk_io_plug case, which bottlenecks much earlier than the > others, we see that all plugging approaches are more or less equivalent in > this > benchmark. It is also clear that QEMU 8.0.0 has lower performance than 7.2.0. > > The Ansible playbook, fio results, and a Jupyter notebook are available here: > https://github.com/stefanha/qemu-perf/tree/remove-blk_io_plug > > Stefan Hajnoczi (6): > block: add blk_io_plug_call() API > block/nvme: convert to blk_io_plug_call() API > block/blkio: convert to blk_io_plug_call() API > block/io_uring: convert to blk_io_plug_call() API > block/linux-aio: convert to blk_io_plug_call() API > block: remove bdrv_co_io_plug() API > > MAINTAINERS | 1 + > include/block/block-io.h | 3 - > include/block/block_int-common.h | 11 --- > include/block/raw-aio.h | 14 --- > include/sysemu/block-backend-io.h | 13 +-- > block/blkio.c | 43 > block/block-backend.c | 22 - > block/file-posix.c| 38 --- > block/io.c| 37 --- > block/io_uring.c | 44 - > block/linux-aio.c | 41 +++- > block/nvme.c | 44 +++-- > block/plug.c | 159 ++ > hw/block/dataplane/xen-block.c| 8 +- > hw/block/virtio-blk.c | 4 +- > hw/scsi/virtio-scsi.c | 6 +- > block/meson.build | 1 + > block/trace-events| 6 +- > 18 files changed, 239 insertions(+), 256 deletions(-) > create mode 100644 block/plug.c > > -- > 2.40.1 > Thanks, applied to my block tree: https://gitlab.com/stefanha/qemu/commits/block Stefan
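The plugging behavior described in this cover letter can be sketched in a few lines of plain C. This is an illustrative model, not the actual block/plug.c code: it uses a fixed-size array and omits the deduplication of repeated fn/opaque pairs that lets each driver's unplug function run once per batch. It does capture the semantics the series relies on: calls are deferred while the thread is plugged, flushed once by the outermost unplug, and run immediately when not plugged.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLS 16

typedef void (*UnplugFn)(void *opaque);

/* All state is thread-local, mirroring the per-thread design of the series */
static _Thread_local unsigned plug_depth;
static _Thread_local struct {
    UnplugFn fn;
    void *opaque;
} pending[MAX_CALLS];
static _Thread_local size_t npending;

static void io_plug(void)
{
    plug_depth++;
}

static void io_plug_call(UnplugFn fn, void *opaque)
{
    if (plug_depth == 0) {
        fn(opaque);               /* not plugged: submit immediately */
        return;
    }
    assert(npending < MAX_CALLS); /* the real code grows its list instead */
    pending[npending].fn = fn;
    pending[npending].opaque = opaque;
    npending++;
}

static void io_unplug(void)
{
    assert(plug_depth > 0);
    if (--plug_depth > 0) {
        return;                   /* nested: flush at the outermost unplug */
    }
    for (size_t i = 0; i < npending; i++) {
        pending[i].fn(pending[i].opaque);
    }
    npending = 0;
}

/* Demo callback standing in for a driver's unplug_fn */
static void count_submit(void *opaque)
{
    (*(int *)opaque)++;
}
```

Because nothing here is keyed on a BlockDriverState, the same mechanism works no matter how many queues or nodes a thread touches, which is the multi-queue-friendliness the series is after.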
[PULL 0/8] Block patches
The following changes since commit c6a5fc2ac76c5ab709896ee1b0edd33685a67ed1: decodetree: Add --output-null for meson testing (2023-05-31 19:56:42 -0700) are available in the Git repository at: https://gitlab.com/stefanha/qemu.git tags/block-pull-request for you to fetch changes up to 98b126f5e3228a346c774e569e26689943b401dd: qapi: add '@fdset' feature for BlockdevOptionsVirtioBlkVhostVdpa (2023-06-01 11:08:21 -0400) Pull request - Stefano Garzarella's blkio block driver 'fd' parameter - My thread-local blk_io_plug() series ---- Stefan Hajnoczi (6): block: add blk_io_plug_call() API block/nvme: convert to blk_io_plug_call() API block/blkio: convert to blk_io_plug_call() API block/io_uring: convert to blk_io_plug_call() API block/linux-aio: convert to blk_io_plug_call() API block: remove bdrv_co_io_plug() API Stefano Garzarella (2): block/blkio: use qemu_open() to support fd passing for virtio-blk qapi: add '@fdset' feature for BlockdevOptionsVirtioBlkVhostVdpa MAINTAINERS | 1 + qapi/block-core.json | 6 ++ meson.build | 4 + include/block/block-io.h | 3 - include/block/block_int-common.h | 11 --- include/block/raw-aio.h | 14 --- include/sysemu/block-backend-io.h | 13 +-- block/blkio.c | 96 -- block/block-backend.c | 22 - block/file-posix.c| 38 --- block/io.c| 37 --- block/io_uring.c | 44 - block/linux-aio.c | 41 +++- block/nvme.c | 44 +++-- block/plug.c | 159 ++ hw/block/dataplane/xen-block.c| 8 +- hw/block/virtio-blk.c | 4 +- hw/scsi/virtio-scsi.c | 6 +- block/meson.build | 1 + block/trace-events| 6 +- 20 files changed, 293 insertions(+), 265 deletions(-) create mode 100644 block/plug.c -- 2.40.1
[PULL 4/8] block/io_uring: convert to blk_io_plug_call() API
Stop using the .bdrv_co_io_plug() API because it is not multi-queue block layer friendly. Use the new blk_io_plug_call() API to batch I/O submission instead. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Reviewed-by: Stefano Garzarella Acked-by: Kevin Wolf Message-id: 20230530180959.1108766-5-stefa...@redhat.com Signed-off-by: Stefan Hajnoczi --- include/block/raw-aio.h | 7 --- block/file-posix.c | 10 -- block/io_uring.c| 44 - block/trace-events | 5 ++--- 4 files changed, 19 insertions(+), 47 deletions(-) diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h index 0fe85ade77..da60ca13ef 100644 --- a/include/block/raw-aio.h +++ b/include/block/raw-aio.h @@ -81,13 +81,6 @@ int coroutine_fn luring_co_submit(BlockDriverState *bs, int fd, uint64_t offset, QEMUIOVector *qiov, int type); void luring_detach_aio_context(LuringState *s, AioContext *old_context); void luring_attach_aio_context(LuringState *s, AioContext *new_context); - -/* - * luring_io_plug/unplug work in the thread's current AioContext, therefore the - * caller must ensure that they are paired in the same IOThread. 
- */ -void luring_io_plug(void); -void luring_io_unplug(void); #endif #ifdef _WIN32 diff --git a/block/file-posix.c b/block/file-posix.c index 0ab158efba..7baa8491dd 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2558,11 +2558,6 @@ static void coroutine_fn raw_co_io_plug(BlockDriverState *bs) laio_io_plug(); } #endif -#ifdef CONFIG_LINUX_IO_URING -if (s->use_linux_io_uring) { -luring_io_plug(); -} -#endif } static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs) @@ -2573,11 +2568,6 @@ static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs) laio_io_unplug(s->aio_max_batch); } #endif -#ifdef CONFIG_LINUX_IO_URING -if (s->use_linux_io_uring) { -luring_io_unplug(); -} -#endif } static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs) diff --git a/block/io_uring.c b/block/io_uring.c index 3a77480e16..69d9820928 100644 --- a/block/io_uring.c +++ b/block/io_uring.c @@ -16,6 +16,7 @@ #include "block/raw-aio.h" #include "qemu/coroutine.h" #include "qapi/error.h" +#include "sysemu/block-backend.h" #include "trace.h" /* Only used for assertions. 
*/ @@ -41,7 +42,6 @@ typedef struct LuringAIOCB { } LuringAIOCB; typedef struct LuringQueue { -int plugged; unsigned int in_queue; unsigned int in_flight; bool blocked; @@ -267,7 +267,7 @@ static void luring_process_completions_and_submit(LuringState *s) { luring_process_completions(s); -if (!s->io_q.plugged && s->io_q.in_queue > 0) { +if (s->io_q.in_queue > 0) { ioq_submit(s); } } @@ -301,29 +301,17 @@ static void qemu_luring_poll_ready(void *opaque) static void ioq_init(LuringQueue *io_q) { QSIMPLEQ_INIT(&io_q->submit_queue); -io_q->plugged = 0; io_q->in_queue = 0; io_q->in_flight = 0; io_q->blocked = false; } -void luring_io_plug(void) +static void luring_unplug_fn(void *opaque) { -AioContext *ctx = qemu_get_current_aio_context(); -LuringState *s = aio_get_linux_io_uring(ctx); -trace_luring_io_plug(s); -s->io_q.plugged++; -} - -void luring_io_unplug(void) -{ -AioContext *ctx = qemu_get_current_aio_context(); -LuringState *s = aio_get_linux_io_uring(ctx); -assert(s->io_q.plugged); -trace_luring_io_unplug(s, s->io_q.blocked, s->io_q.plugged, - s->io_q.in_queue, s->io_q.in_flight); -if (--s->io_q.plugged == 0 && -!s->io_q.blocked && s->io_q.in_queue > 0) { +LuringState *s = opaque; +trace_luring_unplug_fn(s, s->io_q.blocked, s->io_q.in_queue, + s->io_q.in_flight); +if (!s->io_q.blocked && s->io_q.in_queue > 0) { ioq_submit(s); } } @@ -370,14 +358,16 @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s, QSIMPLEQ_INSERT_TAIL(&s->io_q.submit_queue, luringcb, next); s->io_q.in_queue++; -trace_luring_do_submit(s, s->io_q.blocked, s->io_q.plugged, - s->io_q.in_queue, s->io_q.in_flight); -if (!s->io_q.blocked && -(!s->io_q.plugged || - s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES)) { -ret = ioq_submit(s); -trace_luring_do_submit_done(s, ret); -return ret; +trace_luring_do_submit(s, s->io_q.blocked, s->io_q.in_queue, + s->io_q.in_flight); +if (!s->io_q.blocked) { +if (s->io_q.in_flight + s->io_q.in_queue >= MAX_ENTRIES) { +ret = ioq_submit(s)
[PULL 3/8] block/blkio: convert to blk_io_plug_call() API
Stop using the .bdrv_co_io_plug() API because it is not multi-queue block layer friendly. Use the new blk_io_plug_call() API to batch I/O submission instead. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Reviewed-by: Stefano Garzarella Acked-by: Kevin Wolf Message-id: 20230530180959.1108766-4-stefa...@redhat.com Signed-off-by: Stefan Hajnoczi --- block/blkio.c | 43 --- 1 file changed, 24 insertions(+), 19 deletions(-) diff --git a/block/blkio.c b/block/blkio.c index 72117fa005..11be8787a3 100644 --- a/block/blkio.c +++ b/block/blkio.c @@ -17,6 +17,7 @@ #include "qemu/error-report.h" #include "qapi/qmp/qdict.h" #include "qemu/module.h" +#include "sysemu/block-backend.h" #include "exec/memory.h" /* for ram_block_discard_disable() */ #include "block/block-io.h" @@ -320,16 +321,30 @@ static void blkio_detach_aio_context(BlockDriverState *bs) NULL, NULL, NULL); } -/* Call with s->blkio_lock held to submit I/O after enqueuing a new request */ -static void blkio_submit_io(BlockDriverState *bs) +/* + * Called by blk_io_unplug() or immediately if not plugged. Called without + * blkio_lock. + */ +static void blkio_unplug_fn(void *opaque) { -if (qatomic_read(&bs->io_plugged) == 0) { -BDRVBlkioState *s = bs->opaque; +BDRVBlkioState *s = opaque; +WITH_QEMU_LOCK_GUARD(&s->blkio_lock) { blkioq_do_io(s->blkioq, NULL, 0, 0, NULL); } } +/* + * Schedule I/O submission after enqueuing a new request. Called without + * blkio_lock. 
+ */ +static void blkio_submit_io(BlockDriverState *bs) +{ +BDRVBlkioState *s = bs->opaque; + +blk_io_plug_call(blkio_unplug_fn, s); +} + static int coroutine_fn blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes) { @@ -340,9 +355,9 @@ blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes) WITH_QEMU_LOCK_GUARD(&s->blkio_lock) { blkioq_discard(s->blkioq, offset, bytes, &cod, 0); -blkio_submit_io(bs); } +blkio_submit_io(bs); qemu_coroutine_yield(); return cod.ret; } @@ -373,9 +388,9 @@ blkio_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes, WITH_QEMU_LOCK_GUARD(&s->blkio_lock) { blkioq_readv(s->blkioq, offset, iov, iovcnt, &cod, 0); -blkio_submit_io(bs); } +blkio_submit_io(bs); qemu_coroutine_yield(); if (use_bounce_buffer) { @@ -418,9 +433,9 @@ static int coroutine_fn blkio_co_pwritev(BlockDriverState *bs, int64_t offset, WITH_QEMU_LOCK_GUARD(&s->blkio_lock) { blkioq_writev(s->blkioq, offset, iov, iovcnt, &cod, blkio_flags); -blkio_submit_io(bs); } +blkio_submit_io(bs); qemu_coroutine_yield(); if (use_bounce_buffer) { @@ -439,9 +454,9 @@ static int coroutine_fn blkio_co_flush(BlockDriverState *bs) WITH_QEMU_LOCK_GUARD(&s->blkio_lock) { blkioq_flush(s->blkioq, &cod, 0); -blkio_submit_io(bs); } +blkio_submit_io(bs); qemu_coroutine_yield(); return cod.ret; } @@ -467,22 +482,13 @@ static int coroutine_fn blkio_co_pwrite_zeroes(BlockDriverState *bs, WITH_QEMU_LOCK_GUARD(&s->blkio_lock) { blkioq_write_zeroes(s->blkioq, offset, bytes, &cod, blkio_flags); -blkio_submit_io(bs); } +blkio_submit_io(bs); qemu_coroutine_yield(); return cod.ret; } -static void coroutine_fn blkio_co_io_unplug(BlockDriverState *bs) -{ -BDRVBlkioState *s = bs->opaque; - -WITH_QEMU_LOCK_GUARD(&s->blkio_lock) { -blkio_submit_io(bs); -} -} - typedef enum { BMRR_OK, BMRR_SKIP, @@ -1004,7 +1010,6 @@ static void blkio_refresh_limits(BlockDriverState *bs, Error **errp) .bdrv_co_pwritev = blkio_co_pwritev, \ .bdrv_co_flush_to_disk = blkio_co_flush, \ 
.bdrv_co_pwrite_zeroes = blkio_co_pwrite_zeroes, \ -.bdrv_co_io_unplug = blkio_co_io_unplug, \ .bdrv_refresh_limits = blkio_refresh_limits, \ .bdrv_register_buf = blkio_register_buf, \ .bdrv_unregister_buf = blkio_unregister_buf, \ -- 2.40.1
[PULL 7/8] block/blkio: use qemu_open() to support fd passing for virtio-blk
From: Stefano Garzarella Some virtio-blk drivers (e.g. virtio-blk-vhost-vdpa) supports the fd passing. Let's expose this to the user, so the management layer can pass the file descriptor of an already opened path. If the libblkio virtio-blk driver supports fd passing, let's always use qemu_open() to open the `path`, so we can handle fd passing from the management layer through the "/dev/fdset/N" special path. Reviewed-by: Stefan Hajnoczi Signed-off-by: Stefano Garzarella Message-id: 20230530071941.8954-2-sgarz...@redhat.com Signed-off-by: Stefan Hajnoczi --- block/blkio.c | 53 ++- 1 file changed, 44 insertions(+), 9 deletions(-) diff --git a/block/blkio.c b/block/blkio.c index 11be8787a3..527323d625 100644 --- a/block/blkio.c +++ b/block/blkio.c @@ -673,25 +673,60 @@ static int blkio_virtio_blk_common_open(BlockDriverState *bs, { const char *path = qdict_get_try_str(options, "path"); BDRVBlkioState *s = bs->opaque; -int ret; +bool fd_supported = false; +int fd, ret; if (!path) { error_setg(errp, "missing 'path' option"); return -EINVAL; } -ret = blkio_set_str(s->blkio, "path", path); -qdict_del(options, "path"); -if (ret < 0) { -error_setg_errno(errp, -ret, "failed to set path: %s", - blkio_get_error_msg()); -return ret; -} - if (!(flags & BDRV_O_NOCACHE)) { error_setg(errp, "cache.direct=off is not supported"); return -EINVAL; } + +if (blkio_get_int(s->blkio, "fd", &fd) == 0) { +fd_supported = true; +} + +/* + * If the libblkio driver supports fd passing, let's always use qemu_open() + * to open the `path`, so we can handle fd passing from the management + * layer through the "/dev/fdset/N" special path. 
+ */ +if (fd_supported) { +int open_flags; + +if (flags & BDRV_O_RDWR) { +open_flags = O_RDWR; +} else { +open_flags = O_RDONLY; +} + +fd = qemu_open(path, open_flags, errp); +if (fd < 0) { +return -EINVAL; +} + +ret = blkio_set_int(s->blkio, "fd", fd); +if (ret < 0) { +error_setg_errno(errp, -ret, "failed to set fd: %s", + blkio_get_error_msg()); +qemu_close(fd); +return ret; +} +} else { +ret = blkio_set_str(s->blkio, "path", path); +if (ret < 0) { +error_setg_errno(errp, -ret, "failed to set path: %s", + blkio_get_error_msg()); +return ret; +} +} + +qdict_del(options, "path"); + return 0; } -- 2.40.1
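For context, a plausible management-layer flow for the fd passing added above might look like the following QMP exchange. This is a hedged illustration, not taken from the patch: it assumes QMP's add-fd command (with the file descriptor sent over the QMP UNIX socket via SCM_RIGHTS) and the virtio-blk-vhost-vdpa blockdev driver, and option spellings should be checked against the QEMU version in use.

```
{ "execute": "add-fd",
  "arguments": { "fdset-id": 1, "opaque": "vhost-vdpa chardev" } }

{ "execute": "blockdev-add",
  "arguments": {
    "driver": "virtio-blk-vhost-vdpa",
    "node-name": "vdpa0",
    "path": "/dev/fdset/1",
    "cache": { "direct": true }
  } }
```

Because qemu_open() resolves the "/dev/fdset/N" special path to an fd from the fdset, the privileged open happens in the management layer while QEMU itself never needs access to the underlying character device node.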
[PULL 2/8] block/nvme: convert to blk_io_plug_call() API
Stop using the .bdrv_co_io_plug() API because it is not multi-queue block layer friendly. Use the new blk_io_plug_call() API to batch I/O submission instead. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Reviewed-by: Stefano Garzarella Acked-by: Kevin Wolf Message-id: 20230530180959.1108766-3-stefa...@redhat.com Signed-off-by: Stefan Hajnoczi --- block/nvme.c | 44 block/trace-events | 1 - 2 files changed, 12 insertions(+), 33 deletions(-) diff --git a/block/nvme.c b/block/nvme.c index 17937d398d..7ca85bc44a 100644 --- a/block/nvme.c +++ b/block/nvme.c @@ -25,6 +25,7 @@ #include "qemu/vfio-helpers.h" #include "block/block-io.h" #include "block/block_int.h" +#include "sysemu/block-backend.h" #include "sysemu/replay.h" #include "trace.h" @@ -119,7 +120,6 @@ struct BDRVNVMeState { int blkshift; uint64_t max_transfer; -bool plugged; bool supports_write_zeroes; bool supports_discard; @@ -282,7 +282,7 @@ static void nvme_kick(NVMeQueuePair *q) { BDRVNVMeState *s = q->s; -if (s->plugged || !q->need_kick) { +if (!q->need_kick) { return; } trace_nvme_kick(s, q->index); @@ -387,10 +387,6 @@ static bool nvme_process_completion(NVMeQueuePair *q) NvmeCqe *c; trace_nvme_process_completion(s, q->index, q->inflight); -if (s->plugged) { -trace_nvme_process_completion_queue_plugged(s, q->index); -return false; -} /* * Support re-entrancy when a request cb() function invokes aio_poll(). 
@@ -480,6 +476,15 @@ static void nvme_trace_command(const NvmeCmd *cmd) } } +static void nvme_unplug_fn(void *opaque) +{ +NVMeQueuePair *q = opaque; + +QEMU_LOCK_GUARD(&q->lock); +nvme_kick(q); +nvme_process_completion(q); +} + static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req, NvmeCmd *cmd, BlockCompletionFunc cb, void *opaque) @@ -496,8 +501,7 @@ static void nvme_submit_command(NVMeQueuePair *q, NVMeRequest *req, q->sq.tail * NVME_SQ_ENTRY_BYTES, cmd, sizeof(*cmd)); q->sq.tail = (q->sq.tail + 1) % NVME_QUEUE_SIZE; q->need_kick++; -nvme_kick(q); -nvme_process_completion(q); +blk_io_plug_call(nvme_unplug_fn, q); qemu_mutex_unlock(&q->lock); } @@ -1567,27 +1571,6 @@ static void nvme_attach_aio_context(BlockDriverState *bs, } } -static void coroutine_fn nvme_co_io_plug(BlockDriverState *bs) -{ -BDRVNVMeState *s = bs->opaque; -assert(!s->plugged); -s->plugged = true; -} - -static void coroutine_fn nvme_co_io_unplug(BlockDriverState *bs) -{ -BDRVNVMeState *s = bs->opaque; -assert(s->plugged); -s->plugged = false; -for (unsigned i = INDEX_IO(0); i < s->queue_count; i++) { -NVMeQueuePair *q = s->queues[i]; -qemu_mutex_lock(&q->lock); -nvme_kick(q); -nvme_process_completion(q); -qemu_mutex_unlock(&q->lock); -} -} - static bool nvme_register_buf(BlockDriverState *bs, void *host, size_t size, Error **errp) { @@ -1664,9 +1647,6 @@ static BlockDriver bdrv_nvme = { .bdrv_detach_aio_context = nvme_detach_aio_context, .bdrv_attach_aio_context = nvme_attach_aio_context, -.bdrv_co_io_plug = nvme_co_io_plug, -.bdrv_co_io_unplug= nvme_co_io_unplug, - .bdrv_register_buf= nvme_register_buf, .bdrv_unregister_buf = nvme_unregister_buf, }; diff --git a/block/trace-events b/block/trace-events index 32665158d6..048ad27519 100644 --- a/block/trace-events +++ b/block/trace-events @@ -141,7 +141,6 @@ nvme_kick(void *s, unsigned q_index) "s %p q #%u" nvme_dma_flush_queue_wait(void *s) "s %p" nvme_error(int cmd_specific, int sq_head, int sqid, int cid, int status) "cmd_specific 
%d sq_head %d sqid %d cid %d status 0x%x" nvme_process_completion(void *s, unsigned q_index, int inflight) "s %p q #%u inflight %d" -nvme_process_completion_queue_plugged(void *s, unsigned q_index) "s %p q #%u" nvme_complete_command(void *s, unsigned q_index, int cid) "s %p q #%u cid %d" nvme_submit_command(void *s, unsigned q_index, int cid) "s %p q #%u cid %d" nvme_submit_command_raw(int c0, int c1, int c2, int c3, int c4, int c5, int c6, int c7) "%02x %02x %02x %02x %02x %02x %02x %02x" -- 2.40.1
[PULL 6/8] block: remove bdrv_co_io_plug() API
No block driver implements .bdrv_co_io_plug() anymore. Get rid of the function pointers. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Reviewed-by: Stefano Garzarella Acked-by: Kevin Wolf Message-id: 20230530180959.1108766-7-stefa...@redhat.com Signed-off-by: Stefan Hajnoczi --- include/block/block-io.h | 3 --- include/block/block_int-common.h | 11 -- block/io.c | 37 3 files changed, 51 deletions(-) diff --git a/include/block/block-io.h b/include/block/block-io.h index a27e471a87..43af816d75 100644 --- a/include/block/block-io.h +++ b/include/block/block-io.h @@ -259,9 +259,6 @@ void coroutine_fn bdrv_co_leave(BlockDriverState *bs, AioContext *old_ctx); AioContext *child_of_bds_get_parent_aio_context(BdrvChild *c); -void coroutine_fn GRAPH_RDLOCK bdrv_co_io_plug(BlockDriverState *bs); -void coroutine_fn GRAPH_RDLOCK bdrv_co_io_unplug(BlockDriverState *bs); - bool coroutine_fn GRAPH_RDLOCK bdrv_co_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name, uint32_t granularity, Error **errp); diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h index b1cbc1e00c..74195c3004 100644 --- a/include/block/block_int-common.h +++ b/include/block/block_int-common.h @@ -768,11 +768,6 @@ struct BlockDriver { void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_debug_event)( BlockDriverState *bs, BlkdebugEvent event); -/* io queue for linux-aio */ -void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_plug)(BlockDriverState *bs); -void coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_io_unplug)( -BlockDriverState *bs); - bool (*bdrv_supports_persistent_dirty_bitmap)(BlockDriverState *bs); bool coroutine_fn GRAPH_RDLOCK_PTR (*bdrv_co_can_store_new_dirty_bitmap)( @@ -1227,12 +1222,6 @@ struct BlockDriverState { unsigned int in_flight; unsigned int serialising_in_flight; -/* - * counter for nested bdrv_io_plug. - * Accessed with atomic ops. - */ -unsigned io_plugged; - /* do we need to tell the quest if we have a volatile write cache? 
*/ int enable_write_cache; diff --git a/block/io.c b/block/io.c index 540bf8d26d..f2dfc7c405 100644 --- a/block/io.c +++ b/block/io.c @@ -3223,43 +3223,6 @@ void *qemu_try_blockalign0(BlockDriverState *bs, size_t size) return mem; } -void coroutine_fn bdrv_co_io_plug(BlockDriverState *bs) -{ -BdrvChild *child; -IO_CODE(); -assert_bdrv_graph_readable(); - -QLIST_FOREACH(child, &bs->children, next) { -bdrv_co_io_plug(child->bs); -} - -if (qatomic_fetch_inc(&bs->io_plugged) == 0) { -BlockDriver *drv = bs->drv; -if (drv && drv->bdrv_co_io_plug) { -drv->bdrv_co_io_plug(bs); -} -} -} - -void coroutine_fn bdrv_co_io_unplug(BlockDriverState *bs) -{ -BdrvChild *child; -IO_CODE(); -assert_bdrv_graph_readable(); - -assert(bs->io_plugged); -if (qatomic_fetch_dec(&bs->io_plugged) == 1) { -BlockDriver *drv = bs->drv; -if (drv && drv->bdrv_co_io_unplug) { -drv->bdrv_co_io_unplug(bs); -} -} - -QLIST_FOREACH(child, &bs->children, next) { -bdrv_co_io_unplug(child->bs); -} -} - /* Helper that undoes bdrv_register_buf() when it fails partway through */ static void GRAPH_RDLOCK bdrv_register_buf_rollback(BlockDriverState *bs, void *host, size_t size, -- 2.40.1
[PULL 5/8] block/linux-aio: convert to blk_io_plug_call() API
Stop using the .bdrv_co_io_plug() API because it is not multi-queue block layer friendly. Use the new blk_io_plug_call() API to batch I/O submission instead. Note that a dev_max_batch check is dropped in laio_io_unplug() because the semantics of unplug_fn() are different from .bdrv_co_unplug(): 1. unplug_fn() is only called when the last blk_io_unplug() call occurs, not every time blk_io_unplug() is called. 2. unplug_fn() is per-thread, not per-BlockDriverState, so there is no way to get per-BlockDriverState fields like dev_max_batch. Therefore this condition cannot be moved to laio_unplug_fn(). It is not obvious that this condition affects performance in practice, so I am removing it instead of trying to come up with a more complex mechanism to preserve the condition. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Acked-by: Kevin Wolf Reviewed-by: Stefano Garzarella Message-id: 20230530180959.1108766-6-stefa...@redhat.com Signed-off-by: Stefan Hajnoczi --- include/block/raw-aio.h | 7 --- block/file-posix.c | 28 block/linux-aio.c | 41 +++-- 3 files changed, 11 insertions(+), 65 deletions(-) diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h index da60ca13ef..0f63c2800c 100644 --- a/include/block/raw-aio.h +++ b/include/block/raw-aio.h @@ -62,13 +62,6 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov, void laio_detach_aio_context(LinuxAioState *s, AioContext *old_context); void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context); - -/* - * laio_io_plug/unplug work in the thread's current AioContext, therefore the - * caller must ensure that they are paired in the same IOThread. 
- */ -void laio_io_plug(void); -void laio_io_unplug(uint64_t dev_max_batch); #endif /* io_uring.c - Linux io_uring implementation */ #ifdef CONFIG_LINUX_IO_URING diff --git a/block/file-posix.c b/block/file-posix.c index 7baa8491dd..ac1ed54811 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2550,26 +2550,6 @@ static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset, return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE); } -static void coroutine_fn raw_co_io_plug(BlockDriverState *bs) -{ -BDRVRawState __attribute__((unused)) *s = bs->opaque; -#ifdef CONFIG_LINUX_AIO -if (s->use_linux_aio) { -laio_io_plug(); -} -#endif -} - -static void coroutine_fn raw_co_io_unplug(BlockDriverState *bs) -{ -BDRVRawState __attribute__((unused)) *s = bs->opaque; -#ifdef CONFIG_LINUX_AIO -if (s->use_linux_aio) { -laio_io_unplug(s->aio_max_batch); -} -#endif -} - static int coroutine_fn raw_co_flush_to_disk(BlockDriverState *bs) { BDRVRawState *s = bs->opaque; @@ -3914,8 +3894,6 @@ BlockDriver bdrv_file = { .bdrv_co_copy_range_from = raw_co_copy_range_from, .bdrv_co_copy_range_to = raw_co_copy_range_to, .bdrv_refresh_limits = raw_refresh_limits, -.bdrv_co_io_plug= raw_co_io_plug, -.bdrv_co_io_unplug = raw_co_io_unplug, .bdrv_attach_aio_context = raw_aio_attach_aio_context, .bdrv_co_truncate = raw_co_truncate, @@ -4286,8 +4264,6 @@ static BlockDriver bdrv_host_device = { .bdrv_co_copy_range_from = raw_co_copy_range_from, .bdrv_co_copy_range_to = raw_co_copy_range_to, .bdrv_refresh_limits = raw_refresh_limits, -.bdrv_co_io_plug= raw_co_io_plug, -.bdrv_co_io_unplug = raw_co_io_unplug, .bdrv_attach_aio_context = raw_aio_attach_aio_context, .bdrv_co_truncate = raw_co_truncate, @@ -4424,8 +4400,6 @@ static BlockDriver bdrv_host_cdrom = { .bdrv_co_pwritev= raw_co_pwritev, .bdrv_co_flush_to_disk = raw_co_flush_to_disk, .bdrv_refresh_limits= cdrom_refresh_limits, -.bdrv_co_io_plug= raw_co_io_plug, -.bdrv_co_io_unplug = raw_co_io_unplug, 
.bdrv_attach_aio_context = raw_aio_attach_aio_context, .bdrv_co_truncate = raw_co_truncate, @@ -4552,8 +4526,6 @@ static BlockDriver bdrv_host_cdrom = { .bdrv_co_pwritev= raw_co_pwritev, .bdrv_co_flush_to_disk = raw_co_flush_to_disk, .bdrv_refresh_limits= cdrom_refresh_limits, -.bdrv_co_io_plug= raw_co_io_plug, -.bdrv_co_io_unplug = raw_co_io_unplug, .bdrv_attach_aio_context = raw_aio_attach_aio_context, .bdrv_co_truncate = raw_co_truncate, diff --git a/block/linux-aio.c b/block/linux-aio.c index 916f001e32..561c71a9ae 100644 --- a/block/linux-aio.c +++ b/block/linux-aio.c @@ -15,6 +15,7 @@ #include "qemu/event_notifier.h" #include "qemu/coroutine.h" #include "qapi/error.h" +#include "sysemu/block-backend.h" /* Only used for assertions. */ #include "qemu/coroutine_int.h" @@ -46,7 +47,6 @@ struct qemu_laio
[PULL 8/8] qapi: add '@fdset' feature for BlockdevOptionsVirtioBlkVhostVdpa
From: Stefano Garzarella The virtio-blk-vhost-vdpa driver in libblkio 1.3.0 supports the fd passing through the new 'fd' property. Since now we are using qemu_open() on '@path' if the virtio-blk driver supports the fd passing, let's announce it. In this way, the management layer can pass the file descriptor of an already opened vhost-vdpa character device. This is useful especially when the device can only be accessed with certain privileges. Add the '@fdset' feature only when the virtio-blk-vhost-vdpa driver in libblkio supports it. Suggested-by: Markus Armbruster Reviewed-by: Stefan Hajnoczi Signed-off-by: Stefano Garzarella Message-id: 20230530071941.8954-3-sgarz...@redhat.com Signed-off-by: Stefan Hajnoczi --- qapi/block-core.json | 6 ++ meson.build | 4 2 files changed, 10 insertions(+) diff --git a/qapi/block-core.json b/qapi/block-core.json index 98d9116dae..4bf89171c6 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -3955,10 +3955,16 @@ # # @path: path to the vhost-vdpa character device. # +# Features: +# @fdset: Member @path supports the special "/dev/fdset/N" path +# (since 8.1) +# # Since: 7.2 ## { 'struct': 'BlockdevOptionsVirtioBlkVhostVdpa', 'data': { 'path': 'str' }, + 'features': [ { 'name' :'fdset', + 'if': 'CONFIG_BLKIO_VHOST_VDPA_FD' } ], 'if': 'CONFIG_BLKIO' } ## diff --git a/meson.build b/meson.build index bc76ea96bf..a61d3e9b06 100644 --- a/meson.build +++ b/meson.build @@ -2106,6 +2106,10 @@ config_host_data.set('CONFIG_LZO', lzo.found()) config_host_data.set('CONFIG_MPATH', mpathpersist.found()) config_host_data.set('CONFIG_MPATH_NEW_API', mpathpersist_new_api) config_host_data.set('CONFIG_BLKIO', blkio.found()) +if blkio.found() + config_host_data.set('CONFIG_BLKIO_VHOST_VDPA_FD', + blkio.version().version_compare('>=1.3.0')) +endif config_host_data.set('CONFIG_CURL', curl.found()) config_host_data.set('CONFIG_CURSES', curses.found()) config_host_data.set('CONFIG_GBM', gbm.found()) -- 2.40.1
[PULL 1/8] block: add blk_io_plug_call() API
Introduce a new API for thread-local blk_io_plug() that does not traverse the block graph. The goal is to make blk_io_plug() multi-queue friendly. Instead of having block drivers track whether or not we're in a plugged section, provide an API that allows them to defer a function call until we're unplugged: blk_io_plug_call(fn, opaque). If blk_io_plug_call() is called multiple times with the same fn/opaque pair, then fn() is only called once at the end of the function - resulting in batching. This patch introduces the API and changes blk_io_plug()/blk_io_unplug(). blk_io_plug()/blk_io_unplug() no longer require a BlockBackend argument because the plug state is now thread-local. Later patches convert block drivers to blk_io_plug_call() and then we can finally remove .bdrv_co_io_plug() once all block drivers have been converted. Signed-off-by: Stefan Hajnoczi Reviewed-by: Eric Blake Reviewed-by: Stefano Garzarella Acked-by: Kevin Wolf Message-id: 20230530180959.1108766-2-stefa...@redhat.com Signed-off-by: Stefan Hajnoczi --- MAINTAINERS | 1 + include/sysemu/block-backend-io.h | 13 +-- block/block-backend.c | 22 - block/plug.c | 159 ++ hw/block/dataplane/xen-block.c| 8 +- hw/block/virtio-blk.c | 4 +- hw/scsi/virtio-scsi.c | 6 +- block/meson.build | 1 + 8 files changed, 173 insertions(+), 41 deletions(-) create mode 100644 block/plug.c diff --git a/MAINTAINERS b/MAINTAINERS index 4b025a7b63..89f274f85e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2650,6 +2650,7 @@ F: util/aio-*.c F: util/aio-*.h F: util/fdmon-*.c F: block/io.c +F: block/plug.c F: migration/block* F: include/block/aio.h F: include/block/aio-wait.h diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h index d62a7ee773..be4dcef59d 100644 --- a/include/sysemu/block-backend-io.h +++ b/include/sysemu/block-backend-io.h @@ -100,16 +100,9 @@ void blk_iostatus_set_err(BlockBackend *blk, int error); int blk_get_max_iov(BlockBackend *blk); int blk_get_max_hw_iov(BlockBackend *blk); 
-/* - * blk_io_plug/unplug are thread-local operations. This means that multiple - * IOThreads can simultaneously call plug/unplug, but the caller must ensure - * that each unplug() is called in the same IOThread of the matching plug(). - */ -void coroutine_fn blk_co_io_plug(BlockBackend *blk); -void co_wrapper blk_io_plug(BlockBackend *blk); - -void coroutine_fn blk_co_io_unplug(BlockBackend *blk); -void co_wrapper blk_io_unplug(BlockBackend *blk); +void blk_io_plug(void); +void blk_io_unplug(void); +void blk_io_plug_call(void (*fn)(void *), void *opaque); AioContext *blk_get_aio_context(BlockBackend *blk); BlockAcctStats *blk_get_stats(BlockBackend *blk); diff --git a/block/block-backend.c b/block/block-backend.c index 241f643507..4009ed5fed 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -2582,28 +2582,6 @@ void blk_add_insert_bs_notifier(BlockBackend *blk, Notifier *notify) notifier_list_add(&blk->insert_bs_notifiers, notify); } -void coroutine_fn blk_co_io_plug(BlockBackend *blk) -{ -BlockDriverState *bs = blk_bs(blk); -IO_CODE(); -GRAPH_RDLOCK_GUARD(); - -if (bs) { -bdrv_co_io_plug(bs); -} -} - -void coroutine_fn blk_co_io_unplug(BlockBackend *blk) -{ -BlockDriverState *bs = blk_bs(blk); -IO_CODE(); -GRAPH_RDLOCK_GUARD(); - -if (bs) { -bdrv_co_io_unplug(bs); -} -} - BlockAcctStats *blk_get_stats(BlockBackend *blk) { IO_CODE(); diff --git a/block/plug.c b/block/plug.c new file mode 100644 index 00..98a155d2f4 --- /dev/null +++ b/block/plug.c @@ -0,0 +1,159 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Block I/O plugging + * + * Copyright Red Hat. + * + * This API defers a function call within a blk_io_plug()/blk_io_unplug() + * section, allowing multiple calls to batch up. This is a performance + * optimization that is used in the block layer to submit several I/O requests + * at once instead of individually: + * + * blk_io_plug(); <-- start of plugged region + * ... 
+ * blk_io_plug_call(my_func, my_obj); <-- deferred my_func(my_obj) call + * blk_io_plug_call(my_func, my_obj); <-- another + * blk_io_plug_call(my_func, my_obj); <-- another + * ... + * blk_io_unplug(); <-- end of plugged region, my_func(my_obj) is called once + * + * This code is actually generic and not tied to the block layer. If another + * subsystem needs this functionality, it could be renamed. + */ + +#include "qemu/osdep.h" +#include "qemu/coroutine-tls.h" +#include "qemu/notify.h" +#include "qemu/thread.h" +#include "sysemu/block-backend.h" + +/* A function call that has been deferred until unplug() */ +typedef struct { +void (*fn)(void *); +void *opaque; +} UnplugFn; + +/* Per-
Re: [PATCH] xen-block: fix segv on unrealize
Sorry! Reviewed-by: Stefan Hajnoczi
Re: [RFC PATCH v3 08/78] hw/block: add fallthrough pseudo-keyword
On Fri, Oct 13, 2023 at 11:45:36AM +0300, Emmanouil Pitsidianakis wrote: > In preparation of raising -Wimplicit-fallthrough to 5, replace all > fall-through comments with the fallthrough attribute pseudo-keyword. > > Signed-off-by: Emmanouil Pitsidianakis > --- > hw/block/dataplane/xen-block.c | 4 ++-- > hw/block/m25p80.c | 2 +- > hw/block/onenand.c | 2 +- > hw/block/pflash_cfi01.c| 1 + > hw/block/pflash_cfi02.c| 6 -- > 5 files changed, 9 insertions(+), 6 deletions(-) Reviewed-by: Stefan Hajnoczi
Re: [PULL 0/7] xenfv-stable queue
Applied, thanks. Please update the changelog at https://wiki.qemu.org/ChangeLog/8.2 for any user-visible changes.
Re: [PULL 00/15] xenfv.for-upstream queue
Applied, thanks. Please update the changelog at https://wiki.qemu.org/ChangeLog/8.2 for any user-visible changes.
[PATCH] block: get rid of blk->guest_block_size
Commit 1b7fd729559c ("block: rename buffer_alignment to guest_block_size") noted: At this point, the field is set by the device emulation, but completely ignored by the block layer. The last time the value of buffer_alignment/guest_block_size was actually used was before commit 339064d50639 ("block: Don't use guest sector size for qemu_blockalign()"). This value has not been used since 2013. Get rid of it. Cc: Xie Yongji Signed-off-by: Stefan Hajnoczi --- include/sysemu/block-backend-io.h| 1 - block/block-backend.c| 10 -- block/export/vhost-user-blk-server.c | 1 - hw/block/virtio-blk.c| 1 - hw/block/xen-block.c | 1 - hw/ide/core.c| 1 - hw/scsi/scsi-disk.c | 1 - hw/scsi/scsi-generic.c | 1 - 8 files changed, 17 deletions(-) diff --git a/include/sysemu/block-backend-io.h b/include/sysemu/block-backend-io.h index 6517c39295..ccef514023 100644 --- a/include/sysemu/block-backend-io.h +++ b/include/sysemu/block-backend-io.h @@ -72,7 +72,6 @@ void blk_error_action(BlockBackend *blk, BlockErrorAction action, void blk_iostatus_set_err(BlockBackend *blk, int error); int blk_get_max_iov(BlockBackend *blk); int blk_get_max_hw_iov(BlockBackend *blk); -void blk_set_guest_block_size(BlockBackend *blk, int align); void blk_io_plug(BlockBackend *blk); void blk_io_unplug(BlockBackend *blk); diff --git a/block/block-backend.c b/block/block-backend.c index e0e1aff4b1..d4abdf8faa 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -56,9 +56,6 @@ struct BlockBackend { const BlockDevOps *dev_ops; void *dev_opaque; -/* the block size for which the guest device expects atomicity */ -int guest_block_size; - /* If the BDS tree is removed, some of its options are stored here (which * can be used to restore those options in the new BDS on insert) */ BlockBackendRootState root_state; @@ -998,7 +995,6 @@ void blk_detach_dev(BlockBackend *blk, DeviceState *dev) blk->dev = NULL; blk->dev_ops = NULL; blk->dev_opaque = NULL; -blk->guest_block_size = 512; blk_set_perm(blk, 0, 
BLK_PERM_ALL, &error_abort); blk_unref(blk); } @@ -2100,12 +2096,6 @@ int blk_get_max_iov(BlockBackend *blk) return blk->root->bs->bl.max_iov; } -void blk_set_guest_block_size(BlockBackend *blk, int align) -{ -IO_CODE(); -blk->guest_block_size = align; -} - void *blk_try_blockalign(BlockBackend *blk, size_t size) { IO_CODE(); diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c index a129204c44..b2e458ade3 100644 --- a/block/export/vhost-user-blk-server.c +++ b/block/export/vhost-user-blk-server.c @@ -495,7 +495,6 @@ static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts, return -EINVAL; } vexp->blk_size = logical_block_size; -blk_set_guest_block_size(exp->blk, logical_block_size); if (vu_opts->has_num_queues) { num_queues = vu_opts->num_queues; diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index cd804795c6..e9ba752f6b 100644 --- a/hw/block/virtio-blk.c +++ b/hw/block/virtio-blk.c @@ -1228,7 +1228,6 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp) s->change = qemu_add_vm_change_state_handler(virtio_blk_dma_restart_cb, s); blk_set_dev_ops(s->blk, &virtio_block_ops, s); -blk_set_guest_block_size(s->blk, s->conf.conf.logical_block_size); blk_iostatus_enable(s->blk); diff --git a/hw/block/xen-block.c b/hw/block/xen-block.c index 674953f1ad..345b284d70 100644 --- a/hw/block/xen-block.c +++ b/hw/block/xen-block.c @@ -243,7 +243,6 @@ static void xen_block_realize(XenDevice *xendev, Error **errp) } blk_set_dev_ops(blk, &xen_block_dev_ops, blockdev); -blk_set_guest_block_size(blk, conf->logical_block_size); if (conf->discard_granularity == -1) { conf->discard_granularity = conf->physical_block_size; diff --git a/hw/ide/core.c b/hw/ide/core.c index 3a5afff5d7..f7ec68513f 100644 --- a/hw/ide/core.c +++ b/hw/ide/core.c @@ -2544,7 +2544,6 @@ int ide_init_drive(IDEState *s, BlockBackend *blk, IDEDriveKind kind, s->smart_selftest_count = 0; if (kind == IDE_CD) { blk_set_dev_ops(blk, 
&ide_cd_block_ops, s); -blk_set_guest_block_size(blk, 2048); } else { if (!blk_is_inserted(s->blk)) { error_setg(errp, "Device needs media, but drive is empty"); diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c index 072686ed58..91acb5c0ce 100644 --- a/hw/scsi/scsi-disk.c +++ b/hw/scsi/scsi-disk.c @@ -2419,7 +2419,6 @@ static void scsi_realize(SCSIDevice *dev, Error **errp) } else { blk_set_dev_ops(s->qdev.conf.blk, &scsi_disk_block_ops, s); } -blk_set_guest_block
[PATCH v2 0/6] aio-posix: split poll check from ready handler
v2: - Cleaned up unused return values in nvme and virtio-blk [Stefano] - Documented try_poll_mode() ready_list argument [Stefano] - Unified virtio-blk/scsi dataplane and non-dataplane virtqueue handlers [Stefano] The first patch improves AioContext's adaptive polling execution time measurement. This can result in better performance because the algorithm makes better decisions about when to poll versus when to fall back to file descriptor monitoring. The remaining patches unify the virtio-blk and virtio-scsi dataplane and non-dataplane virtqueue handlers. This became possible because the dataplane handler function now has the same function signature as the non-dataplane handler function. Stefano Garzarella prompted me to make this refactoring. Stefan Hajnoczi (6): aio-posix: split poll check from ready handler virtio: get rid of VirtIOHandleAIOOutput virtio-blk: drop unused virtio_blk_handle_vq() return value virtio-scsi: prepare virtio_scsi_handle_cmd for dataplane virtio: use ->handle_output() instead of ->handle_aio_output() virtio: unify dataplane and non-dataplane ->handle_output() include/block/aio.h | 4 +- include/hw/virtio/virtio-blk.h | 2 +- include/hw/virtio/virtio.h | 5 +- util/aio-posix.h| 1 + block/curl.c| 11 ++-- block/io_uring.c| 19 --- block/iscsi.c | 4 +- block/linux-aio.c | 16 +++--- block/nfs.c | 6 +-- block/nvme.c| 51 --- block/ssh.c | 4 +- block/win32-aio.c | 4 +- hw/block/dataplane/virtio-blk.c | 16 +- hw/block/virtio-blk.c | 14 ++ hw/scsi/virtio-scsi-dataplane.c | 60 +++--- hw/scsi/virtio-scsi.c | 2 +- hw/virtio/virtio.c | 73 +-- hw/xen/xen-bus.c| 6 +-- io/channel-command.c| 6 ++- io/channel-file.c | 3 +- io/channel-socket.c | 3 +- migration/rdma.c| 8 +-- tests/unit/test-aio.c | 4 +- util/aio-posix.c| 89 + util/aio-win32.c| 4 +- util/async.c| 10 +++- util/main-loop.c| 4 +- util/qemu-coroutine-io.c| 5 +- util/vhost-user-server.c| 11 ++-- 29 files changed, 217 insertions(+), 228 deletions(-) -- 2.33.1
[PATCH v2 1/6] aio-posix: split poll check from ready handler
Adaptive polling measures the execution time of the polling check plus handlers called when a polled event becomes ready. Handlers can take a significant amount of time, making it look like polling was running for a long time when in fact the event handler was running for a long time. For example, on Linux the io_submit(2) syscall invoked when a virtio-blk device's virtqueue becomes ready can take 10s of microseconds. This can exceed the default polling interval (32 microseconds) and cause adaptive polling to stop polling. By excluding the handler's execution time from the polling check we make the adaptive polling calculation more accurate. As a result, the event loop now stays in polling mode where previously it would have fallen back to file descriptor monitoring. The following data was collected with virtio-blk num-queues=2 event_idx=off using an IOThread. Before: 168k IOPS, IOThread syscalls: 9837.115 ( 0.020 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 16, iocbpp: 0x7fcb9f937db0)= 16 9837.158 ( 0.002 ms): IO iothread1/620155 write(fd: 103, buf: 0x556a2ef71b88, count: 8) = 8 9837.161 ( 0.001 ms): IO iothread1/620155 write(fd: 104, buf: 0x556a2ef71b88, count: 8) = 8 9837.163 ( 0.001 ms): IO iothread1/620155 ppoll(ufds: 0x7fcb90002800, nfds: 4, tsp: 0x7fcb9f1342d0, sigsetsize: 8) = 3 9837.164 ( 0.001 ms): IO iothread1/620155 read(fd: 107, buf: 0x7fcb9f939cc0, count: 512)= 8 9837.174 ( 0.001 ms): IO iothread1/620155 read(fd: 105, buf: 0x7fcb9f939cc0, count: 512)= 8 9837.176 ( 0.001 ms): IO iothread1/620155 read(fd: 106, buf: 0x7fcb9f939cc0, count: 512)= 8 9837.209 ( 0.035 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 32, iocbpp: 0x7fca7d0cebe0)= 32 174k IOPS (+3.6%), IOThread syscalls: 9809.566 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0cdd62be0)= 32 9809.625 ( 0.001 ms): IO iothread1/623061 write(fd: 103, buf: 0x5647cfba5f58, count: 8) = 8 9809.627 ( 0.002 ms): IO 
iothread1/623061 write(fd: 104, buf: 0x5647cfba5f58, count: 8) = 8 9809.663 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0d0388b50)= 32 Notice that ppoll(2) and eventfd read(2) syscalls are eliminated because the IOThread stays in polling mode instead of falling back to file descriptor monitoring. As usual, polling is not implemented on Windows so this patch ignores the new io_poll_read() callback in aio-win32.c. Signed-off-by: Stefan Hajnoczi --- include/block/aio.h | 4 +- util/aio-posix.h | 1 + block/curl.c | 11 ++--- block/io_uring.c | 19 + block/iscsi.c| 4 +- block/linux-aio.c| 16 +--- block/nfs.c | 6 +-- block/nvme.c | 51 +++ block/ssh.c | 4 +- block/win32-aio.c| 4 +- hw/virtio/virtio.c | 16 +--- hw/xen/xen-bus.c | 6 +-- io/channel-command.c | 6 ++- io/channel-file.c| 3 +- io/channel-socket.c | 3 +- migration/rdma.c | 8 ++-- tests/unit/test-aio.c| 4 +- util/aio-posix.c | 89 ++-- util/aio-win32.c | 4 +- util/async.c | 10 - util/main-loop.c | 4 +- util/qemu-coroutine-io.c | 5 ++- util/vhost-user-server.c | 11 ++--- 23 files changed, 189 insertions(+), 100 deletions(-) diff --git a/include/block/aio.h b/include/block/aio.h index 47fbe9d81f..5634173b12 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -469,6 +469,7 @@ void aio_set_fd_handler(AioContext *ctx, IOHandler *io_read, IOHandler *io_write, AioPollFn *io_poll, +IOHandler *io_poll_ready, void *opaque); /* Set polling begin/end callbacks for a file descriptor that has already been @@ -490,7 +491,8 @@ void aio_set_event_notifier(AioContext *ctx, EventNotifier *notifier, bool is_external, EventNotifierHandler *io_read, -AioPollFn *io_poll); +AioPollFn *io_poll, +EventNotifierHandler *io_poll_ready); /* Set polling begin/end callbacks for an event notifier that has already been * registered with aio_set_event_notifier. 
Do nothing if the event notifier is diff --git a/util/aio-posix.h b/util/aio-posix.h index c80c04506a..7f2c37a684 100644 --- a/util/aio-posix.h +++ b/util/aio-posix.h @@ -24,6 +24,7 @@ struct AioHandler { IOHandler *io_read; IOHandler *io_write; AioPollFn *io_poll; +IOHandler *io_
[PATCH v2 2/6] virtio: get rid of VirtIOHandleAIOOutput
The virtqueue host notifier API virtio_queue_aio_set_host_notifier_handler() polls the virtqueue for new buffers. AioContext previously required a bool progress return value indicating whether an event was handled or not. This is no longer necessary because the AioContext polling API has been split into a poll check function and an event handler function. The event handler is only run when we know there is work to do, so it doesn't return bool. The VirtIOHandleAIOOutput function signature is now the same as VirtIOHandleOutput. Get rid of the bool return value. Further simplifications will be made for virtio-blk and virtio-scsi in the next patch. Signed-off-by: Stefan Hajnoczi --- include/hw/virtio/virtio.h | 3 +-- hw/block/dataplane/virtio-blk.c | 4 ++-- hw/scsi/virtio-scsi-dataplane.c | 18 ++ hw/virtio/virtio.c | 12 4 files changed, 13 insertions(+), 24 deletions(-) diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h index 8bab9cfb75..b90095628f 100644 --- a/include/hw/virtio/virtio.h +++ b/include/hw/virtio/virtio.h @@ -175,7 +175,6 @@ void virtio_error(VirtIODevice *vdev, const char *fmt, ...) 
GCC_FMT_ATTR(2, 3); void virtio_device_set_child_bus_name(VirtIODevice *vdev, char *bus_name); typedef void (*VirtIOHandleOutput)(VirtIODevice *, VirtQueue *); -typedef bool (*VirtIOHandleAIOOutput)(VirtIODevice *, VirtQueue *); VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size, VirtIOHandleOutput handle_output); @@ -318,7 +317,7 @@ EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq); void virtio_queue_set_host_notifier_enabled(VirtQueue *vq, bool enabled); void virtio_queue_host_notifier_read(EventNotifier *n); void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx, -VirtIOHandleAIOOutput handle_output); +VirtIOHandleOutput handle_output); VirtQueue *virtio_vector_first_queue(VirtIODevice *vdev, uint16_t vector); VirtQueue *virtio_vector_next_queue(VirtQueue *vq); diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c index 252c3a7a23..1b50ccd38b 100644 --- a/hw/block/dataplane/virtio-blk.c +++ b/hw/block/dataplane/virtio-blk.c @@ -154,7 +154,7 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s) g_free(s); } -static bool virtio_blk_data_plane_handle_output(VirtIODevice *vdev, +static void virtio_blk_data_plane_handle_output(VirtIODevice *vdev, VirtQueue *vq) { VirtIOBlock *s = (VirtIOBlock *)vdev; @@ -162,7 +162,7 @@ static bool virtio_blk_data_plane_handle_output(VirtIODevice *vdev, assert(s->dataplane); assert(s->dataplane_started); -return virtio_blk_handle_vq(s, vq); +virtio_blk_handle_vq(s, vq); } /* Context: QEMU global mutex held */ diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c index 18eb824c97..76137de67f 100644 --- a/hw/scsi/virtio-scsi-dataplane.c +++ b/hw/scsi/virtio-scsi-dataplane.c @@ -49,49 +49,43 @@ void virtio_scsi_dataplane_setup(VirtIOSCSI *s, Error **errp) } } -static bool virtio_scsi_data_plane_handle_cmd(VirtIODevice *vdev, +static void virtio_scsi_data_plane_handle_cmd(VirtIODevice *vdev, VirtQueue *vq) { -bool progress = false; 
VirtIOSCSI *s = VIRTIO_SCSI(vdev); virtio_scsi_acquire(s); if (!s->dataplane_fenced) { assert(s->ctx && s->dataplane_started); -progress = virtio_scsi_handle_cmd_vq(s, vq); +virtio_scsi_handle_cmd_vq(s, vq); } virtio_scsi_release(s); -return progress; } -static bool virtio_scsi_data_plane_handle_ctrl(VirtIODevice *vdev, +static void virtio_scsi_data_plane_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq) { -bool progress = false; VirtIOSCSI *s = VIRTIO_SCSI(vdev); virtio_scsi_acquire(s); if (!s->dataplane_fenced) { assert(s->ctx && s->dataplane_started); -progress = virtio_scsi_handle_ctrl_vq(s, vq); +virtio_scsi_handle_ctrl_vq(s, vq); } virtio_scsi_release(s); -return progress; } -static bool virtio_scsi_data_plane_handle_event(VirtIODevice *vdev, +static void virtio_scsi_data_plane_handle_event(VirtIODevice *vdev, VirtQueue *vq) { -bool progress = false; VirtIOSCSI *s = VIRTIO_SCSI(vdev); virtio_scsi_acquire(s); if (!s->dataplane_fenced) { assert(s->ctx && s->dataplane_started); -progress = virtio_scsi_handle_event_vq(s, vq); +virtio_scsi_handle_event_vq(s, vq); } virtio_scsi_release(s); -return progress; } static int virtio_scsi_set_host_notifier(VirtIOSCSI *s, VirtQueue *vq, int n) diff --
[PATCH v2 3/6] virtio-blk: drop unused virtio_blk_handle_vq() return value
The return value of virtio_blk_handle_vq() is no longer used. Get rid of it. This is a step towards unifying the dataplane and non-dataplane virtqueue handler functions. Prepare virtio_blk_handle_output() to be used by both dataplane and non-dataplane by making the condition for starting ioeventfd more specific. This way it won't trigger when dataplane has already been started. Signed-off-by: Stefan Hajnoczi --- include/hw/virtio/virtio-blk.h | 2 +- hw/block/virtio-blk.c | 14 +++--- 2 files changed, 4 insertions(+), 12 deletions(-) diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h index 29655a406d..d311c57cca 100644 --- a/include/hw/virtio/virtio-blk.h +++ b/include/hw/virtio/virtio-blk.h @@ -90,7 +90,7 @@ typedef struct MultiReqBuffer { bool is_write; } MultiReqBuffer; -bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq); +void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq); void virtio_blk_process_queued_requests(VirtIOBlock *s, bool is_bh); #endif diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index f139cd7cc9..82676cdd01 100644 --- a/hw/block/virtio-blk.c +++ b/hw/block/virtio-blk.c @@ -767,12 +767,11 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb) return 0; } -bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq) +void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq) { VirtIOBlockReq *req; MultiReqBuffer mrb = {}; bool suppress_notifications = virtio_queue_get_notification(vq); -bool progress = false; aio_context_acquire(blk_get_aio_context(s->blk)); blk_io_plug(s->blk); @@ -783,7 +782,6 @@ bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq) } while ((req = virtio_blk_get_request(s, vq))) { -progress = true; if (virtio_blk_handle_request(req, &mrb)) { virtqueue_detach_element(req->vq, &req->elem, 0); virtio_blk_free_request(req); @@ -802,19 +800,13 @@ bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq) blk_io_unplug(s->blk); 
aio_context_release(blk_get_aio_context(s->blk)); -return progress; -} - -static void virtio_blk_handle_output_do(VirtIOBlock *s, VirtQueue *vq) -{ -virtio_blk_handle_vq(s, vq); } static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq) { VirtIOBlock *s = (VirtIOBlock *)vdev; -if (s->dataplane) { +if (s->dataplane && !s->dataplane_started) { /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start * dataplane here instead of waiting for .set_status(). */ @@ -823,7 +815,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq) return; } } -virtio_blk_handle_output_do(s, vq); +virtio_blk_handle_vq(s, vq); } void virtio_blk_process_queued_requests(VirtIOBlock *s, bool is_bh) -- 2.33.1
[PATCH v2 4/6] virtio-scsi: prepare virtio_scsi_handle_cmd for dataplane
Prepare virtio_scsi_handle_cmd() to be used by both dataplane and non-dataplane by making the condition for starting ioeventfd more specific. This way it won't trigger when dataplane has already been started. Signed-off-by: Stefan Hajnoczi --- hw/scsi/virtio-scsi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c index 51fd09522a..34a968ecfb 100644 --- a/hw/scsi/virtio-scsi.c +++ b/hw/scsi/virtio-scsi.c @@ -720,7 +720,7 @@ static void virtio_scsi_handle_cmd(VirtIODevice *vdev, VirtQueue *vq) /* use non-QOM casts in the data path */ VirtIOSCSI *s = (VirtIOSCSI *)vdev; -if (s->ctx) { +if (s->ctx && !s->dataplane_started) { virtio_device_start_ioeventfd(vdev); if (!s->dataplane_fenced) { return; -- 2.33.1
[PATCH v2 5/6] virtio: use ->handle_output() instead of ->handle_aio_output()
The difference between ->handle_output() and ->handle_aio_output() was that ->handle_aio_output() returned a bool return value indicating progress. This was needed by the old polling API but now that the bool return value is gone, the two functions can be unified. Signed-off-by: Stefan Hajnoczi --- hw/virtio/virtio.c | 33 +++-- 1 file changed, 3 insertions(+), 30 deletions(-) diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index c042be3935..a97a406d3c 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -125,7 +125,6 @@ struct VirtQueue uint16_t vector; VirtIOHandleOutput handle_output; -VirtIOHandleOutput handle_aio_output; VirtIODevice *vdev; EventNotifier guest_notifier; EventNotifier host_notifier; @@ -2300,20 +2299,6 @@ void virtio_queue_set_align(VirtIODevice *vdev, int n, int align) } } -static void virtio_queue_notify_aio_vq(VirtQueue *vq) -{ -if (vq->vring.desc && vq->handle_aio_output) { -VirtIODevice *vdev = vq->vdev; - -trace_virtio_queue_notify(vdev, vq - vdev->vq, vq); -vq->handle_aio_output(vdev, vq); - -if (unlikely(vdev->start_on_kick)) { -virtio_set_started(vdev, true); -} -} -} - static void virtio_queue_notify_vq(VirtQueue *vq) { if (vq->vring.desc && vq->handle_output) { @@ -2392,7 +2377,6 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size, vdev->vq[i].vring.num_default = queue_size; vdev->vq[i].vring.align = VIRTIO_PCI_VRING_ALIGN; vdev->vq[i].handle_output = handle_output; -vdev->vq[i].handle_aio_output = NULL; vdev->vq[i].used_elems = g_malloc0(sizeof(VirtQueueElement) * queue_size); @@ -2404,7 +2388,6 @@ void virtio_delete_queue(VirtQueue *vq) vq->vring.num = 0; vq->vring.num_default = 0; vq->handle_output = NULL; -vq->handle_aio_output = NULL; g_free(vq->used_elems); vq->used_elems = NULL; virtio_virtqueue_reset_region_cache(vq); @@ -3509,14 +3492,6 @@ EventNotifier *virtio_queue_get_guest_notifier(VirtQueue *vq) return &vq->guest_notifier; } -static void virtio_queue_host_notifier_aio_read(EventNotifier *n) -{ 
-VirtQueue *vq = container_of(n, VirtQueue, host_notifier); -if (event_notifier_test_and_clear(n)) { -virtio_queue_notify_aio_vq(vq); -} -} - static void virtio_queue_host_notifier_aio_poll_begin(EventNotifier *n) { VirtQueue *vq = container_of(n, VirtQueue, host_notifier); @@ -3536,7 +3511,7 @@ static void virtio_queue_host_notifier_aio_poll_ready(EventNotifier *n) { VirtQueue *vq = container_of(n, VirtQueue, host_notifier); -virtio_queue_notify_aio_vq(vq); +virtio_queue_notify_vq(vq); } static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n) @@ -3551,9 +3526,8 @@ void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx, VirtIOHandleOutput handle_output) { if (handle_output) { -vq->handle_aio_output = handle_output; aio_set_event_notifier(ctx, &vq->host_notifier, true, - virtio_queue_host_notifier_aio_read, + virtio_queue_host_notifier_read, virtio_queue_host_notifier_aio_poll, virtio_queue_host_notifier_aio_poll_ready); aio_set_event_notifier_poll(ctx, &vq->host_notifier, @@ -3563,8 +3537,7 @@ void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx, aio_set_event_notifier(ctx, &vq->host_notifier, true, NULL, NULL, NULL); /* Test and clear notifier before after disabling event, * in case poll callback didn't have time to run. */ -virtio_queue_host_notifier_aio_read(&vq->host_notifier); -vq->handle_aio_output = NULL; +virtio_queue_host_notifier_read(&vq->host_notifier); } } -- 2.33.1
[PATCH v2 6/6] virtio: unify dataplane and non-dataplane ->handle_output()
Now that virtio-blk and virtio-scsi are ready, get rid of the handle_aio_output() callback. It's no longer needed. Signed-off-by: Stefan Hajnoczi --- include/hw/virtio/virtio.h | 4 +-- hw/block/dataplane/virtio-blk.c | 16 ++ hw/scsi/virtio-scsi-dataplane.c | 54 - hw/virtio/virtio.c | 32 +-- 4 files changed, 26 insertions(+), 80 deletions(-) diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h index b90095628f..f095637058 100644 --- a/include/hw/virtio/virtio.h +++ b/include/hw/virtio/virtio.h @@ -316,8 +316,8 @@ bool virtio_device_ioeventfd_enabled(VirtIODevice *vdev); EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq); void virtio_queue_set_host_notifier_enabled(VirtQueue *vq, bool enabled); void virtio_queue_host_notifier_read(EventNotifier *n); -void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx, -VirtIOHandleOutput handle_output); +void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx); +void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx); VirtQueue *virtio_vector_first_queue(VirtIODevice *vdev, uint16_t vector); VirtQueue *virtio_vector_next_queue(VirtQueue *vq); diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c index 1b50ccd38b..f88f08ef59 100644 --- a/hw/block/dataplane/virtio-blk.c +++ b/hw/block/dataplane/virtio-blk.c @@ -154,17 +154,6 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s) g_free(s); } -static void virtio_blk_data_plane_handle_output(VirtIODevice *vdev, -VirtQueue *vq) -{ -VirtIOBlock *s = (VirtIOBlock *)vdev; - -assert(s->dataplane); -assert(s->dataplane_started); - -virtio_blk_handle_vq(s, vq); -} - /* Context: QEMU global mutex held */ int virtio_blk_data_plane_start(VirtIODevice *vdev) { @@ -258,8 +247,7 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev) for (i = 0; i < nvqs; i++) { VirtQueue *vq = virtio_get_queue(s->vdev, i); -virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, 
-virtio_blk_data_plane_handle_output); +virtio_queue_aio_attach_host_notifier(vq, s->ctx); } aio_context_release(s->ctx); return 0; @@ -302,7 +290,7 @@ static void virtio_blk_data_plane_stop_bh(void *opaque) for (i = 0; i < s->conf->num_queues; i++) { VirtQueue *vq = virtio_get_queue(s->vdev, i); -virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, NULL); +virtio_queue_aio_detach_host_notifier(vq, s->ctx); } } diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c index 76137de67f..29575cbaf6 100644 --- a/hw/scsi/virtio-scsi-dataplane.c +++ b/hw/scsi/virtio-scsi-dataplane.c @@ -49,45 +49,6 @@ void virtio_scsi_dataplane_setup(VirtIOSCSI *s, Error **errp) } } -static void virtio_scsi_data_plane_handle_cmd(VirtIODevice *vdev, - VirtQueue *vq) -{ -VirtIOSCSI *s = VIRTIO_SCSI(vdev); - -virtio_scsi_acquire(s); -if (!s->dataplane_fenced) { -assert(s->ctx && s->dataplane_started); -virtio_scsi_handle_cmd_vq(s, vq); -} -virtio_scsi_release(s); -} - -static void virtio_scsi_data_plane_handle_ctrl(VirtIODevice *vdev, - VirtQueue *vq) -{ -VirtIOSCSI *s = VIRTIO_SCSI(vdev); - -virtio_scsi_acquire(s); -if (!s->dataplane_fenced) { -assert(s->ctx && s->dataplane_started); -virtio_scsi_handle_ctrl_vq(s, vq); -} -virtio_scsi_release(s); -} - -static void virtio_scsi_data_plane_handle_event(VirtIODevice *vdev, -VirtQueue *vq) -{ -VirtIOSCSI *s = VIRTIO_SCSI(vdev); - -virtio_scsi_acquire(s); -if (!s->dataplane_fenced) { -assert(s->ctx && s->dataplane_started); -virtio_scsi_handle_event_vq(s, vq); -} -virtio_scsi_release(s); -} - static int virtio_scsi_set_host_notifier(VirtIOSCSI *s, VirtQueue *vq, int n) { BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(s))); @@ -112,10 +73,10 @@ static void virtio_scsi_dataplane_stop_bh(void *opaque) VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s); int i; -virtio_queue_aio_set_host_notifier_handler(vs->ctrl_vq, s->ctx, NULL); -virtio_queue_aio_set_host_notifier_handler(vs->event_vq, s->ctx, NULL); 
+virtio_queue_aio_detach_host_notifier(vs->ctrl_vq, s->ctx); +virtio_queue_aio_detach_host_notifier(vs->event_vq, s->ctx); for (i = 0; i < vs->conf.num_queues; i++) { -virtio_queue_aio_set_host_notifier_handler(vs->cmd_vqs[i], s->ctx, NULL); +virtio_queue_aio_detach_host_notifier(vs->cmd_vqs[i]
Re: [PATCH v2 0/6] aio-posix: split poll check from ready handler
On Thu, Dec 02, 2021 at 03:49:08PM +, Richard W.M. Jones wrote: > > Not sure if this is related, but builds are failing with: > > FAILED: libblockdev.fa.p/block_export_fuse.c.o > cc -m64 -mcx16 -Ilibblockdev.fa.p -I. -I.. -Iqapi -Itrace -Iui -Iui/shader > -I/usr/include/fuse3 -I/usr/include/p11-kit-1 -I/usr/include/glib-2.0 > -I/usr/lib64/glib-2.0/include -I/usr/include/sysprof-4 > -fdiagnostics-color=auto -Wall -Winvalid-pch -std=gnu11 -O2 -g -isystem > /home/rjones/d/qemu/linux-headers -isystem linux-headers -iquote . -iquote > /home/rjones/d/qemu -iquote /home/rjones/d/qemu/include -iquote > /home/rjones/d/qemu/disas/libvixl -iquote /home/rjones/d/qemu/tcg/i386 > -pthread -DSTAP_SDT_V2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE > -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes > -Wredundant-decls -Wundef -Wwrite-strings -Wmissing-prototypes > -fno-strict-aliasing -fno-common -fwrapv -Wold-style-declaration > -Wold-style-definition -Wtype-limits -Wformat-security -Wformat-y2k > -Winit-self -Wignored-qualifiers -Wempty-body -Wnested-externs -Wendif-labels > -Wexpansion-to-defined -Wimplicit-fallthrough=2 -Wno-missing-include-dirs > -Wno-shift-negative-value -Wno-psabi -fstack-protector-strong -fPIE -MD -MQ > libblockdev.fa.p/block_export_fuse.c.o -MF > libblockdev.fa.p/block_export_fuse.c.o.d -o > libblockdev.fa.p/block_export_fuse.c.o -c ../block/export/fuse.c > ../block/export/fuse.c: In function ‘setup_fuse_export’: > ../block/export/fuse.c:226:59: warning: passing argument 7 of > ‘aio_set_fd_handler’ from incompatible pointer type > [-Wincompatible-pointer-types] > 226 |read_from_fuse_export, NULL, NULL, exp); > | ^~~ > | | > | FuseExport * > In file included from ../block/export/fuse.c:22: > /home/rjones/d/qemu/include/block/aio.h:472:36: note: expected ‘void (*)(void > *)’ but argument is of type ‘FuseExport *’ > 472 | IOHandler *io_poll_ready, > | ~~~^ > ../block/export/fuse.c:224:5: error: too few arguments to function > 
‘aio_set_fd_handler’ > 224 | aio_set_fd_handler(exp->common.ctx, > | ^~ > In file included from ../block/export/fuse.c:22: > /home/rjones/d/qemu/include/block/aio.h:466:6: note: declared here > 466 | void aio_set_fd_handler(AioContext *ctx, > | ^~ > ../block/export/fuse.c: In function ‘fuse_export_shutdown’: > ../block/export/fuse.c:268:13: error: too few arguments to function > ‘aio_set_fd_handler’ > 268 | aio_set_fd_handler(exp->common.ctx, > | ^~ > In file included from ../block/export/fuse.c:22: > /home/rjones/d/qemu/include/block/aio.h:466:6: note: declared here > 466 | void aio_set_fd_handler(AioContext *ctx, > | ^~ Yes, thanks! Stefan
[PATCH v3 0/6] aio-posix: split poll check from ready handler
v3: - Fixed FUSE export aio_set_fd_handler() call that I missed and double-checked for any other missing call sites using Coccinelle [Rich] v2: - Cleaned up unused return values in nvme and virtio-blk [Stefano] - Documented try_poll_mode() ready_list argument [Stefano] - Unified virtio-blk/scsi dataplane and non-dataplane virtqueue handlers [Stefano] The first patch improves AioContext's adaptive polling execution time measurement. This can result in better performance because the algorithm makes better decisions about when to poll versus when to fall back to file descriptor monitoring. The remaining patches unify the virtio-blk and virtio-scsi dataplane and non-dataplane virtqueue handlers. This became possible because the dataplane handler function now has the same function signature as the non-dataplane handler function. Stefano Garzarella prompted me to make this refactoring. Stefan Hajnoczi (6): aio-posix: split poll check from ready handler virtio: get rid of VirtIOHandleAIOOutput virtio-blk: drop unused virtio_blk_handle_vq() return value virtio-scsi: prepare virtio_scsi_handle_cmd for dataplane virtio: use ->handle_output() instead of ->handle_aio_output() virtio: unify dataplane and non-dataplane ->handle_output() include/block/aio.h | 4 +- include/hw/virtio/virtio-blk.h | 2 +- include/hw/virtio/virtio.h | 5 +- util/aio-posix.h| 1 + block/curl.c| 11 ++-- block/export/fuse.c | 4 +- block/io_uring.c| 19 --- block/iscsi.c | 4 +- block/linux-aio.c | 16 +++--- block/nfs.c | 6 +-- block/nvme.c| 51 --- block/ssh.c | 4 +- block/win32-aio.c | 4 +- hw/block/dataplane/virtio-blk.c | 16 +- hw/block/virtio-blk.c | 14 ++ hw/scsi/virtio-scsi-dataplane.c | 60 +++--- hw/scsi/virtio-scsi.c | 2 +- hw/virtio/virtio.c | 73 +-- hw/xen/xen-bus.c| 6 +-- io/channel-command.c| 6 ++- io/channel-file.c | 3 +- io/channel-socket.c | 3 +- migration/rdma.c| 8 +-- tests/unit/test-aio.c | 4 +- util/aio-posix.c| 89 + util/aio-win32.c| 4 +- util/async.c| 10 +++- util/main-loop.c| 4 +- 
util/qemu-coroutine-io.c| 5 +- util/vhost-user-server.c| 11 ++-- 30 files changed, 219 insertions(+), 230 deletions(-) -- 2.33.1
[PATCH v3 3/6] virtio-blk: drop unused virtio_blk_handle_vq() return value
The return value of virtio_blk_handle_vq() is no longer used. Get rid of it. This is a step towards unifying the dataplane and non-dataplane virtqueue handler functions. Prepare virtio_blk_handle_output() to be used by both dataplane and non-dataplane by making the condition for starting ioeventfd more specific. This way it won't trigger when dataplane has already been started. Signed-off-by: Stefan Hajnoczi --- include/hw/virtio/virtio-blk.h | 2 +- hw/block/virtio-blk.c | 14 +++--- 2 files changed, 4 insertions(+), 12 deletions(-) diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h index 29655a406d..d311c57cca 100644 --- a/include/hw/virtio/virtio-blk.h +++ b/include/hw/virtio/virtio-blk.h @@ -90,7 +90,7 @@ typedef struct MultiReqBuffer { bool is_write; } MultiReqBuffer; -bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq); +void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq); void virtio_blk_process_queued_requests(VirtIOBlock *s, bool is_bh); #endif diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index f139cd7cc9..82676cdd01 100644 --- a/hw/block/virtio-blk.c +++ b/hw/block/virtio-blk.c @@ -767,12 +767,11 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb) return 0; } -bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq) +void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq) { VirtIOBlockReq *req; MultiReqBuffer mrb = {}; bool suppress_notifications = virtio_queue_get_notification(vq); -bool progress = false; aio_context_acquire(blk_get_aio_context(s->blk)); blk_io_plug(s->blk); @@ -783,7 +782,6 @@ bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq) } while ((req = virtio_blk_get_request(s, vq))) { -progress = true; if (virtio_blk_handle_request(req, &mrb)) { virtqueue_detach_element(req->vq, &req->elem, 0); virtio_blk_free_request(req); @@ -802,19 +800,13 @@ bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq) blk_io_unplug(s->blk); 
aio_context_release(blk_get_aio_context(s->blk)); -return progress; -} - -static void virtio_blk_handle_output_do(VirtIOBlock *s, VirtQueue *vq) -{ -virtio_blk_handle_vq(s, vq); } static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq) { VirtIOBlock *s = (VirtIOBlock *)vdev; -if (s->dataplane) { +if (s->dataplane && !s->dataplane_started) { /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start * dataplane here instead of waiting for .set_status(). */ @@ -823,7 +815,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq) return; } } -virtio_blk_handle_output_do(s, vq); +virtio_blk_handle_vq(s, vq); } void virtio_blk_process_queued_requests(VirtIOBlock *s, bool is_bh) -- 2.33.1
[PATCH v3 2/6] virtio: get rid of VirtIOHandleAIOOutput
The virtqueue host notifier API virtio_queue_aio_set_host_notifier_handler() polls the virtqueue for new buffers. AioContext previously required a bool progress return value indicating whether an event was handled or not. This is no longer necessary because the AioContext polling API has been split into a poll check function and an event handler function. The event handler is only run when we know there is work to do, so it doesn't return bool. The VirtIOHandleAIOOutput function signature is now the same as VirtIOHandleOutput. Get rid of the bool return value. Further simplifications will be made for virtio-blk and virtio-scsi in the next patch. Signed-off-by: Stefan Hajnoczi --- include/hw/virtio/virtio.h | 3 +-- hw/block/dataplane/virtio-blk.c | 4 ++-- hw/scsi/virtio-scsi-dataplane.c | 18 ++ hw/virtio/virtio.c | 12 4 files changed, 13 insertions(+), 24 deletions(-) diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h index 8bab9cfb75..b90095628f 100644 --- a/include/hw/virtio/virtio.h +++ b/include/hw/virtio/virtio.h @@ -175,7 +175,6 @@ void virtio_error(VirtIODevice *vdev, const char *fmt, ...) 
GCC_FMT_ATTR(2, 3); void virtio_device_set_child_bus_name(VirtIODevice *vdev, char *bus_name); typedef void (*VirtIOHandleOutput)(VirtIODevice *, VirtQueue *); -typedef bool (*VirtIOHandleAIOOutput)(VirtIODevice *, VirtQueue *); VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size, VirtIOHandleOutput handle_output); @@ -318,7 +317,7 @@ EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq); void virtio_queue_set_host_notifier_enabled(VirtQueue *vq, bool enabled); void virtio_queue_host_notifier_read(EventNotifier *n); void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx, -VirtIOHandleAIOOutput handle_output); +VirtIOHandleOutput handle_output); VirtQueue *virtio_vector_first_queue(VirtIODevice *vdev, uint16_t vector); VirtQueue *virtio_vector_next_queue(VirtQueue *vq); diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c index ee5a5352dc..a2fa407b98 100644 --- a/hw/block/dataplane/virtio-blk.c +++ b/hw/block/dataplane/virtio-blk.c @@ -154,7 +154,7 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s) g_free(s); } -static bool virtio_blk_data_plane_handle_output(VirtIODevice *vdev, +static void virtio_blk_data_plane_handle_output(VirtIODevice *vdev, VirtQueue *vq) { VirtIOBlock *s = (VirtIOBlock *)vdev; @@ -162,7 +162,7 @@ static bool virtio_blk_data_plane_handle_output(VirtIODevice *vdev, assert(s->dataplane); assert(s->dataplane_started); -return virtio_blk_handle_vq(s, vq); +virtio_blk_handle_vq(s, vq); } /* Context: QEMU global mutex held */ diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c index 18eb824c97..76137de67f 100644 --- a/hw/scsi/virtio-scsi-dataplane.c +++ b/hw/scsi/virtio-scsi-dataplane.c @@ -49,49 +49,43 @@ void virtio_scsi_dataplane_setup(VirtIOSCSI *s, Error **errp) } } -static bool virtio_scsi_data_plane_handle_cmd(VirtIODevice *vdev, +static void virtio_scsi_data_plane_handle_cmd(VirtIODevice *vdev, VirtQueue *vq) { -bool progress = false; 
VirtIOSCSI *s = VIRTIO_SCSI(vdev); virtio_scsi_acquire(s); if (!s->dataplane_fenced) { assert(s->ctx && s->dataplane_started); -progress = virtio_scsi_handle_cmd_vq(s, vq); +virtio_scsi_handle_cmd_vq(s, vq); } virtio_scsi_release(s); -return progress; } -static bool virtio_scsi_data_plane_handle_ctrl(VirtIODevice *vdev, +static void virtio_scsi_data_plane_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq) { -bool progress = false; VirtIOSCSI *s = VIRTIO_SCSI(vdev); virtio_scsi_acquire(s); if (!s->dataplane_fenced) { assert(s->ctx && s->dataplane_started); -progress = virtio_scsi_handle_ctrl_vq(s, vq); +virtio_scsi_handle_ctrl_vq(s, vq); } virtio_scsi_release(s); -return progress; } -static bool virtio_scsi_data_plane_handle_event(VirtIODevice *vdev, +static void virtio_scsi_data_plane_handle_event(VirtIODevice *vdev, VirtQueue *vq) { -bool progress = false; VirtIOSCSI *s = VIRTIO_SCSI(vdev); virtio_scsi_acquire(s); if (!s->dataplane_fenced) { assert(s->ctx && s->dataplane_started); -progress = virtio_scsi_handle_event_vq(s, vq); +virtio_scsi_handle_event_vq(s, vq); } virtio_scsi_release(s); -return progress; } static int virtio_scsi_set_host_notifier(VirtIOSCSI *s, VirtQueue *vq, int n) diff --
[PATCH v3 4/6] virtio-scsi: prepare virtio_scsi_handle_cmd for dataplane
Prepare virtio_scsi_handle_cmd() to be used by both dataplane and non-dataplane by making the condition for starting ioeventfd more specific. This way it won't trigger when dataplane has already been started. Signed-off-by: Stefan Hajnoczi --- hw/scsi/virtio-scsi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c index 51fd09522a..34a968ecfb 100644 --- a/hw/scsi/virtio-scsi.c +++ b/hw/scsi/virtio-scsi.c @@ -720,7 +720,7 @@ static void virtio_scsi_handle_cmd(VirtIODevice *vdev, VirtQueue *vq) /* use non-QOM casts in the data path */ VirtIOSCSI *s = (VirtIOSCSI *)vdev; -if (s->ctx) { +if (s->ctx && !s->dataplane_started) { virtio_device_start_ioeventfd(vdev); if (!s->dataplane_fenced) { return; -- 2.33.1
[PATCH v3 1/6] aio-posix: split poll check from ready handler
Adaptive polling measures the execution time of the polling check plus handlers called when a polled event becomes ready. Handlers can take a significant amount of time, making it look like polling was running for a long time when in fact the event handler was running for a long time. For example, on Linux the io_submit(2) syscall invoked when a virtio-blk device's virtqueue becomes ready can take 10s of microseconds. This can exceed the default polling interval (32 microseconds) and cause adaptive polling to stop polling. By excluding the handler's execution time from the polling check we make the adaptive polling calculation more accurate. As a result, the event loop now stays in polling mode where previously it would have fallen back to file descriptor monitoring. The following data was collected with virtio-blk num-queues=2 event_idx=off using an IOThread. Before: 168k IOPS, IOThread syscalls: 9837.115 ( 0.020 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 16, iocbpp: 0x7fcb9f937db0)= 16 9837.158 ( 0.002 ms): IO iothread1/620155 write(fd: 103, buf: 0x556a2ef71b88, count: 8) = 8 9837.161 ( 0.001 ms): IO iothread1/620155 write(fd: 104, buf: 0x556a2ef71b88, count: 8) = 8 9837.163 ( 0.001 ms): IO iothread1/620155 ppoll(ufds: 0x7fcb90002800, nfds: 4, tsp: 0x7fcb9f1342d0, sigsetsize: 8) = 3 9837.164 ( 0.001 ms): IO iothread1/620155 read(fd: 107, buf: 0x7fcb9f939cc0, count: 512)= 8 9837.174 ( 0.001 ms): IO iothread1/620155 read(fd: 105, buf: 0x7fcb9f939cc0, count: 512)= 8 9837.176 ( 0.001 ms): IO iothread1/620155 read(fd: 106, buf: 0x7fcb9f939cc0, count: 512)= 8 9837.209 ( 0.035 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 32, iocbpp: 0x7fca7d0cebe0)= 32 174k IOPS (+3.6%), IOThread syscalls: 9809.566 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0cdd62be0)= 32 9809.625 ( 0.001 ms): IO iothread1/623061 write(fd: 103, buf: 0x5647cfba5f58, count: 8) = 8 9809.627 ( 0.002 ms): IO 
iothread1/623061 write(fd: 104, buf: 0x5647cfba5f58, count: 8) = 8 9809.663 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0d0388b50)= 32 Notice that ppoll(2) and eventfd read(2) syscalls are eliminated because the IOThread stays in polling mode instead of falling back to file descriptor monitoring. As usual, polling is not implemented on Windows so this patch ignores the new io_poll_ready() callback in aio-win32.c. Signed-off-by: Stefan Hajnoczi --- include/block/aio.h | 4 +- util/aio-posix.h | 1 + block/curl.c | 11 ++-- block/export/fuse.c | 4 +- block/io_uring.c | 19 + block/iscsi.c | 4 +- block/linux-aio.c | 16 --- block/nfs.c | 6 +-- block/nvme.c | 51 +++ block/ssh.c | 4 +- block/win32-aio.c | 4 +- hw/virtio/virtio.c | 16 +--- hw/xen/xen-bus.c | 6 +-- io/channel-command.c | 6 ++- io/channel-file.c| 3 +- io/channel-socket.c | 3 +- migration/rdma.c | 8 ++-- tests/unit/test-aio.c | 4 +- util/aio-posix.c | 89 ++-- util/aio-win32.c | 4 +- util/async.c | 10 - util/main-loop.c | 4 +- util/qemu-coroutine-io.c | 5 ++- util/vhost-user-server.c | 11 ++--- 24 files changed, 191 insertions(+), 102 deletions(-) diff --git a/include/block/aio.h b/include/block/aio.h index 47fbe9d81f..5634173b12 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -469,6 +469,7 @@ void aio_set_fd_handler(AioContext *ctx, IOHandler *io_read, IOHandler *io_write, AioPollFn *io_poll, +IOHandler *io_poll_ready, void *opaque); /* Set polling begin/end callbacks for a file descriptor that has already been @@ -490,7 +491,8 @@ void aio_set_event_notifier(AioContext *ctx, EventNotifier *notifier, bool is_external, EventNotifierHandler *io_read, -AioPollFn *io_poll); +AioPollFn *io_poll, +EventNotifierHandler *io_poll_ready); /* Set polling begin/end callbacks for an event notifier that has already been
Do nothing if the event notifier is diff --git a/util/aio-posix.h b/util/aio-posix.h index c80c04506a..7f2c37a684 100644 --- a/util/aio-posix.h +++ b/util/aio-posix.h @@ -24,6 +24,7 @@ struct AioHandler { IOHandler *io_read; IOHandler *io_write; AioPollF
[PATCH v3 6/6] virtio: unify dataplane and non-dataplane ->handle_output()
Now that virtio-blk and virtio-scsi are ready, get rid of the handle_aio_output() callback. It's no longer needed. Signed-off-by: Stefan Hajnoczi --- include/hw/virtio/virtio.h | 4 +-- hw/block/dataplane/virtio-blk.c | 16 ++ hw/scsi/virtio-scsi-dataplane.c | 54 - hw/virtio/virtio.c | 32 +-- 4 files changed, 26 insertions(+), 80 deletions(-) diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h index b90095628f..f095637058 100644 --- a/include/hw/virtio/virtio.h +++ b/include/hw/virtio/virtio.h @@ -316,8 +316,8 @@ bool virtio_device_ioeventfd_enabled(VirtIODevice *vdev); EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq); void virtio_queue_set_host_notifier_enabled(VirtQueue *vq, bool enabled); void virtio_queue_host_notifier_read(EventNotifier *n); -void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx, -VirtIOHandleOutput handle_output); +void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx); +void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx); VirtQueue *virtio_vector_first_queue(VirtIODevice *vdev, uint16_t vector); VirtQueue *virtio_vector_next_queue(VirtQueue *vq); diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c index a2fa407b98..49276e46f2 100644 --- a/hw/block/dataplane/virtio-blk.c +++ b/hw/block/dataplane/virtio-blk.c @@ -154,17 +154,6 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s) g_free(s); } -static void virtio_blk_data_plane_handle_output(VirtIODevice *vdev, -VirtQueue *vq) -{ -VirtIOBlock *s = (VirtIOBlock *)vdev; - -assert(s->dataplane); -assert(s->dataplane_started); - -virtio_blk_handle_vq(s, vq); -} - /* Context: QEMU global mutex held */ int virtio_blk_data_plane_start(VirtIODevice *vdev) { @@ -258,8 +247,7 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev) for (i = 0; i < nvqs; i++) { VirtQueue *vq = virtio_get_queue(s->vdev, i); -virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, 
-virtio_blk_data_plane_handle_output); +virtio_queue_aio_attach_host_notifier(vq, s->ctx); } aio_context_release(s->ctx); return 0; @@ -302,7 +290,7 @@ static void virtio_blk_data_plane_stop_bh(void *opaque) for (i = 0; i < s->conf->num_queues; i++) { VirtQueue *vq = virtio_get_queue(s->vdev, i); -virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, NULL); +virtio_queue_aio_detach_host_notifier(vq, s->ctx); } } diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c index 76137de67f..29575cbaf6 100644 --- a/hw/scsi/virtio-scsi-dataplane.c +++ b/hw/scsi/virtio-scsi-dataplane.c @@ -49,45 +49,6 @@ void virtio_scsi_dataplane_setup(VirtIOSCSI *s, Error **errp) } } -static void virtio_scsi_data_plane_handle_cmd(VirtIODevice *vdev, - VirtQueue *vq) -{ -VirtIOSCSI *s = VIRTIO_SCSI(vdev); - -virtio_scsi_acquire(s); -if (!s->dataplane_fenced) { -assert(s->ctx && s->dataplane_started); -virtio_scsi_handle_cmd_vq(s, vq); -} -virtio_scsi_release(s); -} - -static void virtio_scsi_data_plane_handle_ctrl(VirtIODevice *vdev, - VirtQueue *vq) -{ -VirtIOSCSI *s = VIRTIO_SCSI(vdev); - -virtio_scsi_acquire(s); -if (!s->dataplane_fenced) { -assert(s->ctx && s->dataplane_started); -virtio_scsi_handle_ctrl_vq(s, vq); -} -virtio_scsi_release(s); -} - -static void virtio_scsi_data_plane_handle_event(VirtIODevice *vdev, -VirtQueue *vq) -{ -VirtIOSCSI *s = VIRTIO_SCSI(vdev); - -virtio_scsi_acquire(s); -if (!s->dataplane_fenced) { -assert(s->ctx && s->dataplane_started); -virtio_scsi_handle_event_vq(s, vq); -} -virtio_scsi_release(s); -} - static int virtio_scsi_set_host_notifier(VirtIOSCSI *s, VirtQueue *vq, int n) { BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(s))); @@ -112,10 +73,10 @@ static void virtio_scsi_dataplane_stop_bh(void *opaque) VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s); int i; -virtio_queue_aio_set_host_notifier_handler(vs->ctrl_vq, s->ctx, NULL); -virtio_queue_aio_set_host_notifier_handler(vs->event_vq, s->ctx, NULL); 
+virtio_queue_aio_detach_host_notifier(vs->ctrl_vq, s->ctx); +virtio_queue_aio_detach_host_notifier(vs->event_vq, s->ctx); for (i = 0; i < vs->conf.num_queues; i++) { -virtio_queue_aio_set_host_notifier_handler(vs->cmd_vqs[i], s->ctx, NULL); +virtio_queue_aio_detach_host_notifier(vs->cmd_vqs[i]
[PATCH v3 5/6] virtio: use ->handle_output() instead of ->handle_aio_output()
The difference between ->handle_output() and ->handle_aio_output() was that ->handle_aio_output() returned a bool return value indicating progress. This was needed by the old polling API but now that the bool return value is gone, the two functions can be unified. Signed-off-by: Stefan Hajnoczi --- hw/virtio/virtio.c | 33 +++-- 1 file changed, 3 insertions(+), 30 deletions(-) diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index c042be3935..a97a406d3c 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -125,7 +125,6 @@ struct VirtQueue uint16_t vector; VirtIOHandleOutput handle_output; -VirtIOHandleOutput handle_aio_output; VirtIODevice *vdev; EventNotifier guest_notifier; EventNotifier host_notifier; @@ -2300,20 +2299,6 @@ void virtio_queue_set_align(VirtIODevice *vdev, int n, int align) } } -static void virtio_queue_notify_aio_vq(VirtQueue *vq) -{ -if (vq->vring.desc && vq->handle_aio_output) { -VirtIODevice *vdev = vq->vdev; - -trace_virtio_queue_notify(vdev, vq - vdev->vq, vq); -vq->handle_aio_output(vdev, vq); - -if (unlikely(vdev->start_on_kick)) { -virtio_set_started(vdev, true); -} -} -} - static void virtio_queue_notify_vq(VirtQueue *vq) { if (vq->vring.desc && vq->handle_output) { @@ -2392,7 +2377,6 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size, vdev->vq[i].vring.num_default = queue_size; vdev->vq[i].vring.align = VIRTIO_PCI_VRING_ALIGN; vdev->vq[i].handle_output = handle_output; -vdev->vq[i].handle_aio_output = NULL; vdev->vq[i].used_elems = g_malloc0(sizeof(VirtQueueElement) * queue_size); @@ -2404,7 +2388,6 @@ void virtio_delete_queue(VirtQueue *vq) vq->vring.num = 0; vq->vring.num_default = 0; vq->handle_output = NULL; -vq->handle_aio_output = NULL; g_free(vq->used_elems); vq->used_elems = NULL; virtio_virtqueue_reset_region_cache(vq); @@ -3509,14 +3492,6 @@ EventNotifier *virtio_queue_get_guest_notifier(VirtQueue *vq) return &vq->guest_notifier; } -static void virtio_queue_host_notifier_aio_read(EventNotifier *n) -{ 
-VirtQueue *vq = container_of(n, VirtQueue, host_notifier); -if (event_notifier_test_and_clear(n)) { -virtio_queue_notify_aio_vq(vq); -} -} - static void virtio_queue_host_notifier_aio_poll_begin(EventNotifier *n) { VirtQueue *vq = container_of(n, VirtQueue, host_notifier); @@ -3536,7 +3511,7 @@ static void virtio_queue_host_notifier_aio_poll_ready(EventNotifier *n) { VirtQueue *vq = container_of(n, VirtQueue, host_notifier); -virtio_queue_notify_aio_vq(vq); +virtio_queue_notify_vq(vq); } static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n) @@ -3551,9 +3526,8 @@ void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx, VirtIOHandleOutput handle_output) { if (handle_output) { -vq->handle_aio_output = handle_output; aio_set_event_notifier(ctx, &vq->host_notifier, true, - virtio_queue_host_notifier_aio_read, + virtio_queue_host_notifier_read, virtio_queue_host_notifier_aio_poll, virtio_queue_host_notifier_aio_poll_ready); aio_set_event_notifier_poll(ctx, &vq->host_notifier, @@ -3563,8 +3537,7 @@ void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx, aio_set_event_notifier(ctx, &vq->host_notifier, true, NULL, NULL, NULL); /* Test and clear notifier before after disabling event, * in case poll callback didn't have time to run. */ -virtio_queue_host_notifier_aio_read(&vq->host_notifier); -vq->handle_aio_output = NULL; +virtio_queue_host_notifier_read(&vq->host_notifier); } } -- 2.33.1
Re: [PATCH v3 0/6] aio-posix: split poll check from ready handler
On Tue, Dec 07, 2021 at 01:23:30PM +, Stefan Hajnoczi wrote: > v3: > - Fixed FUSE export aio_set_fd_handler() call that I missed and double-checked > for any other missing call sites using Coccinelle [Rich] > v2: > - Cleaned up unused return values in nvme and virtio-blk [Stefano] > - Documented try_poll_mode() ready_list argument [Stefano] > - Unified virtio-blk/scsi dataplane and non-dataplane virtqueue handlers > [Stefano] > > The first patch improves AioContext's adaptive polling execution time > measurement. This can result in better performance because the algorithm makes > better decisions about when to poll versus when to fall back to file > descriptor > monitoring. > > The remaining patches unify the virtio-blk and virtio-scsi dataplane and > non-dataplane virtqueue handlers. This became possible because the dataplane > handler function now has the same function signature as the non-dataplane > handler function. Stefano Garzarella prompted me to make this refactoring. > > Stefan Hajnoczi (6): > aio-posix: split poll check from ready handler > virtio: get rid of VirtIOHandleAIOOutput > virtio-blk: drop unused virtio_blk_handle_vq() return value > virtio-scsi: prepare virtio_scsi_handle_cmd for dataplane > virtio: use ->handle_output() instead of ->handle_aio_output() > virtio: unify dataplane and non-dataplane ->handle_output() > > include/block/aio.h | 4 +- > include/hw/virtio/virtio-blk.h | 2 +- > include/hw/virtio/virtio.h | 5 +- > util/aio-posix.h| 1 + > block/curl.c| 11 ++-- > block/export/fuse.c | 4 +- > block/io_uring.c| 19 --- > block/iscsi.c | 4 +- > block/linux-aio.c | 16 +++--- > block/nfs.c | 6 +-- > block/nvme.c| 51 --- > block/ssh.c | 4 +- > block/win32-aio.c | 4 +- > hw/block/dataplane/virtio-blk.c | 16 +- > hw/block/virtio-blk.c | 14 ++ > hw/scsi/virtio-scsi-dataplane.c | 60 +++--- > hw/scsi/virtio-scsi.c | 2 +- > hw/virtio/virtio.c | 73 +-- > hw/xen/xen-bus.c| 6 +-- > io/channel-command.c| 6 ++- > io/channel-file.c | 3 +- > 
io/channel-socket.c | 3 +- > migration/rdma.c| 8 +-- > tests/unit/test-aio.c | 4 +- > util/aio-posix.c| 89 + > util/aio-win32.c| 4 +- > util/async.c| 10 +++- > util/main-loop.c| 4 +- > util/qemu-coroutine-io.c| 5 +- > util/vhost-user-server.c| 11 ++-- > 30 files changed, 219 insertions(+), 230 deletions(-) > > -- > 2.33.1 > > Thanks, applied to my block-next tree: https://gitlab.com/stefanha/qemu/commits/block-next Stefan
[PULL 0/6] Block patches
The following changes since commit 91f5f7a5df1fda8c34677a7c49ee8a4bb5b56a36: Merge remote-tracking branch 'remotes/lvivier-gitlab/tags/linux-user-for-7.0-pull-request' into staging (2022-01-12 11:51:47 +) are available in the Git repository at: https://gitlab.com/stefanha/qemu.git tags/block-pull-request for you to fetch changes up to db608fb78444c58896db69495729e4458eeaace1: virtio: unify dataplane and non-dataplane ->handle_output() (2022-01-12 17:09:39 +) Pull request ---- Stefan Hajnoczi (6): aio-posix: split poll check from ready handler virtio: get rid of VirtIOHandleAIOOutput virtio-blk: drop unused virtio_blk_handle_vq() return value virtio-scsi: prepare virtio_scsi_handle_cmd for dataplane virtio: use ->handle_output() instead of ->handle_aio_output() virtio: unify dataplane and non-dataplane ->handle_output() include/block/aio.h | 4 +- include/hw/virtio/virtio-blk.h | 2 +- include/hw/virtio/virtio.h | 5 +- util/aio-posix.h| 1 + block/curl.c| 11 ++-- block/export/fuse.c | 4 +- block/io_uring.c| 19 --- block/iscsi.c | 4 +- block/linux-aio.c | 16 +++--- block/nfs.c | 6 +-- block/nvme.c| 51 --- block/ssh.c | 4 +- block/win32-aio.c | 4 +- hw/block/dataplane/virtio-blk.c | 16 +- hw/block/virtio-blk.c | 14 ++ hw/scsi/virtio-scsi-dataplane.c | 60 +++--- hw/scsi/virtio-scsi.c | 2 +- hw/virtio/virtio.c | 73 +-- hw/xen/xen-bus.c| 6 +-- io/channel-command.c| 6 ++- io/channel-file.c | 3 +- io/channel-socket.c | 3 +- migration/rdma.c| 8 +-- tests/unit/test-aio.c | 4 +- tests/unit/test-fdmon-epoll.c | 4 +- util/aio-posix.c| 89 + util/aio-win32.c| 4 +- util/async.c| 10 +++- util/main-loop.c| 4 +- util/qemu-coroutine-io.c| 5 +- util/vhost-user-server.c| 11 ++-- 31 files changed, 221 insertions(+), 232 deletions(-) -- 2.34.1
[PULL 1/6] aio-posix: split poll check from ready handler
Adaptive polling measures the execution time of the polling check plus handlers called when a polled event becomes ready. Handlers can take a significant amount of time, making it look like polling was running for a long time when in fact the event handler was running for a long time. For example, on Linux the io_submit(2) syscall invoked when a virtio-blk device's virtqueue becomes ready can take 10s of microseconds. This can exceed the default polling interval (32 microseconds) and cause adaptive polling to stop polling. By excluding the handler's execution time from the polling check we make the adaptive polling calculation more accurate. As a result, the event loop now stays in polling mode where previously it would have fallen back to file descriptor monitoring. The following data was collected with virtio-blk num-queues=2 event_idx=off using an IOThread. Before: 168k IOPS, IOThread syscalls: 9837.115 ( 0.020 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 16, iocbpp: 0x7fcb9f937db0)= 16 9837.158 ( 0.002 ms): IO iothread1/620155 write(fd: 103, buf: 0x556a2ef71b88, count: 8) = 8 9837.161 ( 0.001 ms): IO iothread1/620155 write(fd: 104, buf: 0x556a2ef71b88, count: 8) = 8 9837.163 ( 0.001 ms): IO iothread1/620155 ppoll(ufds: 0x7fcb90002800, nfds: 4, tsp: 0x7fcb9f1342d0, sigsetsize: 8) = 3 9837.164 ( 0.001 ms): IO iothread1/620155 read(fd: 107, buf: 0x7fcb9f939cc0, count: 512)= 8 9837.174 ( 0.001 ms): IO iothread1/620155 read(fd: 105, buf: 0x7fcb9f939cc0, count: 512)= 8 9837.176 ( 0.001 ms): IO iothread1/620155 read(fd: 106, buf: 0x7fcb9f939cc0, count: 512)= 8 9837.209 ( 0.035 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 32, iocbpp: 0x7fca7d0cebe0)= 32 174k IOPS (+3.6%), IOThread syscalls: 9809.566 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0cdd62be0)= 32 9809.625 ( 0.001 ms): IO iothread1/623061 write(fd: 103, buf: 0x5647cfba5f58, count: 8) = 8 9809.627 ( 0.002 ms): IO 
iothread1/623061 write(fd: 104, buf: 0x5647cfba5f58, count: 8) = 8 9809.663 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0d0388b50)= 32 Notice that ppoll(2) and eventfd read(2) syscalls are eliminated because the IOThread stays in polling mode instead of falling back to file descriptor monitoring. As usual, polling is not implemented on Windows so this patch ignores the new io_poll_read() callback in aio-win32.c. Signed-off-by: Stefan Hajnoczi Reviewed-by: Stefano Garzarella Message-id: 20211207132336.36627-2-stefa...@redhat.com [Fixed up aio_set_event_notifier() calls in tests/unit/test-fdmon-epoll.c added after this series was queued. --Stefan] Signed-off-by: Stefan Hajnoczi --- include/block/aio.h | 4 +- util/aio-posix.h | 1 + block/curl.c | 11 +++-- block/export/fuse.c | 4 +- block/io_uring.c | 19 block/iscsi.c | 4 +- block/linux-aio.c | 16 --- block/nfs.c | 6 +-- block/nvme.c | 51 +--- block/ssh.c | 4 +- block/win32-aio.c | 4 +- hw/virtio/virtio.c| 16 --- hw/xen/xen-bus.c | 6 +-- io/channel-command.c | 6 ++- io/channel-file.c | 3 +- io/channel-socket.c | 3 +- migration/rdma.c | 8 ++-- tests/unit/test-aio.c | 4 +- tests/unit/test-fdmon-epoll.c | 4 +- util/aio-posix.c | 89 ++- util/aio-win32.c | 4 +- util/async.c | 10 +++- util/main-loop.c | 4 +- util/qemu-coroutine-io.c | 5 +- util/vhost-user-server.c | 11 +++-- 25 files changed, 193 insertions(+), 104 deletions(-) diff --git a/include/block/aio.h b/include/block/aio.h index 47fbe9d81f..5634173b12 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -469,6 +469,7 @@ void aio_set_fd_handler(AioContext *ctx, IOHandler *io_read, IOHandler *io_write, AioPollFn *io_poll, +IOHandler *io_poll_ready, void *opaque); /* Set polling begin/end callbacks for a file descriptor that has already been @@ -490,7 +491,8 @@ void aio_set_event_notifier(AioContext *ctx, EventNotifier *notifier, bool is_external, EventNotifierHandler *io_read, -AioPollFn *io_poll); +AioPollFn 
*io_poll, +EventNotifierHandler *io_poll_ready); /* Set polling
[PULL 3/6] virtio-blk: drop unused virtio_blk_handle_vq() return value
The return value of virtio_blk_handle_vq() is no longer used. Get rid of it. This is a step towards unifying the dataplane and non-dataplane virtqueue handler functions. Prepare virtio_blk_handle_output() to be used by both dataplane and non-dataplane by making the condition for starting ioeventfd more specific. This way it won't trigger when dataplane has already been started. Signed-off-by: Stefan Hajnoczi Reviewed-by: Stefano Garzarella Message-id: 20211207132336.36627-4-stefa...@redhat.com Signed-off-by: Stefan Hajnoczi --- include/hw/virtio/virtio-blk.h | 2 +- hw/block/virtio-blk.c | 14 +++--- 2 files changed, 4 insertions(+), 12 deletions(-) diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h index 29655a406d..d311c57cca 100644 --- a/include/hw/virtio/virtio-blk.h +++ b/include/hw/virtio/virtio-blk.h @@ -90,7 +90,7 @@ typedef struct MultiReqBuffer { bool is_write; } MultiReqBuffer; -bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq); +void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq); void virtio_blk_process_queued_requests(VirtIOBlock *s, bool is_bh); #endif diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index f139cd7cc9..82676cdd01 100644 --- a/hw/block/virtio-blk.c +++ b/hw/block/virtio-blk.c @@ -767,12 +767,11 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb) return 0; } -bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq) +void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq) { VirtIOBlockReq *req; MultiReqBuffer mrb = {}; bool suppress_notifications = virtio_queue_get_notification(vq); -bool progress = false; aio_context_acquire(blk_get_aio_context(s->blk)); blk_io_plug(s->blk); @@ -783,7 +782,6 @@ bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq) } while ((req = virtio_blk_get_request(s, vq))) { -progress = true; if (virtio_blk_handle_request(req, &mrb)) { virtqueue_detach_element(req->vq, &req->elem, 0); virtio_blk_free_request(req); @@ -802,19 +800,13 @@ 
bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq) blk_io_unplug(s->blk); aio_context_release(blk_get_aio_context(s->blk)); -return progress; -} - -static void virtio_blk_handle_output_do(VirtIOBlock *s, VirtQueue *vq) -{ -virtio_blk_handle_vq(s, vq); } static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq) { VirtIOBlock *s = (VirtIOBlock *)vdev; -if (s->dataplane) { +if (s->dataplane && !s->dataplane_started) { /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start * dataplane here instead of waiting for .set_status(). */ @@ -823,7 +815,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq) return; } } -virtio_blk_handle_output_do(s, vq); +virtio_blk_handle_vq(s, vq); } void virtio_blk_process_queued_requests(VirtIOBlock *s, bool is_bh) -- 2.34.1
[PULL 4/6] virtio-scsi: prepare virtio_scsi_handle_cmd for dataplane
Prepare virtio_scsi_handle_cmd() to be used by both dataplane and non-dataplane by making the condition for starting ioeventfd more specific. This way it won't trigger when dataplane has already been started. Signed-off-by: Stefan Hajnoczi Reviewed-by: Stefano Garzarella Message-id: 20211207132336.36627-5-stefa...@redhat.com Signed-off-by: Stefan Hajnoczi --- hw/scsi/virtio-scsi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c index 51fd09522a..34a968ecfb 100644 --- a/hw/scsi/virtio-scsi.c +++ b/hw/scsi/virtio-scsi.c @@ -720,7 +720,7 @@ static void virtio_scsi_handle_cmd(VirtIODevice *vdev, VirtQueue *vq) /* use non-QOM casts in the data path */ VirtIOSCSI *s = (VirtIOSCSI *)vdev; -if (s->ctx) { +if (s->ctx && !s->dataplane_started) { virtio_device_start_ioeventfd(vdev); if (!s->dataplane_fenced) { return; -- 2.34.1
[PULL 5/6] virtio: use ->handle_output() instead of ->handle_aio_output()
The difference between ->handle_output() and ->handle_aio_output() was that ->handle_aio_output() returned a bool return value indicating progress. This was needed by the old polling API but now that the bool return value is gone, the two functions can be unified. Signed-off-by: Stefan Hajnoczi Reviewed-by: Stefano Garzarella Message-id: 20211207132336.36627-6-stefa...@redhat.com Signed-off-by: Stefan Hajnoczi --- hw/virtio/virtio.c | 33 +++-- 1 file changed, 3 insertions(+), 30 deletions(-) diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index 323f549aad..e938e6513b 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -125,7 +125,6 @@ struct VirtQueue uint16_t vector; VirtIOHandleOutput handle_output; -VirtIOHandleOutput handle_aio_output; VirtIODevice *vdev; EventNotifier guest_notifier; EventNotifier host_notifier; @@ -2303,20 +2302,6 @@ void virtio_queue_set_align(VirtIODevice *vdev, int n, int align) } } -static void virtio_queue_notify_aio_vq(VirtQueue *vq) -{ -if (vq->vring.desc && vq->handle_aio_output) { -VirtIODevice *vdev = vq->vdev; - -trace_virtio_queue_notify(vdev, vq - vdev->vq, vq); -vq->handle_aio_output(vdev, vq); - -if (unlikely(vdev->start_on_kick)) { -virtio_set_started(vdev, true); -} -} -} - static void virtio_queue_notify_vq(VirtQueue *vq) { if (vq->vring.desc && vq->handle_output) { @@ -2395,7 +2380,6 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size, vdev->vq[i].vring.num_default = queue_size; vdev->vq[i].vring.align = VIRTIO_PCI_VRING_ALIGN; vdev->vq[i].handle_output = handle_output; -vdev->vq[i].handle_aio_output = NULL; vdev->vq[i].used_elems = g_malloc0(sizeof(VirtQueueElement) * queue_size); @@ -2407,7 +2391,6 @@ void virtio_delete_queue(VirtQueue *vq) vq->vring.num = 0; vq->vring.num_default = 0; vq->handle_output = NULL; -vq->handle_aio_output = NULL; g_free(vq->used_elems); vq->used_elems = NULL; virtio_virtqueue_reset_region_cache(vq); @@ -3512,14 +3495,6 @@ EventNotifier 
*virtio_queue_get_guest_notifier(VirtQueue *vq) return &vq->guest_notifier; } -static void virtio_queue_host_notifier_aio_read(EventNotifier *n) -{ -VirtQueue *vq = container_of(n, VirtQueue, host_notifier); -if (event_notifier_test_and_clear(n)) { -virtio_queue_notify_aio_vq(vq); -} -} - static void virtio_queue_host_notifier_aio_poll_begin(EventNotifier *n) { VirtQueue *vq = container_of(n, VirtQueue, host_notifier); @@ -3539,7 +3514,7 @@ static void virtio_queue_host_notifier_aio_poll_ready(EventNotifier *n) { VirtQueue *vq = container_of(n, VirtQueue, host_notifier); -virtio_queue_notify_aio_vq(vq); +virtio_queue_notify_vq(vq); } static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n) @@ -3554,9 +3529,8 @@ void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx, VirtIOHandleOutput handle_output) { if (handle_output) { -vq->handle_aio_output = handle_output; aio_set_event_notifier(ctx, &vq->host_notifier, true, - virtio_queue_host_notifier_aio_read, + virtio_queue_host_notifier_read, virtio_queue_host_notifier_aio_poll, virtio_queue_host_notifier_aio_poll_ready); aio_set_event_notifier_poll(ctx, &vq->host_notifier, @@ -3566,8 +3540,7 @@ void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx, aio_set_event_notifier(ctx, &vq->host_notifier, true, NULL, NULL, NULL); /* Test and clear notifier before after disabling event, * in case poll callback didn't have time to run. */ -virtio_queue_host_notifier_aio_read(&vq->host_notifier); -vq->handle_aio_output = NULL; +virtio_queue_host_notifier_read(&vq->host_notifier); } } -- 2.34.1
[PULL 6/6] virtio: unify dataplane and non-dataplane ->handle_output()
Now that virtio-blk and virtio-scsi are ready, get rid of the handle_aio_output() callback. It's no longer needed. Signed-off-by: Stefan Hajnoczi Reviewed-by: Stefano Garzarella Message-id: 20211207132336.36627-7-stefa...@redhat.com Signed-off-by: Stefan Hajnoczi --- include/hw/virtio/virtio.h | 4 +-- hw/block/dataplane/virtio-blk.c | 16 ++ hw/scsi/virtio-scsi-dataplane.c | 54 - hw/virtio/virtio.c | 32 +-- 4 files changed, 26 insertions(+), 80 deletions(-) diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h index b90095628f..f095637058 100644 --- a/include/hw/virtio/virtio.h +++ b/include/hw/virtio/virtio.h @@ -316,8 +316,8 @@ bool virtio_device_ioeventfd_enabled(VirtIODevice *vdev); EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq); void virtio_queue_set_host_notifier_enabled(VirtQueue *vq, bool enabled); void virtio_queue_host_notifier_read(EventNotifier *n); -void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx, -VirtIOHandleOutput handle_output); +void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx); +void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx); VirtQueue *virtio_vector_first_queue(VirtIODevice *vdev, uint16_t vector); VirtQueue *virtio_vector_next_queue(VirtQueue *vq); diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c index a2fa407b98..49276e46f2 100644 --- a/hw/block/dataplane/virtio-blk.c +++ b/hw/block/dataplane/virtio-blk.c @@ -154,17 +154,6 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s) g_free(s); } -static void virtio_blk_data_plane_handle_output(VirtIODevice *vdev, -VirtQueue *vq) -{ -VirtIOBlock *s = (VirtIOBlock *)vdev; - -assert(s->dataplane); -assert(s->dataplane_started); - -virtio_blk_handle_vq(s, vq); -} - /* Context: QEMU global mutex held */ int virtio_blk_data_plane_start(VirtIODevice *vdev) { @@ -258,8 +247,7 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev) for (i = 0; i < nvqs; i++) { 
VirtQueue *vq = virtio_get_queue(s->vdev, i); -virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, -virtio_blk_data_plane_handle_output); +virtio_queue_aio_attach_host_notifier(vq, s->ctx); } aio_context_release(s->ctx); return 0; @@ -302,7 +290,7 @@ static void virtio_blk_data_plane_stop_bh(void *opaque) for (i = 0; i < s->conf->num_queues; i++) { VirtQueue *vq = virtio_get_queue(s->vdev, i); -virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, NULL); +virtio_queue_aio_detach_host_notifier(vq, s->ctx); } } diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c index 76137de67f..29575cbaf6 100644 --- a/hw/scsi/virtio-scsi-dataplane.c +++ b/hw/scsi/virtio-scsi-dataplane.c @@ -49,45 +49,6 @@ void virtio_scsi_dataplane_setup(VirtIOSCSI *s, Error **errp) } } -static void virtio_scsi_data_plane_handle_cmd(VirtIODevice *vdev, - VirtQueue *vq) -{ -VirtIOSCSI *s = VIRTIO_SCSI(vdev); - -virtio_scsi_acquire(s); -if (!s->dataplane_fenced) { -assert(s->ctx && s->dataplane_started); -virtio_scsi_handle_cmd_vq(s, vq); -} -virtio_scsi_release(s); -} - -static void virtio_scsi_data_plane_handle_ctrl(VirtIODevice *vdev, - VirtQueue *vq) -{ -VirtIOSCSI *s = VIRTIO_SCSI(vdev); - -virtio_scsi_acquire(s); -if (!s->dataplane_fenced) { -assert(s->ctx && s->dataplane_started); -virtio_scsi_handle_ctrl_vq(s, vq); -} -virtio_scsi_release(s); -} - -static void virtio_scsi_data_plane_handle_event(VirtIODevice *vdev, -VirtQueue *vq) -{ -VirtIOSCSI *s = VIRTIO_SCSI(vdev); - -virtio_scsi_acquire(s); -if (!s->dataplane_fenced) { -assert(s->ctx && s->dataplane_started); -virtio_scsi_handle_event_vq(s, vq); -} -virtio_scsi_release(s); -} - static int virtio_scsi_set_host_notifier(VirtIOSCSI *s, VirtQueue *vq, int n) { BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(s))); @@ -112,10 +73,10 @@ static void virtio_scsi_dataplane_stop_bh(void *opaque) VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s); int i; -virtio_queue_aio_set_host_notifier_handler(vs->ctrl_vq, s->ctx, NULL); 
-virtio_queue_aio_set_host_notifier_handler(vs->event_vq, s->ctx, NULL); +virtio_queue_aio_detach_host_notifier(vs->ctrl_vq, s->ctx); +virtio_queue_aio_detach_host_notifier(vs->event_vq, s->ctx); for (i = 0; i < vs->conf.num_queues; i++) { -virtio_queue_aio_set_host_notifier_handler(vs->cm
[PULL 2/6] virtio: get rid of VirtIOHandleAIOOutput
The virtqueue host notifier API virtio_queue_aio_set_host_notifier_handler() polls the virtqueue for new buffers. AioContext previously required a bool progress return value indicating whether an event was handled or not. This is no longer necessary because the AioContext polling API has been split into a poll check function and an event handler function. The event handler is only run when we know there is work to do, so it doesn't return bool. The VirtIOHandleAIOOutput function signature is now the same as VirtIOHandleOutput. Get rid of the bool return value. Further simplifications will be made for virtio-blk and virtio-scsi in the next patch. Signed-off-by: Stefan Hajnoczi Reviewed-by: Stefano Garzarella Message-id: 20211207132336.36627-3-stefa...@redhat.com Signed-off-by: Stefan Hajnoczi --- include/hw/virtio/virtio.h | 3 +-- hw/block/dataplane/virtio-blk.c | 4 ++-- hw/scsi/virtio-scsi-dataplane.c | 18 ++ hw/virtio/virtio.c | 12 4 files changed, 13 insertions(+), 24 deletions(-) diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h index 8bab9cfb75..b90095628f 100644 --- a/include/hw/virtio/virtio.h +++ b/include/hw/virtio/virtio.h @@ -175,7 +175,6 @@ void virtio_error(VirtIODevice *vdev, const char *fmt, ...) 
GCC_FMT_ATTR(2, 3); void virtio_device_set_child_bus_name(VirtIODevice *vdev, char *bus_name); typedef void (*VirtIOHandleOutput)(VirtIODevice *, VirtQueue *); -typedef bool (*VirtIOHandleAIOOutput)(VirtIODevice *, VirtQueue *); VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size, VirtIOHandleOutput handle_output); @@ -318,7 +317,7 @@ EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq); void virtio_queue_set_host_notifier_enabled(VirtQueue *vq, bool enabled); void virtio_queue_host_notifier_read(EventNotifier *n); void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx, -VirtIOHandleAIOOutput handle_output); +VirtIOHandleOutput handle_output); VirtQueue *virtio_vector_first_queue(VirtIODevice *vdev, uint16_t vector); VirtQueue *virtio_vector_next_queue(VirtQueue *vq); diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c index ee5a5352dc..a2fa407b98 100644 --- a/hw/block/dataplane/virtio-blk.c +++ b/hw/block/dataplane/virtio-blk.c @@ -154,7 +154,7 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s) g_free(s); } -static bool virtio_blk_data_plane_handle_output(VirtIODevice *vdev, +static void virtio_blk_data_plane_handle_output(VirtIODevice *vdev, VirtQueue *vq) { VirtIOBlock *s = (VirtIOBlock *)vdev; @@ -162,7 +162,7 @@ static bool virtio_blk_data_plane_handle_output(VirtIODevice *vdev, assert(s->dataplane); assert(s->dataplane_started); -return virtio_blk_handle_vq(s, vq); +virtio_blk_handle_vq(s, vq); } /* Context: QEMU global mutex held */ diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c index 18eb824c97..76137de67f 100644 --- a/hw/scsi/virtio-scsi-dataplane.c +++ b/hw/scsi/virtio-scsi-dataplane.c @@ -49,49 +49,43 @@ void virtio_scsi_dataplane_setup(VirtIOSCSI *s, Error **errp) } } -static bool virtio_scsi_data_plane_handle_cmd(VirtIODevice *vdev, +static void virtio_scsi_data_plane_handle_cmd(VirtIODevice *vdev, VirtQueue *vq) { -bool progress = false; 
VirtIOSCSI *s = VIRTIO_SCSI(vdev); virtio_scsi_acquire(s); if (!s->dataplane_fenced) { assert(s->ctx && s->dataplane_started); -progress = virtio_scsi_handle_cmd_vq(s, vq); +virtio_scsi_handle_cmd_vq(s, vq); } virtio_scsi_release(s); -return progress; } -static bool virtio_scsi_data_plane_handle_ctrl(VirtIODevice *vdev, +static void virtio_scsi_data_plane_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq) { -bool progress = false; VirtIOSCSI *s = VIRTIO_SCSI(vdev); virtio_scsi_acquire(s); if (!s->dataplane_fenced) { assert(s->ctx && s->dataplane_started); -progress = virtio_scsi_handle_ctrl_vq(s, vq); +virtio_scsi_handle_ctrl_vq(s, vq); } virtio_scsi_release(s); -return progress; } -static bool virtio_scsi_data_plane_handle_event(VirtIODevice *vdev, +static void virtio_scsi_data_plane_handle_event(VirtIODevice *vdev, VirtQueue *vq) { -bool progress = false; VirtIOSCSI *s = VIRTIO_SCSI(vdev); virtio_scsi_acquire(s); if (!s->dataplane_fenced) { assert(s->ctx && s->dataplane_started); -progress = virtio_scsi_handle_event_vq(s, vq); +virtio_scsi_handle_event_vq(s, vq); }
[PATCH 02/12] tests: remove aio_context_acquire() tests
The aio_context_acquire() API is being removed. Drop the test case that calls the API. Signed-off-by: Stefan Hajnoczi --- tests/unit/test-aio.c | 67 +-- 1 file changed, 1 insertion(+), 66 deletions(-) diff --git a/tests/unit/test-aio.c b/tests/unit/test-aio.c index 337b6e4ea7..e77d86be87 100644 --- a/tests/unit/test-aio.c +++ b/tests/unit/test-aio.c @@ -100,76 +100,12 @@ static void event_ready_cb(EventNotifier *e) /* Tests using aio_*. */ -typedef struct { -QemuMutex start_lock; -EventNotifier notifier; -bool thread_acquired; -} AcquireTestData; - -static void *test_acquire_thread(void *opaque) -{ -AcquireTestData *data = opaque; - -/* Wait for other thread to let us start */ -qemu_mutex_lock(&data->start_lock); -qemu_mutex_unlock(&data->start_lock); - -/* event_notifier_set might be called either before or after - * the main thread's call to poll(). The test case's outcome - * should be the same in either case. - */ -event_notifier_set(&data->notifier); -aio_context_acquire(ctx); -aio_context_release(ctx); - -data->thread_acquired = true; /* success, we got here */ - -return NULL; -} - static void set_event_notifier(AioContext *nctx, EventNotifier *notifier, EventNotifierHandler *handler) { aio_set_event_notifier(nctx, notifier, handler, NULL, NULL); } -static void dummy_notifier_read(EventNotifier *n) -{ -event_notifier_test_and_clear(n); -} - -static void test_acquire(void) -{ -QemuThread thread; -AcquireTestData data; - -/* Dummy event notifier ensures aio_poll() will block */ -event_notifier_init(&data.notifier, false); -set_event_notifier(ctx, &data.notifier, dummy_notifier_read); -g_assert(!aio_poll(ctx, false)); /* consume aio_notify() */ - -qemu_mutex_init(&data.start_lock); -qemu_mutex_lock(&data.start_lock); -data.thread_acquired = false; - -qemu_thread_create(&thread, "test_acquire_thread", - test_acquire_thread, - &data, QEMU_THREAD_JOINABLE); - -/* Block in aio_poll(), let other thread kick us and acquire context */ -aio_context_acquire(ctx); 
-qemu_mutex_unlock(&data.start_lock); /* let the thread run */ -g_assert(aio_poll(ctx, true)); -g_assert(!data.thread_acquired); -aio_context_release(ctx); - -qemu_thread_join(&thread); -set_event_notifier(ctx, &data.notifier, NULL); -event_notifier_cleanup(&data.notifier); - -g_assert(data.thread_acquired); -} - static void test_bh_schedule(void) { BHTestData data = { .n = 0 }; @@ -879,7 +815,7 @@ static void test_worker_thread_co_enter(void) qemu_thread_get_self(&this_thread); co = qemu_coroutine_create(co_check_current_thread, &this_thread); -qemu_thread_create(&worker_thread, "test_acquire_thread", +qemu_thread_create(&worker_thread, "test_aio_co_enter", test_aio_co_enter, co, QEMU_THREAD_JOINABLE); @@ -899,7 +835,6 @@ int main(int argc, char **argv) while (g_main_context_iteration(NULL, false)); g_test_init(&argc, &argv, NULL); -g_test_add_func("/aio/acquire", test_acquire); g_test_add_func("/aio/bh/schedule", test_bh_schedule); g_test_add_func("/aio/bh/schedule10", test_bh_schedule10); g_test_add_func("/aio/bh/cancel", test_bh_cancel); -- 2.42.0
[PATCH 01/12] virtio-scsi: replace AioContext lock with tmf_bh_lock
Protect the Task Management Function BH state with a lock. The TMF BH runs in the main loop thread. An IOThread might process a TMF at the same time as the TMF BH is running. Therefore tmf_bh_list and tmf_bh must be protected by a lock. Run TMF request completion in the IOThread using aio_wait_bh_oneshot(). This avoids more locking to protect the virtqueue and SCSI layer state. Signed-off-by: Stefan Hajnoczi --- include/hw/virtio/virtio-scsi.h | 3 +- hw/scsi/virtio-scsi.c | 62 ++--- 2 files changed, 43 insertions(+), 22 deletions(-) diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h index 779568ab5d..da8cb928d9 100644 --- a/include/hw/virtio/virtio-scsi.h +++ b/include/hw/virtio/virtio-scsi.h @@ -85,8 +85,9 @@ struct VirtIOSCSI { /* * TMFs deferred to main loop BH. These fields are protected by - * virtio_scsi_acquire(). + * tmf_bh_lock. */ +QemuMutex tmf_bh_lock; QEMUBH *tmf_bh; QTAILQ_HEAD(, VirtIOSCSIReq) tmf_bh_list; diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c index 9c751bf296..4f8d35facc 100644 --- a/hw/scsi/virtio-scsi.c +++ b/hw/scsi/virtio-scsi.c @@ -123,6 +123,30 @@ static void virtio_scsi_complete_req(VirtIOSCSIReq *req) virtio_scsi_free_req(req); } +static void virtio_scsi_complete_req_bh(void *opaque) +{ +VirtIOSCSIReq *req = opaque; + +virtio_scsi_complete_req(req); +} + +/* + * Called from virtio_scsi_do_one_tmf_bh() in main loop thread. The main loop + * thread cannot touch the virtqueue since that could race with an IOThread. 
+ */ +static void virtio_scsi_complete_req_from_main_loop(VirtIOSCSIReq *req) +{ +VirtIOSCSI *s = req->dev; + +if (!s->ctx || s->ctx == qemu_get_aio_context()) { +/* No need to schedule a BH when there is no IOThread */ +virtio_scsi_complete_req(req); +} else { +/* Run request completion in the IOThread */ +aio_wait_bh_oneshot(s->ctx, virtio_scsi_complete_req_bh, req); +} +} + static void virtio_scsi_bad_req(VirtIOSCSIReq *req) { virtio_error(VIRTIO_DEVICE(req->dev), "wrong size for virtio-scsi headers"); @@ -338,10 +362,7 @@ static void virtio_scsi_do_one_tmf_bh(VirtIOSCSIReq *req) out: object_unref(OBJECT(d)); - -virtio_scsi_acquire(s); -virtio_scsi_complete_req(req); -virtio_scsi_release(s); +virtio_scsi_complete_req_from_main_loop(req); } /* Some TMFs must be processed from the main loop thread */ @@ -354,18 +375,16 @@ static void virtio_scsi_do_tmf_bh(void *opaque) GLOBAL_STATE_CODE(); -virtio_scsi_acquire(s); +WITH_QEMU_LOCK_GUARD(&s->tmf_bh_lock) { +QTAILQ_FOREACH_SAFE(req, &s->tmf_bh_list, next, tmp) { +QTAILQ_REMOVE(&s->tmf_bh_list, req, next); +QTAILQ_INSERT_TAIL(&reqs, req, next); +} -QTAILQ_FOREACH_SAFE(req, &s->tmf_bh_list, next, tmp) { -QTAILQ_REMOVE(&s->tmf_bh_list, req, next); -QTAILQ_INSERT_TAIL(&reqs, req, next); +qemu_bh_delete(s->tmf_bh); +s->tmf_bh = NULL; } -qemu_bh_delete(s->tmf_bh); -s->tmf_bh = NULL; - -virtio_scsi_release(s); - QTAILQ_FOREACH_SAFE(req, &reqs, next, tmp) { QTAILQ_REMOVE(&reqs, req, next); virtio_scsi_do_one_tmf_bh(req); @@ -379,8 +398,7 @@ static void virtio_scsi_reset_tmf_bh(VirtIOSCSI *s) GLOBAL_STATE_CODE(); -virtio_scsi_acquire(s); - +/* Called after ioeventfd has been stopped, so tmf_bh_lock is not needed */ if (s->tmf_bh) { qemu_bh_delete(s->tmf_bh); s->tmf_bh = NULL; @@ -393,19 +411,19 @@ static void virtio_scsi_reset_tmf_bh(VirtIOSCSI *s) req->resp.tmf.response = VIRTIO_SCSI_S_TARGET_FAILURE; virtio_scsi_complete_req(req); } - -virtio_scsi_release(s); } static void virtio_scsi_defer_tmf_to_bh(VirtIOSCSIReq *req) { 
VirtIOSCSI *s = req->dev; -QTAILQ_INSERT_TAIL(&s->tmf_bh_list, req, next); +WITH_QEMU_LOCK_GUARD(&s->tmf_bh_lock) { +QTAILQ_INSERT_TAIL(&s->tmf_bh_list, req, next); -if (!s->tmf_bh) { -s->tmf_bh = qemu_bh_new(virtio_scsi_do_tmf_bh, s); -qemu_bh_schedule(s->tmf_bh); +if (!s->tmf_bh) { +s->tmf_bh = qemu_bh_new(virtio_scsi_do_tmf_bh, s); +qemu_bh_schedule(s->tmf_bh); +} } } @@ -1235,6 +1253,7 @@ static void virtio_scsi_device_realize(DeviceState *dev, Error **errp) Error *err = NULL; QTAILQ_INIT(&s->tmf_bh_list); +qemu_mutex_init(&s->tmf_bh_lock); virtio_scsi_common_realize(dev, virtio_scsi_handle_ctrl, @@ -1277,6 +1296,7 @@ static void virtio_scsi_device_unrealize(DeviceState *dev) qbus_set_hotplug_handler(BUS(&s->bus), NULL); virtio_scsi_common_unrealize(dev); +qemu_mutex_destroy(&s->tmf_bh_lock); } static Property virtio_scsi_properties[] = { -- 2.42.0
[PATCH 03/12] aio: make aio_context_acquire()/aio_context_release() a no-op
aio_context_acquire()/aio_context_release() has been replaced by fine-grained locking to protect state shared by multiple threads. The AioContext lock still plays the role of balancing locking in AIO_WAIT_WHILE() and many functions in QEMU either require that the AioContext lock is held or not held for this reason. In other words, the AioContext lock is purely there for consistency with itself and serves no real purpose anymore. Stop actually acquiring/releasing the lock in aio_context_acquire()/aio_context_release() so that subsequent patches can remove callers across the codebase incrementally. I have performed "make check" and qemu-iotests stress tests across x86-64, ppc64le, and aarch64 to confirm that there are no failures as a result of eliminating the lock. Signed-off-by: Stefan Hajnoczi --- util/async.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/util/async.c b/util/async.c index 8f90ddc304..04ee83d220 100644 --- a/util/async.c +++ b/util/async.c @@ -725,12 +725,12 @@ void aio_context_unref(AioContext *ctx) void aio_context_acquire(AioContext *ctx) { -qemu_rec_mutex_lock(&ctx->lock); +/* TODO remove this function */ } void aio_context_release(AioContext *ctx) { -qemu_rec_mutex_unlock(&ctx->lock); +/* TODO remove this function */ } QEMU_DEFINE_STATIC_CO_TLS(AioContext *, my_aiocontext) -- 2.42.0
[PATCH 00/12] aio: remove AioContext lock
This series removes the AioContext locking APIs from QEMU. aio_context_acquire() and aio_context_release() are currently only needed to support the locking discipline required by AIO_WAIT_WHILE() (except for a stray user that I converted in Patch 1). AIO_WAIT_WHILE() doesn't really need the AioContext lock anymore, so it's possible to remove the API. This is a nice simplification because the AioContext locking rules were sometimes tricky or underspecified, leading to many bugs over the years.

This patch series removes these APIs across the codebase and cleans up the documentation/comments that refer to them.

Patch 1 is an AioContext lock user I forgot to convert in my earlier SCSI conversion series.

Patch 2 removes tests for the AioContext lock because they will no longer be needed when the lock is gone.

Patches 3-9 remove the AioContext lock. These can be reviewed by categorizing the call sites into 1. places that take the lock because they call an API that requires the lock (ultimately AIO_WAIT_WHILE()) and 2. places that take the lock to protect state. There should be no instances of case 2 left. If you see one, you've found a bug in this patch series!

Patches 10-12 remove comments.

Based-on: 20231123194931.171598-1-stefa...@redhat.com ("[PATCH 0/4] scsi: eliminate AioContext lock") Since SCSI needs to stop relying on the AioContext lock before we can remove the lock.
Stefan Hajnoczi (12): virtio-scsi: replace AioContext lock with tmf_bh_lock tests: remove aio_context_acquire() tests aio: make aio_context_acquire()/aio_context_release() a no-op graph-lock: remove AioContext locking block: remove AioContext locking scsi: remove AioContext locking aio-wait: draw equivalence between AIO_WAIT_WHILE() and AIO_WAIT_WHILE_UNLOCKED() aio: remove aio_context_acquire()/aio_context_release() API docs: remove AioContext lock from IOThread docs scsi: remove outdated AioContext lock comment job: remove outdated AioContext locking comments block: remove outdated AioContext locking comments docs/devel/multiple-iothreads.txt| 45 ++-- include/block/aio-wait.h | 16 +- include/block/aio.h | 17 -- include/block/block-common.h | 3 - include/block/block-global-state.h | 9 +- include/block/block-io.h | 12 +- include/block/block_int-common.h | 2 - include/block/graph-lock.h | 21 +- include/block/snapshot.h | 2 - include/hw/virtio/virtio-scsi.h | 17 +- include/qemu/job.h | 20 -- block.c | 357 --- block/backup.c | 4 +- block/blklogwrites.c | 8 +- block/blkverify.c| 4 +- block/block-backend.c| 33 +-- block/commit.c | 16 +- block/copy-before-write.c| 22 +- block/export/export.c| 22 +- block/export/vhost-user-blk-server.c | 4 - block/graph-lock.c | 44 +--- block/io.c | 45 +--- block/mirror.c | 41 +-- block/monitor/bitmap-qmp-cmds.c | 20 +- block/monitor/block-hmp-cmds.c | 29 --- block/qapi-sysemu.c | 27 +- block/qapi.c | 18 +- block/qcow2.c| 4 +- block/quorum.c | 8 +- block/raw-format.c | 5 - block/replication.c | 72 +- block/snapshot.c | 26 +- block/stream.c | 12 +- block/vmdk.c | 20 +- block/write-threshold.c | 6 - blockdev.c | 315 +-- blockjob.c | 30 +-- hw/block/dataplane/virtio-blk.c | 10 - hw/block/dataplane/xen-block.c | 17 +- hw/block/virtio-blk.c| 45 +--- hw/core/qdev-properties-system.c | 9 - hw/scsi/scsi-bus.c | 2 - hw/scsi/scsi-disk.c | 29 +-- hw/scsi/virtio-scsi.c| 80 +++--- job.c| 16 -- migration/block.c| 33 +-- migration/migration-hmp-cmds.c | 
3 - migration/savevm.c | 22 -- net/colo-compare.c | 2 - qemu-img.c | 4 - qemu-io.c| 10 +- qemu-nbd.c | 2 - replay/replay-debugging.c| 4 - tests/unit/test-aio.c| 67 + tests/unit/test-bdrv-drain.c | 91 ++- tests/unit/test-bdrv-graph-mod.c | 26 +- tests/unit/test-block-iothread.c | 31 --- tests/unit/test-blockjob.c | 137 -- tests/unit/test-replication.c| 11 - util/async.c | 14 -- util/vhost-user-server.c
[PATCH 08/12] aio: remove aio_context_acquire()/aio_context_release() API
Delete these functions because nothing calls these functions anymore. I introduced these APIs in commit 98563fc3ec44 ("aio: add aio_context_acquire() and aio_context_release()") in 2014. It's with a sigh of relief that I delete these APIs almost 10 years later. Thanks to Paolo Bonzini's vision for multi-queue QEMU, we got an understanding of where the code needed to go in order to remove the limitations that the original dataplane and the IOThread/AioContext approach that followed it had. Emanuele Giuseppe Esposito had the splendid determination to convert large parts of the codebase so that they no longer needed the AioContext lock. This was a painstaking process, both in the actual code changes required and the iterations of code review that Emanuele eked out of Kevin and me over many months. Kevin Wolf tackled multitudes of graph locking conversions to protect in-flight I/O from run-time changes to the block graph as well as the clang Thread Safety Analysis annotations that allow the compiler to check whether the graph lock is being used correctly. And me, well, I'm just here to add some pizzazz to the QEMU multi-queue block layer :). Thank you to everyone who helped with this effort, including Eric Blake, code reviewer extraordinaire, and others who I've forgotten to mention. Signed-off-by: Stefan Hajnoczi --- include/block/aio.h | 17 - util/async.c| 10 -- 2 files changed, 27 deletions(-) diff --git a/include/block/aio.h b/include/block/aio.h index f08b358077..af05512a7d 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -278,23 +278,6 @@ void aio_context_ref(AioContext *ctx); */ void aio_context_unref(AioContext *ctx); -/* Take ownership of the AioContext. If the AioContext will be shared between - * threads, and a thread does not want to be interrupted, it will have to - * take ownership around calls to aio_poll(). Otherwise, aio_poll() - * automatically takes care of calling aio_context_acquire and - * aio_context_release. 
- * - * Note that this is separate from bdrv_drained_begin/bdrv_drained_end. A - * thread still has to call those to avoid being interrupted by the guest. - * - * Bottom halves, timers and callbacks can be created or removed without - * acquiring the AioContext. - */ -void aio_context_acquire(AioContext *ctx); - -/* Relinquish ownership of the AioContext. */ -void aio_context_release(AioContext *ctx); - /** * aio_bh_schedule_oneshot_full: Allocate a new bottom half structure that will * run only once and as soon as possible. diff --git a/util/async.c b/util/async.c index dfd44ef612..460529057c 100644 --- a/util/async.c +++ b/util/async.c @@ -719,16 +719,6 @@ void aio_context_unref(AioContext *ctx) g_source_unref(&ctx->source); } -void aio_context_acquire(AioContext *ctx) -{ -/* TODO remove this function */ -} - -void aio_context_release(AioContext *ctx) -{ -/* TODO remove this function */ -} - QEMU_DEFINE_STATIC_CO_TLS(AioContext *, my_aiocontext) AioContext *qemu_get_current_aio_context(void) -- 2.42.0