Re: [PATCH] vvfat: fix ubsan issue in create_long_filename

2024-12-16 Thread Pierrick Bouvier

Hi everyone,

gentle ping on this series.

On 12/4/24 11:51, Pierrick Bouvier wrote:

Found with test sbsaref introduced in [1].

[1] 
https://patchew.org/QEMU/20241203213629.2482806-1-pierrick.bouv...@linaro.org/

../block/vvfat.c:433:24: runtime error: index 14 out of bounds for type 
'uint8_t [11]'
 #0 0x56151a66b93a in create_long_filename ../block/vvfat.c:433
 #1 0x56151a66f3d7 in create_short_and_long_name ../block/vvfat.c:725
 #2 0x56151a670403 in read_directory ../block/vvfat.c:804
 #3 0x56151a674432 in init_directories ../block/vvfat.c:964
 #4 0x56151a67867b in vvfat_open ../block/vvfat.c:1258
 #5 0x56151a3b8e19 in bdrv_open_driver ../block.c:1660
 #6 0x56151a3bb666 in bdrv_open_common ../block.c:1985
 #7 0x56151a3cadb9 in bdrv_open_inherit ../block.c:4153
 #8 0x56151a3c8850 in bdrv_open_child_bs ../block.c:3731
 #9 0x56151a3ca832 in bdrv_open_inherit ../block.c:4098
 #10 0x56151a3cbe40 in bdrv_open ../block.c:4248
 #11 0x56151a46344f in blk_new_open ../block/block-backend.c:457
 #12 0x56151a388bd9 in blockdev_init ../blockdev.c:612
 #13 0x56151a38ab2d in drive_new ../blockdev.c:1006
 #14 0x5615190fca41 in drive_init_func ../system/vl.c:649
 #15 0x56151aa796dd in qemu_opts_foreach ../util/qemu-option.c:1135
 #16 0x5615190fd2b6 in configure_blockdev ../system/vl.c:708
 #17 0x56151910a307 in qemu_create_early_backends ../system/vl.c:2004
 #18 0x561519113fcf in qemu_init ../system/vl.c:3685
 #19 0x56151a7e438e in main ../system/main.c:47
 #20 0x7f72d1a46249 in __libc_start_call_main 
../sysdeps/nptl/libc_start_call_main.h:58
 #21 0x7f72d1a46304 in __libc_start_main_impl ../csu/libc-start.c:360
 #22 0x561517e98510 in _start 
(/home/user/.work/qemu/build/qemu-system-aarch64+0x3b9b510)

The offset used can easily go beyond the size of entry->name. It's
probably a bug, but I don't have the time to dive into vfat specifics
for now.

This change solves the ubsan issue and is functionally equivalent, as
anything written past the entry->name array would not be read anyway.

Signed-off-by: Pierrick Bouvier 
---
  block/vvfat.c | 4 ++++
  1 file changed, 4 insertions(+)

diff --git a/block/vvfat.c b/block/vvfat.c
index 8ffe8b3b9bf..f2eafaa9234 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -426,6 +426,10 @@ static direntry_t *create_long_filename(BDRVVVFATState *s, 
const char *filename)
  else if(offset<22) offset=14+offset-10;
  else offset=28+offset-22;
  entry=array_get(&(s->directory),s->directory.next-1-(i/26));
+/* ensure we don't write anything past entry->name */
+if (offset >= sizeof(entry->name)) {
+continue;
+}
  if (i >= 2 * length + 2) {
  entry->name[offset] = 0xff;
  } else if (i % 2 == 0) {
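For context on the offset arithmetic above: a VFAT long-filename entry is a
32-byte directory entry storing 13 UTF-16LE code units in three slices
(bytes 1-10, 14-25 and 28-31), while entry->name only spans bytes 0-10,
which is why offsets of 14 and up trip UBSan. A standalone sketch of that
mapping (illustrative helper, not part of the patch):

    #include <stdio.h>

    /*
     * Mirror of the offset computation in create_long_filename():
     * maps byte i of the UTF-16LE name to its byte offset inside a
     * 32-byte LFN directory entry (13 characters = 26 bytes each).
     */
    static int lfn_byte_offset(int i)
    {
        int offset = i % 26;
        if (offset < 10) {
            return 1 + offset;           /* bytes  1..10 (name1) */
        } else if (offset < 22) {
            return 14 + (offset - 10);   /* bytes 14..25 (name2) */
        }
        return 28 + (offset - 22);       /* bytes 28..31 (name3) */
    }

    int main(void)
    {
        for (int i = 0; i < 26; i++) {
            printf("name byte %2d -> entry byte %2d\n",
                   i, lfn_byte_offset(i));
        }
        return 0;
    }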





Re: [PATCH v2 6/6] migration/block: Rewrite disk activation

2024-12-16 Thread Peter Xu
On Mon, Dec 16, 2024 at 12:58:58PM -0300, Fabiano Rosas wrote:
> > @@ -3286,6 +3237,11 @@ static void 
> > migration_iteration_finish(MigrationState *s)
> >  case MIGRATION_STATUS_FAILED:
> >  case MIGRATION_STATUS_CANCELLED:
> >  case MIGRATION_STATUS_CANCELLING:
> 
> Pre-existing, but can we even reach here with CANCELLED? If we can start
> the VM with both CANCELLED and CANCELLING, that means the
> MIG_EVENT_PRECOPY_FAILED notifier is not being consistently called. So I
> think CANCELLED here must be unreachable...

Yeah I can't see how it's reachable, because the only place that we can set
it to CANCELLED is:

migrate_fd_cleanup():
if (s->state == MIGRATION_STATUS_CANCELLING) {
migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
  MIGRATION_STATUS_CANCELLED);
}

In this specific context, it (as a bottom half) can only be scheduled after
this path.

Looks like the event MIG_EVENT_PRECOPY_FAILED will work regardless, though?
As that's also done after the above:

type = migration_has_failed(s) ? MIG_EVENT_PRECOPY_FAILED :
 MIG_EVENT_PRECOPY_DONE;
migration_call_notifiers(s, type, NULL);

So it looks like no matter whether it was CANCELLING or CANCELLED, it'll
always be CANCELLED when reaching migration_has_failed().

[...]

> > @@ -103,13 +104,8 @@ void qmp_cont(Error **errp)
> >   * Continuing after completed migration. Images have been
> >   * inactivated to allow the destination to take control. Need to
> >   * get control back now.
> > - *
> > - * If there are no inactive block nodes (e.g. because the VM was
> > - * just paused rather than completing a migration),
> > - * bdrv_inactivate_all() simply doesn't do anything.
> >   */
> > -bdrv_activate_all(&local_err);
> > -if (local_err) {
> > +if (!migration_block_activate(&local_err)) {
> >  error_propagate(errp, local_err);
> 
> Could use errp directly here.

True..

-- 
Peter Xu




Re: libnfs 6.0.0 breaks qemu compilation

2024-12-16 Thread Xavier Bachelot

On 2024-12-16 11:43, Daniel P. Berrangé wrote:

On Sat, Dec 14, 2024 at 01:40:45PM +0100, Paolo Bonzini wrote:
On Fri, Dec 13, 2024 at 20:19, Richard W.M. Jones  wrote:


> On Fri, Dec 13, 2024 at 07:37:10PM +0100, Paolo Bonzini wrote:
> > Yeah, and I don't think it should be merged, unless libnfs support is
> dropped
> > from the QEMU build in rawhide.
>
> Sure if there's no easy fix on the horizon, we can remove libnfs
> support temporarily.
>

Can we just keep the old libnfs indefinitely? Are there any killer
features for dependencies other than QEMU?


QEMU needs fixing to work with both old and new libnfs. Even if
Fedora pinned to old libnfs, we can be sure that sooner or later
the new libnfs will appear in other distros and thus will inevitably
need to deal with this incompatibility.



To answer a previous question, there's no hurry in switching to the new 
libnfs 6, especially as all affected code will need to be patched to 
account for the new API. However, it will indeed need to be fixed 
sooner or later, as I don't think upstream will maintain the old libnfs 5 
branch.


Regards,
Xavier



Re: [PATCH v2 2/6] qmp/cont: Only activate disks if migration completed

2024-12-16 Thread Fabiano Rosas
Peter Xu  writes:

> As the comment says, the activation of disks is for the case where
> migration has completed, rather than when QEMU is still during
> migration (RUN_STATE_INMIGRATE).
>
> Move the code over to reflect what the comment is describing.
>
> Cc: Kevin Wolf 
> Cc: Markus Armbruster 
> Signed-off-by: Peter Xu 

Reviewed-by: Fabiano Rosas 



Re: [PATCH v2 3/6] migration/block: Make late-block-active the default

2024-12-16 Thread Fabiano Rosas
Peter Xu  writes:

> Migration capability 'late-block-active' controls when the block drives
> will be activated.  If enabled, block drives will only be activated when
> the VM starts: either the src runstate was "live" (RUNNING or SUSPENDED),
> or it'll be postponed until qmp_cont().
>
> Let's do this unconditionally.  There's no harm in delaying activation of
> the block drives.  Meanwhile there's no ABI breakage if the dest does it,
> because the src QEMU has nothing to do with it.
>
> IIUC we could have avoided introducing this cap in the first place, but
> it's still not too late to just always do it.  The cap is now a candidate
> for removal, but that's left for later patches.
>
> Signed-off-by: Peter Xu 

Reviewed-by: Fabiano Rosas 



Re: [PATCH v2 4/6] migration/block: Apply late-block-active behavior to postcopy

2024-12-16 Thread Fabiano Rosas
Peter Xu  writes:

> Postcopy never cared about late-block-active.  However, there's no mention
> in the capability description that it doesn't apply to postcopy.
>
> Considering that we _assumed_ late activation is always good, do that too
> for postcopy unconditionally, just like precopy.  After this patch, the
> behavior should be unified across the board.
>
> Signed-off-by: Peter Xu 

Reviewed-by: Fabiano Rosas 



Re: [PATCH v2 5/6] migration/block: Fix possible race with block_inactive

2024-12-16 Thread Fabiano Rosas
Peter Xu  writes:

> Src QEMU sets block_inactive=true very early, before the invalidation takes
> place.  That means if something goes wrong after setting the flag but
> before reaching qemu_savevm_state_complete_precopy_non_iterable(), where
> the invalidation work is done, the block_inactive flag is left inconsistent.
>
> For example, think about when qemu_savevm_state_complete_precopy_iterable()
> can fail: it will have block_inactive set to true even if all block drives
> are active.
>
> Fix that by only updating the flag after the invalidation is done.
>
> No Fixes tag for any commit, because it's not an issue as long as
> bdrv_activate_all() is re-entrant on all-active disks - a false-positive
> block_inactive brings nothing worse than "trying to activate the blocks
> but they're already active".  However, let's still do it right to avoid
> the flag being inconsistent with reality.
>
> Signed-off-by: Peter Xu 

Reviewed-by: Fabiano Rosas 



[PATCH 1/9] hw/nvme: always initialize a subsystem

2024-12-16 Thread Klaus Jensen
From: Klaus Jensen 

If no nvme-subsys is explicitly configured, instantiate one.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/ctrl.c | 36 ++---
 hw/nvme/ns.c   | 64 --
 2 files changed, 42 insertions(+), 58 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 
ec754195669598082dd573ff1237dbbfb9b8aff5..e1d54ee34d2cd073821ac398bc5b4a51cb79e0e9
 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -8759,15 +8759,13 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
*pci_dev)
 id->psd[0].enlat = cpu_to_le32(0x10);
 id->psd[0].exlat = cpu_to_le32(0x4);
 
-if (n->subsys) {
-id->cmic |= NVME_CMIC_MULTI_CTRL;
-ctratt |= NVME_CTRATT_ENDGRPS;
+id->cmic |= NVME_CMIC_MULTI_CTRL;
+ctratt |= NVME_CTRATT_ENDGRPS;
 
-id->endgidmax = cpu_to_le16(0x1);
+id->endgidmax = cpu_to_le16(0x1);
 
-if (n->subsys->endgrp.fdp.enabled) {
-ctratt |= NVME_CTRATT_FDPS;
-}
+if (n->subsys->endgrp.fdp.enabled) {
+ctratt |= NVME_CTRATT_FDPS;
 }
 
 id->ctratt = cpu_to_le32(ctratt);
@@ -8796,7 +8794,15 @@ static int nvme_init_subsys(NvmeCtrl *n, Error **errp)
 int cntlid;
 
 if (!n->subsys) {
-return 0;
+DeviceState *dev = qdev_new(TYPE_NVME_SUBSYS);
+
+qdev_prop_set_string(dev, "nqn", n->params.serial);
+
+if (!qdev_realize(dev, NULL, errp)) {
+return -1;
+}
+
+n->subsys = NVME_SUBSYS(dev);
 }
 
 cntlid = nvme_subsys_register_ctrl(n, errp);
@@ -8886,17 +8892,15 @@ static void nvme_exit(PCIDevice *pci_dev)
 
 nvme_ctrl_reset(n, NVME_RESET_FUNCTION);
 
-if (n->subsys) {
-for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-ns = nvme_ns(n, i);
-if (ns) {
-ns->attached--;
-}
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+ns = nvme_ns(n, i);
+if (ns) {
+ns->attached--;
 }
-
-nvme_subsys_unregister_ctrl(n->subsys, n);
 }
 
+nvme_subsys_unregister_ctrl(n->subsys, n);
+
 g_free(n->cq);
 g_free(n->sq);
 g_free(n->aer_reqs);
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 
526e15aa80187b82f9622445849b03fd472da8df..3be43503c50798db0ab528fe30ad901bb6aa9db3
 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -727,25 +727,14 @@ static void nvme_ns_realize(DeviceState *dev, Error 
**errp)
 uint32_t nsid = ns->params.nsid;
 int i;
 
-if (!n->subsys) {
-/* If no subsys, the ns cannot be attached to more than one ctrl. */
-ns->params.shared = false;
-if (ns->params.detached) {
-error_setg(errp, "detached requires that the nvme device is "
-   "linked to an nvme-subsys device");
-return;
-}
-} else {
-/*
- * If this namespace belongs to a subsystem (through a link on the
- * controller device), reparent the device.
- */
-if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
-return;
-}
-ns->subsys = subsys;
-ns->endgrp = &subsys->endgrp;
+assert(subsys);
+
+/* reparent to subsystem bus */
+if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+return;
 }
+ns->subsys = subsys;
+ns->endgrp = &subsys->endgrp;
 
 if (nvme_ns_setup(ns, errp)) {
 return;
@@ -753,7 +742,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 
 if (!nsid) {
 for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-if (nvme_ns(n, i) || nvme_subsys_ns(subsys, i)) {
+if (nvme_subsys_ns(subsys, i)) {
 continue;
 }
 
@@ -765,38 +754,29 @@ static void nvme_ns_realize(DeviceState *dev, Error 
**errp)
 error_setg(errp, "no free namespace id");
 return;
 }
-} else {
-if (nvme_ns(n, nsid) || nvme_subsys_ns(subsys, nsid)) {
-error_setg(errp, "namespace id '%d' already allocated", nsid);
-return;
-}
+} else if (nvme_subsys_ns(subsys, nsid)) {
+error_setg(errp, "namespace id '%d' already allocated", nsid);
+return;
 }
 
-if (subsys) {
-subsys->namespaces[nsid] = ns;
+subsys->namespaces[nsid] = ns;
 
-ns->id_ns.endgid = cpu_to_le16(0x1);
-ns->id_ns_ind.endgrpid = cpu_to_le16(0x1);
+ns->id_ns.endgid = cpu_to_le16(0x1);
+ns->id_ns_ind.endgrpid = cpu_to_le16(0x1);
 
-if (ns->params.detached) {
-return;
-}
+if (ns->params.detached) {
+return;
+}
 
-if (ns->params.shared) {
-for (i = 0; i < ARRAY_SIZE(subsys->ctrls); i++) {
-NvmeCtrl *ctrl = subsys->ctrls[i];
+if (ns->params.shared) {
+for (i = 0; i < ARRAY_SIZE(subsys->ctrls); i++) {
+NvmeCtrl *ctrl = subsys->ctrls[i];
 
-i

[PATCH 6/9] hw/nvme: rework csi handling

2024-12-16 Thread Klaus Jensen
From: Klaus Jensen 

The controller incorrectly allows a zoned namespace to be attached even
if CC.CSS is configured to only support the NVM command set for I/O
queues.

Rework handling of namespace command sets in general by attaching
supported namespaces when the controller is started instead of, as now,
statically when it is realized.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/ctrl.c   | 242 ---
 hw/nvme/ns.c |  14 ---
 hw/nvme/nvme.h   |   5 +-
 include/block/nvme.h |  10 ++-
 4 files changed, 142 insertions(+), 129 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 
22a8c400bae332285d015e8b590de159fd7d1b7a..8d3f62c40ac14fdc4bdc650e272023558cbbae0f
 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -277,36 +277,32 @@ static const uint32_t nvme_cse_acs_default[256] = {
 [NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP,
-[NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
+[NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC |
+  NVME_CMD_EFF_CCC,
 [NVME_ADM_CMD_FORMAT_NVM]   = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 [NVME_ADM_CMD_DIRECTIVE_RECV]   = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_DIRECTIVE_SEND]   = NVME_CMD_EFF_CSUPP,
 };
 
-static const uint32_t nvme_cse_iocs_none[256];
-
-static const uint32_t nvme_cse_iocs_nvm[256] = {
-[NVME_CMD_FLUSH]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-[NVME_CMD_WRITE]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-[NVME_CMD_READ] = NVME_CMD_EFF_CSUPP,
-[NVME_CMD_DSM]  = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-[NVME_CMD_VERIFY]   = NVME_CMD_EFF_CSUPP,
-[NVME_CMD_COPY] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-[NVME_CMD_COMPARE]  = NVME_CMD_EFF_CSUPP,
-[NVME_CMD_IO_MGMT_RECV] = NVME_CMD_EFF_CSUPP,
-[NVME_CMD_IO_MGMT_SEND] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+#define NVME_CSE_IOCS_NVM_DEFAULT \
+[NVME_CMD_FLUSH]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC, \
+[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC, \
+[NVME_CMD_WRITE]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC, \
+[NVME_CMD_READ] = NVME_CMD_EFF_CSUPP, \
+[NVME_CMD_DSM]  = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC, \
+[NVME_CMD_VERIFY]   = NVME_CMD_EFF_CSUPP, \
+[NVME_CMD_COPY] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC, \
+[NVME_CMD_COMPARE]  = NVME_CMD_EFF_CSUPP, \
+[NVME_CMD_IO_MGMT_RECV] = NVME_CMD_EFF_CSUPP, \
+[NVME_CMD_IO_MGMT_SEND] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC
+
+static const uint32_t nvme_cse_iocs_nvm_default[256] = {
+NVME_CSE_IOCS_NVM_DEFAULT,
 };
 
-static const uint32_t nvme_cse_iocs_zoned[256] = {
-[NVME_CMD_FLUSH]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-[NVME_CMD_WRITE]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-[NVME_CMD_READ] = NVME_CMD_EFF_CSUPP,
-[NVME_CMD_DSM]  = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-[NVME_CMD_VERIFY]   = NVME_CMD_EFF_CSUPP,
-[NVME_CMD_COPY] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-[NVME_CMD_COMPARE]  = NVME_CMD_EFF_CSUPP,
+static const uint32_t nvme_cse_iocs_zoned_default[256] = {
+NVME_CSE_IOCS_NVM_DEFAULT,
+
 [NVME_CMD_ZONE_APPEND]  = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 [NVME_CMD_ZONE_MGMT_SEND]   = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 [NVME_CMD_ZONE_MGMT_RECV]   = NVME_CMD_EFF_CSUPP,
@@ -4603,6 +4599,61 @@ static uint16_t nvme_io_mgmt_send(NvmeCtrl *n, 
NvmeRequest *req)
 };
 }
 
+static uint16_t __nvme_io_cmd_nvm(NvmeCtrl *n, NvmeRequest *req)
+{
+switch (req->cmd.opcode) {
+case NVME_CMD_WRITE:
+return nvme_write(n, req);
+case NVME_CMD_READ:
+return nvme_read(n, req);
+case NVME_CMD_COMPARE:
+return nvme_compare(n, req);
+case NVME_CMD_WRITE_ZEROES:
+return nvme_write_zeroes(n, req);
+case NVME_CMD_DSM:
+return nvme_dsm(n, req);
+case NVME_CMD_VERIFY:
+return nvme_verify(n, req);
+case NVME_CMD_COPY:
+return nvme_copy(n, req);
+case NVME_CMD_IO_MGMT_RECV:
+return nvme_io_mgmt_recv(n, req);
+case NVME_CMD_IO_MGMT_SEND:
+return nvme_io_mgmt_send(n, req);
+}
+
+g_assert_not_reached();
+}
+
+static uint16_

[PATCH 4/9] nvme: fix iocs status code values

2024-12-16 Thread Klaus Jensen
From: Klaus Jensen 

The status codes related to I/O Command Sets are in the wrong group.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/ctrl.c   | 4 ++--
 include/block/nvme.h | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 
d544789f92ffe6b758ce35cecfc025d87efb9b7e..120a1ca1076c8110d8550a5e75082c6ed4f23e16
 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -5623,7 +5623,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest 
*req, bool active)
 return nvme_c2h(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs), req);
 }
 
-return NVME_INVALID_CMD_SET | NVME_DNR;
+return NVME_INVALID_IOCS | NVME_DNR;
 }
 
 static uint16_t nvme_identify_ctrl_list(NvmeCtrl *n, NvmeRequest *req,
@@ -6589,7 +6589,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest 
*req)
 case NVME_COMMAND_SET_PROFILE:
 if (dw11 & 0x1ff) {
 trace_pci_nvme_err_invalid_iocsci(dw11 & 0x1ff);
-return NVME_CMD_SET_CMB_REJECTED | NVME_DNR;
+return NVME_IOCS_COMBINATION_REJECTED | NVME_DNR;
 }
 break;
 case NVME_FDP_MODE:
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 
a68a07455d0330b8f7cc283da0a5eadbcc140dab..145a0b65933a699504d6d89222f7979a06f615df
 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -906,8 +906,6 @@ enum NvmeStatusCodes {
 NVME_SGL_DESCR_TYPE_INVALID = 0x0011,
 NVME_INVALID_USE_OF_CMB = 0x0012,
 NVME_INVALID_PRP_OFFSET = 0x0013,
-NVME_CMD_SET_CMB_REJECTED   = 0x002b,
-NVME_INVALID_CMD_SET= 0x002c,
 NVME_FDP_DISABLED   = 0x0029,
 NVME_INVALID_PHID_LIST  = 0x002a,
 NVME_LBA_RANGE  = 0x0080,
@@ -940,6 +938,8 @@ enum NvmeStatusCodes {
 NVME_INVALID_SEC_CTRL_STATE = 0x0120,
 NVME_INVALID_NUM_RESOURCES  = 0x0121,
 NVME_INVALID_RESOURCE_ID= 0x0122,
+NVME_IOCS_COMBINATION_REJECTED = 0x012b,
+NVME_INVALID_IOCS   = 0x012c,
 NVME_CONFLICTING_ATTRS  = 0x0180,
 NVME_INVALID_PROT_INFO  = 0x0181,
 NVME_WRITE_TO_RO= 0x0182,
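For context (not part of the patch): hw/nvme packs the NVMe Status Code
Type into the high byte of these enum values, so generic status sits at
0x00xx and command-specific status (SCT 1h) at 0x01xx, hence the move from
0x002b/0x002c to 0x012b/0x012c. A minimal sketch of that packing
convention:

    #include <assert.h>

    /* hw/nvme-style packed status: SCT in bits 10:8, SC in bits 7:0. */
    #define SCT(status) (((status) >> 8) & 0x7)
    #define SC(status)  ((status) & 0xff)

    int main(void)
    {
        /* NVME_INVALID_IOCS = 0x012c: command specific (1h), code 0x2c */
        assert(SCT(0x012c) == 0x1 && SC(0x012c) == 0x2c);
        /* old NVME_INVALID_CMD_SET = 0x002c claimed generic (0h) */
        assert(SCT(0x002c) == 0x0 && SC(0x002c) == 0x2c);
        return 0;
    }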

-- 
2.45.2




[PATCH 8/9] hw/nvme: set error status code explicitly for misc commands

2024-12-16 Thread Klaus Jensen
From: Klaus Jensen 

nvme_aio_err() does not handle Verify, Compare, Copy and other misc
commands, and defaults to setting the error status code to Internal
Device Error. For some of these commands, we know better, so set it
explicitly.

For the commands using the nvme_misc_cb() callback (Copy, Flush, ...),
if no status code has explicitly been set by the lower-level handlers,
default to Internal Device Error as before.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/ctrl.c   | 28 ++++++++++++++++++++++------
 include/block/nvme.h |  1 +
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 
5b1bac020f049cc2a2f869b12e1d2a7e13cef316..8192f92227d6509b8d15fde9d9197a59277eb86f
 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1765,7 +1765,6 @@ static void nvme_aio_err(NvmeRequest *req, int ret)
 case NVME_CMD_READ:
 status = NVME_UNRECOVERED_READ;
 break;
-case NVME_CMD_FLUSH:
 case NVME_CMD_WRITE:
 case NVME_CMD_WRITE_ZEROES:
 case NVME_CMD_ZONE_APPEND:
@@ -2151,11 +2150,16 @@ static inline bool nvme_is_write(NvmeRequest *req)
 static void nvme_misc_cb(void *opaque, int ret)
 {
 NvmeRequest *req = opaque;
+uint16_t cid = nvme_cid(req);
 
-trace_pci_nvme_misc_cb(nvme_cid(req));
+trace_pci_nvme_misc_cb(cid);
 
 if (ret) {
-nvme_aio_err(req, ret);
+if (!req->status) {
+req->status = NVME_INTERNAL_DEV_ERROR;
+}
+
+trace_pci_nvme_err_aio(cid, strerror(-ret), req->status);
 }
 
 nvme_enqueue_req_completion(nvme_cq(req), req);
@@ -2258,7 +2262,10 @@ static void nvme_verify_cb(void *opaque, int ret)
 
 if (ret) {
 block_acct_failed(stats, acct);
-nvme_aio_err(req, ret);
+req->status = NVME_UNRECOVERED_READ;
+
+trace_pci_nvme_err_aio(nvme_cid(req), strerror(-ret), req->status);
+
 goto out;
 }
 
@@ -2357,7 +2364,10 @@ static void nvme_compare_mdata_cb(void *opaque, int ret)
 
 if (ret) {
 block_acct_failed(stats, acct);
-nvme_aio_err(req, ret);
+req->status = NVME_UNRECOVERED_READ;
+
+trace_pci_nvme_err_aio(nvme_cid(req), strerror(-ret), req->status);
+
 goto out;
 }
 
@@ -2439,7 +2449,10 @@ static void nvme_compare_data_cb(void *opaque, int ret)
 
 if (ret) {
 block_acct_failed(stats, acct);
-nvme_aio_err(req, ret);
+req->status = NVME_UNRECOVERED_READ;
+
+trace_pci_nvme_err_aio(nvme_cid(req), strerror(-ret), req->status);
+
 goto out;
 }
 
@@ -2918,6 +2931,7 @@ static void nvme_copy_out_completed_cb(void *opaque, int 
ret)
 
 if (ret < 0) {
 iocb->ret = ret;
+req->status = NVME_WRITE_FAULT;
 goto out;
 } else if (iocb->ret < 0) {
 goto out;
@@ -2982,6 +2996,7 @@ static void nvme_copy_in_completed_cb(void *opaque, int 
ret)
 
 if (ret < 0) {
 iocb->ret = ret;
+req->status = NVME_UNRECOVERED_READ;
 goto out;
 } else if (iocb->ret < 0) {
 goto out;
@@ -3504,6 +3519,7 @@ static void nvme_flush_ns_cb(void *opaque, int ret)
 
 if (ret < 0) {
 iocb->ret = ret;
+iocb->req->status = NVME_WRITE_FAULT;
 goto out;
 } else if (iocb->ret < 0) {
 goto out;
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 
66d49b641aa1e89c12103e548320d89995fbbfae..3c8a9ba3c7956c1d475857a1068074338643f77f
 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -906,6 +906,7 @@ enum NvmeStatusCodes {
 NVME_SGL_DESCR_TYPE_INVALID = 0x0011,
 NVME_INVALID_USE_OF_CMB = 0x0012,
 NVME_INVALID_PRP_OFFSET = 0x0013,
+NVME_COMMAND_INTERRUPTED= 0x0021,
 NVME_FDP_DISABLED   = 0x0029,
 NVME_INVALID_PHID_LIST  = 0x002a,
 NVME_LBA_RANGE  = 0x0080,

-- 
2.45.2




[PATCH 2/9] hw/nvme: make oacs dynamic

2024-12-16 Thread Klaus Jensen
From: Klaus Jensen 

Virtualization Management requires the sriov-related parameters to be
set. Only report support for the command when that is the case.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/ctrl.c   | 25 ++++++++++++++++++-------
 hw/nvme/nvme.h   |  4 ++++
 include/block/nvme.h |  3 ++-
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 
e1d54ee34d2cd073821ac398bc5b4a51cb79e0e9..0e95c07c5314fa33674963ef2cea74c78954e86b
 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -266,7 +266,7 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
 [NVME_FDP_EVENTS]   = NVME_FEAT_CAP_CHANGE | NVME_FEAT_CAP_NS,
 };
 
-static const uint32_t nvme_cse_acs[256] = {
+static const uint32_t nvme_cse_acs_default[256] = {
 [NVME_ADM_CMD_DELETE_SQ]= NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_CREATE_SQ]= NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFF_CSUPP,
@@ -278,7 +278,6 @@ static const uint32_t nvme_cse_acs[256] = {
 [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
-[NVME_ADM_CMD_VIRT_MNGMT]   = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_DBBUF_CONFIG] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_FORMAT_NVM]   = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 [NVME_ADM_CMD_DIRECTIVE_RECV]   = NVME_CMD_EFF_CSUPP,
@@ -5135,7 +5134,7 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint8_t 
csi, uint32_t buf_len,
 }
 }
 
-memcpy(log.acs, nvme_cse_acs, sizeof(nvme_cse_acs));
+memcpy(log.acs, n->cse.acs, sizeof(log.acs));
 
 if (src_iocs) {
 memcpy(log.iocs, src_iocs, sizeof(log.iocs));
@@ -7242,7 +7241,7 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest 
*req)
 trace_pci_nvme_admin_cmd(nvme_cid(req), nvme_sqid(req), req->cmd.opcode,
  nvme_adm_opc_str(req->cmd.opcode));
 
-if (!(nvme_cse_acs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
+if (!(n->cse.acs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
 trace_pci_nvme_err_invalid_admin_opc(req->cmd.opcode);
 return NVME_INVALID_OPCODE | NVME_DNR;
 }
@@ -8676,6 +8675,9 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
*pci_dev)
 uint64_t cap = ldq_le_p(&n->bar.cap);
 NvmeSecCtrlEntry *sctrl = nvme_sctrl(n);
 uint32_t ctratt;
+uint16_t oacs;
+
+memcpy(n->cse.acs, nvme_cse_acs_default, sizeof(n->cse.acs));
 
 id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
 id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
@@ -8706,9 +8708,18 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
*pci_dev)
 
 id->mdts = n->params.mdts;
 id->ver = cpu_to_le32(NVME_SPEC_VER);
-id->oacs =
-cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | NVME_OACS_DBBUF |
-NVME_OACS_DIRECTIVES);
+
+oacs = NVME_OACS_NMS | NVME_OACS_FORMAT | NVME_OACS_DBBUF |
+NVME_OACS_DIRECTIVES;
+
+if (n->params.sriov_max_vfs) {
+oacs |= NVME_OACS_VMS;
+
+n->cse.acs[NVME_ADM_CMD_VIRT_MNGMT] = NVME_CMD_EFF_CSUPP;
+}
+
+id->oacs = cpu_to_le16(oacs);
+
 id->cntrltype = 0x1;
 
 /*
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 
724220691057063a972be2a0ada44d2164266144..191b6c5398d0c4583051a6a9773c677a49caffd6
 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -583,6 +583,10 @@ typedef struct NvmeCtrl {
 uint64_tdbbuf_eis;
 booldbbuf_enabled;
 
+struct {
+uint32_t acs[256];
+} cse;
+
 struct {
 MemoryRegion mem;
 uint8_t  *buf;
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 
f4d108841bf595f1176e9cb2c910e230181c67f6..9ebee69369d6bfa6835154a91b2bdaaf7984bf0c
 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -1192,8 +1192,9 @@ enum NvmeIdCtrlOacs {
 NVME_OACS_SECURITY  = 1 << 0,
 NVME_OACS_FORMAT= 1 << 1,
 NVME_OACS_FW= 1 << 2,
-NVME_OACS_NS_MGMT   = 1 << 3,
+NVME_OACS_NMS   = 1 << 3,
 NVME_OACS_DIRECTIVES= 1 << 5,
+NVME_OACS_VMS   = 1 << 7,
 NVME_OACS_DBBUF = 1 << 8,
 };
 

-- 
2.45.2




[PATCH 7/9] hw/nvme: only set command abort requested when cancelled due to Abort

2024-12-16 Thread Klaus Jensen
From: Klaus Jensen 

The Command Abort Requested status code should only be set if the
command was explicitly cancelled due to an Abort command. Or, if the
cancellation was due to Submission Queue deletion, set the status
code to Command Aborted due to SQ Deletion.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/ctrl.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 
8d3f62c40ac14fdc4bdc650e272023558cbbae0f..5b1bac020f049cc2a2f869b12e1d2a7e13cef316
 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1777,10 +1777,6 @@ static void nvme_aio_err(NvmeRequest *req, int ret)
 break;
 }
 
-if (ret == -ECANCELED) {
-status = NVME_CMD_ABORT_REQ;
-}
-
 trace_pci_nvme_err_aio(nvme_cid(req), strerror(-ret), status);
 
 error_setg_errno(&local_err, -ret, "aio failed");
@@ -4821,6 +4817,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req)
 while (!QTAILQ_EMPTY(&sq->out_req_list)) {
 r = QTAILQ_FIRST(&sq->out_req_list);
 assert(r->aiocb);
+r->status = NVME_CMD_ABORT_SQ_DEL;
 blk_aio_cancel(r->aiocb);
 }
 
@@ -6073,6 +6070,7 @@ static uint16_t nvme_abort(NvmeCtrl *n, NvmeRequest *req)
 QTAILQ_FOREACH_SAFE(r, &sq->out_req_list, entry, next) {
 if (r->cqe.cid == cid) {
 if (r->aiocb) {
+r->status = NVME_CMD_ABORT_REQ;
 blk_aio_cancel_async(r->aiocb);
 }
 break;
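Not part of the patch, but note the ordering it relies on: the reason for
the cancellation is recorded in the request before the AIO is cancelled,
so the completion path can report it instead of deriving a generic code
from -ECANCELED as nvme_aio_err() used to. A toy model of that pattern:

    #include <stdio.h>

    /* Assumed value: Command Aborted due to SQ Deletion (generic SC 08h). */
    #define CMD_ABORT_SQ_DEL 0x0008

    struct req {
        int status;    /* 0 until a specific reason has been recorded */
    };

    /* Stands in for the AIO completion callback firing after a cancel. */
    static void complete(struct req *r)
    {
        printf("completed with status 0x%04x\n", r->status);
    }

    int main(void)
    {
        struct req r = { 0 };

        r.status = CMD_ABORT_SQ_DEL;  /* record the reason first...      */
        complete(&r);                 /* ...then the cancel completes it */
        return 0;
    }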

-- 
2.45.2




[PATCH 5/9] hw/nvme: be compliant wrt. dsm processing limits

2024-12-16 Thread Klaus Jensen
From: Klaus Jensen 

The specification states that,

> The controller shall set all three processing limit fields (i.e., the
> DMRL, DMRSL and DMSL fields) to non-zero values or shall clear all
> three processing limit fields to 0h.

So, set the DMRL and DMSL fields in addition to DMRSL.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/ctrl.c   | 24 +++++++++++++++---------
 include/block/nvme.h |  2 ++
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 
120a1ca1076c8110d8550a5e75082c6ed4f23e16..22a8c400bae332285d015e8b590de159fd7d1b7a
 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -5581,7 +5581,9 @@ static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, 
NvmeRequest *req)
 switch (c->csi) {
 case NVME_CSI_NVM:
 id_nvm->vsl = n->params.vsl;
+id_nvm->dmrl = NVME_ID_CTRL_NVM_DMRL_MAX;
 id_nvm->dmrsl = cpu_to_le32(n->dmrsl);
+id_nvm->dmsl = NVME_ID_CTRL_NVM_DMRL_MAX * n->dmrsl;
 break;
 
 case NVME_CSI_ZONED:
@@ -6638,18 +6640,23 @@ static uint16_t nvme_aer(NvmeCtrl *n, NvmeRequest *req)
 return NVME_NO_COMPLETE;
 }
 
-static void nvme_update_dmrsl(NvmeCtrl *n)
+static void nvme_update_dsm_limits(NvmeCtrl *n, NvmeNamespace *ns)
 {
-int nsid;
+if (ns) {
+n->dmrsl =
+MIN_NON_ZERO(n->dmrsl, BDRV_REQUEST_MAX_BYTES / nvme_l2b(ns, 1));
 
-for (nsid = 1; nsid <= NVME_MAX_NAMESPACES; nsid++) {
-NvmeNamespace *ns = nvme_ns(n, nsid);
+return;
+}
+
+for (uint32_t nsid = 1; nsid <= NVME_MAX_NAMESPACES; nsid++) {
+ns = nvme_ns(n, nsid);
 if (!ns) {
 continue;
 }
 
-n->dmrsl = MIN_NON_ZERO(n->dmrsl,
-BDRV_REQUEST_MAX_BYTES / nvme_l2b(ns, 1));
+n->dmrsl =
+MIN_NON_ZERO(n->dmrsl, BDRV_REQUEST_MAX_BYTES / nvme_l2b(ns, 1));
 }
 }
 
@@ -6737,7 +6744,7 @@ static uint16_t nvme_ns_attachment(NvmeCtrl *n, 
NvmeRequest *req)
 ctrl->namespaces[nsid] = NULL;
 ns->attached--;
 
-nvme_update_dmrsl(ctrl);
+nvme_update_dsm_limits(ctrl, NULL);
 
 break;
 
@@ -8838,8 +8845,7 @@ void nvme_attach_ns(NvmeCtrl *n, NvmeNamespace *ns)
 n->namespaces[nsid] = ns;
 ns->attached++;
 
-n->dmrsl = MIN_NON_ZERO(n->dmrsl,
-BDRV_REQUEST_MAX_BYTES / nvme_l2b(ns, 1));
+nvme_update_dsm_limits(n, ns);
 }
 
 static void nvme_realize(PCIDevice *pci_dev, Error **errp)
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 
145a0b65933a699504d6d89222f7979a06f615df..f3f0317524d129f518698c6797ed37a7ac0ac847
 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -1167,6 +1167,8 @@ typedef struct NvmeIdCtrlZoned {
 uint8_t rsvd1[4095];
 } NvmeIdCtrlZoned;
 
+#define NVME_ID_CTRL_NVM_DMRL_MAX 255
+
 typedef struct NvmeIdCtrlNvm {
 uint8_t vsl;
 uint8_t wzsl;
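Not part of the patch, but to make the quoted rule concrete: with the
values chosen here, DMRL and DMSL are non-zero exactly when DMRSL is,
since DMRL = 255 and DMSL = DMRL * DMRSL. A toy check, assuming a
512-byte LBA size:

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        uint8_t  dmrl  = 255;                     /* max ranges per command */
        uint32_t dmrsl = 0x7fffffff / 512;        /* max LBAs per range     */
        uint64_t dmsl  = (uint64_t)dmrl * dmrsl;  /* max LBAs per command   */

        /* All three processing limits non-zero, or all three zero. */
        int nonzero = !!dmrl + !!dmrsl + !!dmsl;
        assert(nonzero == 0 || nonzero == 3);
        return 0;
    }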

-- 
2.45.2




[PATCH 3/9] hw/nvme: add knob for doorbell buffer config support

2024-12-16 Thread Klaus Jensen
From: Klaus Jensen 

Add a 'dbcs' knob to allow the Doorbell Buffer Config command to be
disabled.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/ctrl.c   | 11 ++++++++---
 hw/nvme/nvme.h   |  1 +
 include/block/nvme.h |  2 +-
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 
0e95c07c5314fa33674963ef2cea74c78954e86b..d544789f92ffe6b758ce35cecfc025d87efb9b7e
 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -278,7 +278,6 @@ static const uint32_t nvme_cse_acs_default[256] = {
 [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
-[NVME_ADM_CMD_DBBUF_CONFIG] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_FORMAT_NVM]   = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 [NVME_ADM_CMD_DIRECTIVE_RECV]   = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_DIRECTIVE_SEND]   = NVME_CMD_EFF_CSUPP,
@@ -8709,8 +8708,13 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
*pci_dev)
 id->mdts = n->params.mdts;
 id->ver = cpu_to_le32(NVME_SPEC_VER);
 
-oacs = NVME_OACS_NMS | NVME_OACS_FORMAT | NVME_OACS_DBBUF |
-NVME_OACS_DIRECTIVES;
+oacs = NVME_OACS_NMS | NVME_OACS_FORMAT | NVME_OACS_DIRECTIVES;
+
+if (n->params.dbcs) {
+oacs |= NVME_OACS_DBCS;
+
+n->cse.acs[NVME_ADM_CMD_DBBUF_CONFIG] = NVME_CMD_EFF_CSUPP;
+}
 
 if (n->params.sriov_max_vfs) {
 oacs |= NVME_OACS_VMS;
@@ -8960,6 +8964,7 @@ static Property nvme_props[] = {
 DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
 DEFINE_PROP_BOOL("legacy-cmb", NvmeCtrl, params.legacy_cmb, false),
 DEFINE_PROP_BOOL("ioeventfd", NvmeCtrl, params.ioeventfd, false),
+DEFINE_PROP_BOOL("dbcs", NvmeCtrl, params.dbcs, true),
 DEFINE_PROP_UINT8("zoned.zasl", NvmeCtrl, params.zasl, 0),
 DEFINE_PROP_BOOL("zoned.auto_transition", NvmeCtrl,
  params.auto_transition_zones, true),
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 
191b6c5398d0c4583051a6a9773c677a49caffd6..cb314e91af32a20f47e0a393e2458b7d4bdd03d9
 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -539,6 +539,7 @@ typedef struct NvmeParams {
 bool auto_transition_zones;
 bool legacy_cmb;
 bool ioeventfd;
+bool dbcs;
 uint16_t  sriov_max_vfs;
 uint16_t sriov_vq_flexible;
 uint16_t sriov_vi_flexible;
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 
9ebee69369d6bfa6835154a91b2bdaaf7984bf0c..a68a07455d0330b8f7cc283da0a5eadbcc140dab
 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -1195,7 +1195,7 @@ enum NvmeIdCtrlOacs {
 NVME_OACS_NMS   = 1 << 3,
 NVME_OACS_DIRECTIVES= 1 << 5,
 NVME_OACS_VMS   = 1 << 7,
-NVME_OACS_DBBUF = 1 << 8,
+NVME_OACS_DBCS  = 1 << 8,
 };
 
 enum NvmeIdCtrlOncs {

-- 
2.45.2




[PATCH 0/9] hw/nvme: refactor/cleanup

2024-12-16 Thread Klaus Jensen
Apart from some random small fixes here and there, the major thing here
is cleaning up how we handle command sets. Prior to this series, the
controller would not correctly validate namespace command sets against
CC.CSS. This is fixed here.

The cleanest way of doing this (as far as I could tell) was to make
sure an nvme-subsys device always exists (creating it if necessary). This
allows us to "store" the namespaces in the subsystem, using existing
functionality, and attach supported namespaces when the device is
started (instead of when the device is created/realized).

Signed-off-by: Klaus Jensen 
---
Klaus Jensen (9):
  hw/nvme: always initialize a subsystem
  hw/nvme: make oacs dynamic
  hw/nvme: add knob for doorbell buffer config support
  nvme: fix iocs status code values
  hw/nvme: be compliant wrt. dsm processing limits
  hw/nvme: rework csi handling
  hw/nvme: only set command abort requested when cancelled due to Abort
  hw/nvme: set error status code explicitly for misc commands
  hw/nvme: remove nvme_aio_err()

 hw/nvme/ctrl.c   | 430 ---
 hw/nvme/ns.c |  62 ++--
 hw/nvme/nvme.h   |  10 +-
 include/block/nvme.h |  22 ++-
 4 files changed, 276 insertions(+), 248 deletions(-)
---
base-commit: ca80a5d026a280762e0772615f1988db542b3ade
change-id: 20241216-nvme-queue-f4151c5d7507

Best regards,
-- 
Klaus Jensen 




[PATCH 9/9] hw/nvme: remove nvme_aio_err()

2024-12-16 Thread Klaus Jensen
From: Klaus Jensen 

nvme_rw_complete_cb() is the only remaining user of nvme_aio_err(), so
open-code the status code setting instead.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/ctrl.c | 60 ++
 1 file changed, 23 insertions(+), 37 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 
8192f92227d6509b8d15fde9d9197a59277eb86f..40f535b03316ed45f6cbb2894fd89f9ce258423e
 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1756,42 +1756,6 @@ static uint16_t nvme_check_dulbe(NvmeNamespace *ns, 
uint64_t slba,
 return NVME_SUCCESS;
 }
 
-static void nvme_aio_err(NvmeRequest *req, int ret)
-{
-uint16_t status = NVME_SUCCESS;
-Error *local_err = NULL;
-
-switch (req->cmd.opcode) {
-case NVME_CMD_READ:
-status = NVME_UNRECOVERED_READ;
-break;
-case NVME_CMD_WRITE:
-case NVME_CMD_WRITE_ZEROES:
-case NVME_CMD_ZONE_APPEND:
-case NVME_CMD_COPY:
-status = NVME_WRITE_FAULT;
-break;
-default:
-status = NVME_INTERNAL_DEV_ERROR;
-break;
-}
-
-trace_pci_nvme_err_aio(nvme_cid(req), strerror(-ret), status);
-
-error_setg_errno(&local_err, -ret, "aio failed");
-error_report_err(local_err);
-
-/*
- * Set the command status code to the first encountered error but allow a
- * subsequent Internal Device Error to trump it.
- */
-if (req->status && status != NVME_INTERNAL_DEV_ERROR) {
-return;
-}
-
-req->status = status;
-}
-
 static inline uint32_t nvme_zone_idx(NvmeNamespace *ns, uint64_t slba)
 {
 return ns->zone_size_log2 > 0 ? slba >> ns->zone_size_log2 :
@@ -2176,8 +2140,30 @@ void nvme_rw_complete_cb(void *opaque, int ret)
 trace_pci_nvme_rw_complete_cb(nvme_cid(req), blk_name(blk));
 
 if (ret) {
+Error *err = NULL;
+
 block_acct_failed(stats, acct);
-nvme_aio_err(req, ret);
+
+switch (req->cmd.opcode) {
+case NVME_CMD_READ:
+req->status = NVME_UNRECOVERED_READ;
+break;
+
+case NVME_CMD_WRITE:
+case NVME_CMD_WRITE_ZEROES:
+case NVME_CMD_ZONE_APPEND:
+req->status = NVME_WRITE_FAULT;
+break;
+
+default:
+req->status = NVME_INTERNAL_DEV_ERROR;
+break;
+}
+
+trace_pci_nvme_err_aio(nvme_cid(req), strerror(-ret), req->status);
+
+error_setg_errno(&err, -ret, "aio failed");
+error_report_err(err);
 } else {
 block_acct_done(stats, acct);
 }

-- 
2.45.2




Re: [PATCH v2 6/6] migration/block: Rewrite disk activation

2024-12-16 Thread Fabiano Rosas
Peter Xu  writes:

> This patch proposes a flag to maintain disk activation status globally.  It
> mostly rewrites disk activation mgmt for QEMU, including COLO and QMP
> command xen_save_devices_state.
>
> Backgrounds
> ===
>
> We have two problems on disk activations, one resolved, one not.
>
> Problem 1: disk activation recovery (for switchover interruptions)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> When migration is either cancelled or fails during switchover, especially
> after the disks have been inactivated, QEMU needs to remember to
> re-activate the disks again before the VM starts.
>
> It used to be done separately in two paths: one in qmp_migrate_cancel(),
> the other one in the failure path of migration_completion().
>
> It used to be fixed in different commits, all over the place in QEMU.  So
> these are the relevant changes I saw; I'm not sure if it's a complete list:
>
>  - In 2016, commit fe904ea824 ("migration: regain control of images when
>migration fails to complete")
>
>  - In 2017, commit 1d2acc3162 ("migration: re-active images while migration
>been canceled after inactive them")
>
>  - In 2023, commit 6dab4c93ec ("migration: Attempt disk reactivation in
>more failure scenarios")
>
> Now since we have a slightly better picture maybe we can unify the
> reactivation in a single path.
>
> One side benefit of doing so is that we can move the disk operation outside
> the QMP command "migrate_cancel".  It's possible that in the future we may
> want to make "migrate_cancel" OOB-compatible, which requires that the
> command not need the BQL in the first place.  This already does that and
> makes the migrate_cancel command lightweight.
>
> Problem 2: disk invalidation on top of invalidated disks
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> This is an unresolved bug for current QEMU.  Link in "Resolves:" at the
> end.  It turns out that besides the src switchover phase (problem 1 above),
> QEMU also needs to remember the block activation state on the destination.
>
> Consider two migrations in a row, where the VM stays paused throughout.
> In that scenario, the disks are still not activated even after migration
> completes in the 1st round.  When the 2nd round starts, if QEMU doesn't
> know the status of the disks, it needs to try to inactivate them again.
>
> Here the issue is that the block layer API bdrv_inactivate_all() will
> crash QEMU if invoked on already-inactive disks for the 2nd migration.
> For details, see the bug link at the end.
>
> Implementation
> ==============
>
> This patch proposes to maintain disk activation with a global flag, so we
> know:
>
>   - If we used to inactivate disks for migration, but the migration got
>   cancelled or failed, QEMU will know it should reactivate the disks.
>
>   - On the incoming side, if the disks were never activated but another
>   migration is triggered, QEMU should be able to tell that inactivation
>   is not needed for the 2nd migration.
>
> We used to have disk_inactive, but it only solves the 1st issue, not the
> 2nd.  Also, it's maintained in completely separate paths, so it's extremely
> hard to follow how the flag changes, for how long it stays valid, and when
> the disks will be reactivated.
>
> Convert the existing disk_inactive flag into that global flag (also invert
> its naming), and maintain the disk activation status for the whole
> lifecycle of qemu.  That includes the incoming QEMU.
>
> Put both of the error cases of source migration (failure, cancellation)
> together into migration_iteration_finish(), which will be invoked for
> either scenario.  So from that angle QEMU should behave the same as
> before.  However, with such global maintenance of the disk activation
> status, we not only clean up quite a few ad-hoc paths that used to
> maintain it (e.g. in the postcopy code), but also fix the crash for
> problem 2 in one shot.
>
> For freshly started QEMU, the flag is initialized to TRUE showing that the
> QEMU owns the disks by default.
>
> For incoming migrated QEMU, the flag will be initialized to FALSE once and
> for all showing that the dest QEMU doesn't own the disks until switchover.
> That is guaranteed by the "once" variable.
>
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2395
> Signed-off-by: Peter Xu 

Reviewed-by: Fabiano Rosas 

Just a note about errp below and a comment about a pre-existing
condition, but nothing that requires action.

> ---
>  include/migration/misc.h |  4 ++
>  migration/migration.h|  6 +--
>  migration/block-active.c | 94 
>  migration/colo.c |  2 +-
>  migration/migration.c| 80 --
>  migration/savevm.c   | 33 ++
>  monitor/qmp-cmds.c   |  8 +---
>  migration/meson.build|  1 +
>  migration/trace-events   |  3 ++
>  9 files changed, 140 insertions(+), 91 deletions(-)
>  create mode 100644 migration/block-active.c
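The scheme is easy to see in miniature. A toy model of the global flag
described above (illustrative only; the real code lives in the new
migration/block-active.c and calls bdrv_activate_all() and
bdrv_inactivate_all() rather than printf):

    #include <stdbool.h>
    #include <stdio.h>

    static bool block_active = true;     /* fresh QEMU owns its disks */

    static void block_activate(void)
    {
        if (!block_active) {
            printf("activating disks\n");    /* bdrv_activate_all()   */
            block_active = true;
        }
    }

    static void block_inactivate(void)
    {
        if (block_active) {
            printf("inactivating disks\n");  /* bdrv_inactivate_all() */
            block_active = false;
        }
    }

    int main(void)
    {
        block_inactivate();  /* switchover: hand disks to destination  */
        block_inactivate();  /* 2nd migration of paused VM: safe no-op */
        block_activate();    /* migration cancelled/failed: reactivate */
        return 0;
    }

In this model the double inactivation of problem 2 degrades to a no-op,
and problem 1 reduces to calling block_activate() from the single
failure path.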

Re: libnfs 6.0.0 breaks qemu compilation

2024-12-16 Thread Daniel P . Berrangé
On Sat, Dec 14, 2024 at 01:40:45PM +0100, Paolo Bonzini wrote:
> On Fri, Dec 13, 2024 at 20:19, Richard W.M. Jones  wrote:
> 
> > On Fri, Dec 13, 2024 at 07:37:10PM +0100, Paolo Bonzini wrote:
> > > Yeah, and I don't think it should be merged, unless libnfs support is
> > dropped
> > > from the QEMU build in rawhide.
> >
> > Sure if there's no easy fix on the horizon, we can remove libnfs
> > support temporarily.
> >
> 
> Can we just keep the old libnfs indefinitely? Are there any killer features
> for dependencies other than QEMU?

QEMU needs fixing to work with both old and new libnfs. Even if
Fedora pinned to old libnfs, we can be sure that sooner or later
the new libnfs will appear in other distros and thus will inevitably
need to deal with this incompatibility.

With regards,
Daniel
-- 
|: https://berrange.com       -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org        -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|




Re: [PATCH] vvfat: fix ubsan issue in create_long_filename

2024-12-16 Thread Pierrick Bouvier

On 12/4/24 11:51, Pierrick Bouvier wrote:

[...]


cc qemu-trivial.