Re: [PATCH] vvfat: fix ubsan issue in create_long_filename
Hi everyone, gentle ping on this series.

On 12/4/24 11:51, Pierrick Bouvier wrote:
> Found with test sbsaref introduced in [1].
> [1] https://patchew.org/QEMU/20241203213629.2482806-1-pierrick.bouv...@linaro.org/
>
> ../block/vvfat.c:433:24: runtime error: index 14 out of bounds for type 'uint8_t [11]'
>     #0 0x56151a66b93a in create_long_filename ../block/vvfat.c:433
>     #1 0x56151a66f3d7 in create_short_and_long_name ../block/vvfat.c:725
>     #2 0x56151a670403 in read_directory ../block/vvfat.c:804
>     #3 0x56151a674432 in init_directories ../block/vvfat.c:964
>     #4 0x56151a67867b in vvfat_open ../block/vvfat.c:1258
>     #5 0x56151a3b8e19 in bdrv_open_driver ../block.c:1660
>     #6 0x56151a3bb666 in bdrv_open_common ../block.c:1985
>     #7 0x56151a3cadb9 in bdrv_open_inherit ../block.c:4153
>     #8 0x56151a3c8850 in bdrv_open_child_bs ../block.c:3731
>     #9 0x56151a3ca832 in bdrv_open_inherit ../block.c:4098
>     #10 0x56151a3cbe40 in bdrv_open ../block.c:4248
>     #11 0x56151a46344f in blk_new_open ../block/block-backend.c:457
>     #12 0x56151a388bd9 in blockdev_init ../blockdev.c:612
>     #13 0x56151a38ab2d in drive_new ../blockdev.c:1006
>     #14 0x5615190fca41 in drive_init_func ../system/vl.c:649
>     #15 0x56151aa796dd in qemu_opts_foreach ../util/qemu-option.c:1135
>     #16 0x5615190fd2b6 in configure_blockdev ../system/vl.c:708
>     #17 0x56151910a307 in qemu_create_early_backends ../system/vl.c:2004
>     #18 0x561519113fcf in qemu_init ../system/vl.c:3685
>     #19 0x56151a7e438e in main ../system/main.c:47
>     #20 0x7f72d1a46249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
>     #21 0x7f72d1a46304 in __libc_start_main_impl ../csu/libc-start.c:360
>     #22 0x561517e98510 in _start (/home/user/.work/qemu/build/qemu-system-aarch64+0x3b9b510)
>
> The offset used can easily go beyond entry->name size. It's probably a
> bug, but I don't have the time to dive into vfat specifics for now.
> This change solves the ubsan issue, and is functionally equivalent, as
> anything written past the entry->name array would not be read anyway.
>
> Signed-off-by: Pierrick Bouvier
> ---
>  block/vvfat.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/block/vvfat.c b/block/vvfat.c
> index 8ffe8b3b9bf..f2eafaa9234 100644
> --- a/block/vvfat.c
> +++ b/block/vvfat.c
> @@ -426,6 +426,10 @@ static direntry_t *create_long_filename(BDRVVVFATState *s, const char *filename)
>          else if(offset<22) offset=14+offset-10;
>          else offset=28+offset-22;
>          entry=array_get(&(s->directory),s->directory.next-1-(i/26));
> +        /* ensure we don't write anything past entry->name */
> +        if (offset >= sizeof(entry->name)) {
> +            continue;
> +        }
>          if (i >= 2 * length + 2) {
>              entry->name[offset] = 0xff;
>          } else if (i % 2 == 0) {
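For context on the offset arithmetic in the hunk above: the values 1, 14
and 28 are the three name-character regions of a 32-byte VFAT
long-filename directory entry (layout per the FAT specification; this
annotation is not part of the patch):

    /* 32-byte "long filename" slot, overlaid on the short-entry layout:
     *   byte   0      sequence number
     *   bytes  1-10   name chars  1-5  (UTF-16)  ->  offset < 10:  1 + offset
     *   byte  11      attributes (0x0f marks a long entry)
     *   bytes 14-25   name chars  6-11 (UTF-16)  ->  offset < 22: 14 + offset - 10
     *   bytes 28-31   name chars 12-13 (UTF-16)  ->  else:        28 + offset - 22
     */

vvfat stores these characters through entry->name, an 11-byte array at
the start of the 32-byte direntry_t, so an offset of up to 31 stays
inside the entry itself but indexes past the declared bound of the
array -- which is exactly what UBSan reports.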
Re: [PATCH v2 6/6] migration/block: Rewrite disk activation
On Mon, Dec 16, 2024 at 12:58:58PM -0300, Fabiano Rosas wrote:
> > @@ -3286,6 +3237,11 @@ static void migration_iteration_finish(MigrationState *s)
> >      case MIGRATION_STATUS_FAILED:
> >      case MIGRATION_STATUS_CANCELLED:
> >      case MIGRATION_STATUS_CANCELLING:
>
> Pre-existing, but can we even reach here with CANCELLED? If we can start
> the VM with both CANCELLED and CANCELLING, that means the
> MIG_EVENT_PRECOPY_FAILED notifier is not being consistently called. So I
> think CANCELLED here must be unreachable...

Yeah, I can't see how it's reachable either, because the only place where
we can set it to CANCELLED is:

  migrate_fd_cleanup():
    if (s->state == MIGRATION_STATUS_CANCELLING) {
        migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
                          MIGRATION_STATUS_CANCELLED);
    }

In this specific context, it (as a bottom half) can only be scheduled
after this path.

Looks like the MIG_EVENT_PRECOPY_FAILED event will work regardless,
though? As that's also done after the above:

    type = migration_has_failed(s) ? MIG_EVENT_PRECOPY_FAILED :
                                     MIG_EVENT_PRECOPY_DONE;
    migration_call_notifiers(s, type, NULL);

So no matter whether it was CANCELLING or CANCELLED, it will always be
CANCELLED by the time migration_has_failed() is reached.

[...]

> > @@ -103,13 +104,8 @@ void qmp_cont(Error **errp)
> >       * Continuing after completed migration. Images have been
> >       * inactivated to allow the destination to take control. Need to
> >       * get control back now.
> > -     *
> > -     * If there are no inactive block nodes (e.g. because the VM was
> > -     * just paused rather than completing a migration),
> > -     * bdrv_inactivate_all() simply doesn't do anything.
> >       */
> > -    bdrv_activate_all(&local_err);
> > -    if (local_err) {
> > +    if (!migration_block_activate(&local_err)) {
> >          error_propagate(errp, local_err);
>
> Could use errp directly here.

True..

-- 
Peter Xu
Re: libnfs 6.0.0 breaks qemu compilation
On 2024-12-16 11:43, Daniel P. Berrangé wrote:
> On Sat, Dec 14, 2024 at 01:40:45PM +0100, Paolo Bonzini wrote:
> > On Fri, Dec 13, 2024, at 20:19, Richard W.M. Jones wrote:
> > > On Fri, Dec 13, 2024 at 07:37:10PM +0100, Paolo Bonzini wrote:
> > > > Yeah, and I don't think it should be merged, unless libnfs support
> > > > is dropped from the QEMU build in rawhide.
> > >
> > > Sure, if there's no easy fix on the horizon, we can remove libnfs
> > > support temporarily.
> >
> > Can we just keep the old libnfs indefinitely? Are there any killer
> > features for dependencies other than QEMU?
>
> QEMU needs fixing to work with both old and new libnfs. Even if Fedora
> pinned to the old libnfs, we can be sure that sooner or later the new
> libnfs will appear in other distros, and thus we will inevitably need
> to deal with this incompatibility.

To answer a previous question, there's no hurry in switching to the new
libnfs 6, especially as all affected code will need to be patched to
account for the new API. However, it will indeed need to be fixed sooner
or later; I don't think upstream will maintain the old libnfs 5 branch.

Regards,
Xavier
Re: [PATCH v2 2/6] qmp/cont: Only activate disks if migration completed
Peter Xu writes:

> As the comment says, the activation of disks is for the case where
> migration has completed, rather than when QEMU is still during
> migration (RUN_STATE_INMIGRATE).
>
> Move the code over to reflect what the comment is describing.
>
> Cc: Kevin Wolf
> Cc: Markus Armbruster
> Signed-off-by: Peter Xu

Reviewed-by: Fabiano Rosas
Re: [PATCH v2 3/6] migration/block: Make late-block-active the default
Peter Xu writes:

> Migration capability 'late-block-active' controls when the block drives
> will be activated. If enabled, block drives will only be activated when
> the VM starts: either right away, if the src runstate was "live"
> (RUNNING or SUSPENDED), or postponed until qmp_cont().
>
> Let's do this unconditionally. There's no harm in delaying activation of
> block drives. Meanwhile, there's no ABI breakage if the dest does it
> unconditionally, because the src QEMU has nothing to do with it.
>
> IIUC we could have avoided introducing this cap in the first place, and
> it's still not too late to just always do it. The cap is now ripe for
> removal, but that is left for later patches.
>
> Signed-off-by: Peter Xu

Reviewed-by: Fabiano Rosas
Re: [PATCH v2 4/6] migration/block: Apply late-block-active behavior to postcopy
Peter Xu writes:

> Postcopy never cared about late-block-active. However, there's no
> mention in the capability that it doesn't apply to postcopy.
>
> Considering that we _assumed_ late activation is always good, do that
> too for postcopy unconditionally, just like precopy. After this patch,
> the behavior should be unified across all paths.
>
> Signed-off-by: Peter Xu

Reviewed-by: Fabiano Rosas
Re: [PATCH v2 5/6] migration/block: Fix possible race with block_inactive
Peter Xu writes:

> Src QEMU sets block_inactive=true very early, before the invalidation
> takes place. This means that if something goes wrong after setting the
> flag but before reaching
> qemu_savevm_state_complete_precopy_non_iterable(), where the
> invalidation work is done, the block_inactive flag is left inconsistent.
>
> For example, consider when qemu_savevm_state_complete_precopy_iterable()
> fails: block_inactive will be set to true even though all block drives
> are still active.
>
> Fix that by only updating the flag after the invalidation is done.
>
> No Fixes tag for any commit, because it's not an issue as long as
> bdrv_activate_all() is re-entrant on all-active disks -- a false-positive
> block_inactive brings nothing more than "trying to activate the blocks
> but they're already active". However, let's still do it right to avoid
> the flag becoming inconsistent with reality.
>
> Signed-off-by: Peter Xu

Reviewed-by: Fabiano Rosas
[PATCH 1/9] hw/nvme: always initialize a subsystem
From: Klaus Jensen

If no nvme-subsys is explicitly configured, instantiate one.

Signed-off-by: Klaus Jensen
---
 hw/nvme/ctrl.c | 36 ++---
 hw/nvme/ns.c   | 64 --
 2 files changed, 42 insertions(+), 58 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index ec754195669598082dd573ff1237dbbfb9b8aff5..e1d54ee34d2cd073821ac398bc5b4a51cb79e0e9 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -8759,15 +8759,13 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     id->psd[0].enlat = cpu_to_le32(0x10);
     id->psd[0].exlat = cpu_to_le32(0x4);
 
-    if (n->subsys) {
-        id->cmic |= NVME_CMIC_MULTI_CTRL;
-        ctratt |= NVME_CTRATT_ENDGRPS;
+    id->cmic |= NVME_CMIC_MULTI_CTRL;
+    ctratt |= NVME_CTRATT_ENDGRPS;
 
-        id->endgidmax = cpu_to_le16(0x1);
+    id->endgidmax = cpu_to_le16(0x1);
 
-        if (n->subsys->endgrp.fdp.enabled) {
-            ctratt |= NVME_CTRATT_FDPS;
-        }
+    if (n->subsys->endgrp.fdp.enabled) {
+        ctratt |= NVME_CTRATT_FDPS;
     }
 
     id->ctratt = cpu_to_le32(ctratt);
@@ -8796,7 +8794,15 @@ static int nvme_init_subsys(NvmeCtrl *n, Error **errp)
     int cntlid;
 
     if (!n->subsys) {
-        return 0;
+        DeviceState *dev = qdev_new(TYPE_NVME_SUBSYS);
+
+        qdev_prop_set_string(dev, "nqn", n->params.serial);
+
+        if (!qdev_realize(dev, NULL, errp)) {
+            return -1;
+        }
+
+        n->subsys = NVME_SUBSYS(dev);
     }
 
     cntlid = nvme_subsys_register_ctrl(n, errp);
@@ -8886,17 +8892,15 @@ static void nvme_exit(PCIDevice *pci_dev)
 
     nvme_ctrl_reset(n, NVME_RESET_FUNCTION);
 
-    if (n->subsys) {
-        for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-            ns = nvme_ns(n, i);
-            if (ns) {
-                ns->attached--;
-            }
+    for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+        ns = nvme_ns(n, i);
+        if (ns) {
+            ns->attached--;
         }
-
-        nvme_subsys_unregister_ctrl(n->subsys, n);
     }
 
+    nvme_subsys_unregister_ctrl(n->subsys, n);
+
     g_free(n->cq);
     g_free(n->sq);
     g_free(n->aer_reqs);
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 526e15aa80187b82f9622445849b03fd472da8df..3be43503c50798db0ab528fe30ad901bb6aa9db3 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -727,25 +727,14 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
     uint32_t nsid = ns->params.nsid;
     int i;
 
-    if (!n->subsys) {
-        /* If no subsys, the ns cannot be attached to more than one ctrl. */
-        ns->params.shared = false;
-        if (ns->params.detached) {
-            error_setg(errp, "detached requires that the nvme device is "
-                       "linked to an nvme-subsys device");
-            return;
-        }
-    } else {
-        /*
-         * If this namespace belongs to a subsystem (through a link on the
-         * controller device), reparent the device.
-         */
-        if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
-            return;
-        }
-        ns->subsys = subsys;
-        ns->endgrp = &subsys->endgrp;
+    assert(subsys);
+
+    /* reparent to subsystem bus */
+    if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+        return;
     }
+    ns->subsys = subsys;
+    ns->endgrp = &subsys->endgrp;
 
     if (nvme_ns_setup(ns, errp)) {
         return;
@@ -753,7 +742,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 
     if (!nsid) {
         for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-            if (nvme_ns(n, i) || nvme_subsys_ns(subsys, i)) {
+            if (nvme_subsys_ns(subsys, i)) {
                 continue;
             }
 
@@ -765,38 +754,29 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
             error_setg(errp, "no free namespace id");
             return;
         }
-    } else {
-        if (nvme_ns(n, nsid) || nvme_subsys_ns(subsys, nsid)) {
-            error_setg(errp, "namespace id '%d' already allocated", nsid);
-            return;
-        }
+    } else if (nvme_subsys_ns(subsys, nsid)) {
+        error_setg(errp, "namespace id '%d' already allocated", nsid);
+        return;
     }
 
-    if (subsys) {
-        subsys->namespaces[nsid] = ns;
+    subsys->namespaces[nsid] = ns;
 
-        ns->id_ns.endgid = cpu_to_le16(0x1);
-        ns->id_ns_ind.endgrpid = cpu_to_le16(0x1);
+    ns->id_ns.endgid = cpu_to_le16(0x1);
+    ns->id_ns_ind.endgrpid = cpu_to_le16(0x1);
 
-        if (ns->params.detached) {
-            return;
-        }
+    if (ns->params.detached) {
+        return;
+    }
 
-        if (ns->params.shared) {
-            for (i = 0; i < ARRAY_SIZE(subsys->ctrls); i++) {
-                NvmeCtrl *ctrl = subsys->ctrls[i];
+    if (ns->params.shared) {
+        for (i = 0; i < ARRAY_SIZE(subsys->ctrls); i++) {
+            NvmeCtrl *ctrl = subsys->ctrls[i];
 
-            i
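A user-visible way to read this change (the command line below is
illustrative, not taken from the patch): a controller that previously
stood alone now implicitly gets a subsystem whose NQN is derived from
its serial, so

    -device nvme,serial=deadbeef,drive=nvm

behaves roughly as if the user had also specified an nvme-subsys device
with nqn=deadbeef and linked the controller to it.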
[PATCH 6/9] hw/nvme: rework csi handling
From: Klaus Jensen

The controller incorrectly allows a zoned namespace to be attached even
if CC.CSS is configured to only support the NVM command set for I/O
queues.

Rework handling of namespace command sets in general by attaching
supported namespaces when the controller is started instead of, as now,
statically when realized.

Signed-off-by: Klaus Jensen
---
 hw/nvme/ctrl.c       | 242 ---
 hw/nvme/ns.c         |  14 ---
 hw/nvme/nvme.h       |   5 +-
 include/block/nvme.h |  10 ++-
 4 files changed, 142 insertions(+), 129 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 22a8c400bae332285d015e8b590de159fd7d1b7a..8d3f62c40ac14fdc4bdc650e272023558cbbae0f 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -277,36 +277,32 @@ static const uint32_t nvme_cse_acs_default[256] = {
     [NVME_ADM_CMD_SET_FEATURES]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_GET_FEATURES]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_ASYNC_EV_REQ]     = NVME_CMD_EFF_CSUPP,
-    [NVME_ADM_CMD_NS_ATTACHMENT]    = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
+    [NVME_ADM_CMD_NS_ATTACHMENT]    = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC |
+                                      NVME_CMD_EFF_CCC,
     [NVME_ADM_CMD_FORMAT_NVM]       = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
     [NVME_ADM_CMD_DIRECTIVE_RECV]   = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_DIRECTIVE_SEND]   = NVME_CMD_EFF_CSUPP,
 };
 
-static const uint32_t nvme_cse_iocs_none[256];
-
-static const uint32_t nvme_cse_iocs_nvm[256] = {
-    [NVME_CMD_FLUSH]        = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-    [NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-    [NVME_CMD_WRITE]        = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-    [NVME_CMD_READ]         = NVME_CMD_EFF_CSUPP,
-    [NVME_CMD_DSM]          = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-    [NVME_CMD_VERIFY]       = NVME_CMD_EFF_CSUPP,
-    [NVME_CMD_COPY]         = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-    [NVME_CMD_COMPARE]      = NVME_CMD_EFF_CSUPP,
-    [NVME_CMD_IO_MGMT_RECV] = NVME_CMD_EFF_CSUPP,
-    [NVME_CMD_IO_MGMT_SEND] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
+#define NVME_CSE_IOCS_NVM_DEFAULT \
+    [NVME_CMD_FLUSH]        = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC, \
+    [NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC, \
+    [NVME_CMD_WRITE]        = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC, \
+    [NVME_CMD_READ]         = NVME_CMD_EFF_CSUPP, \
+    [NVME_CMD_DSM]          = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC, \
+    [NVME_CMD_VERIFY]       = NVME_CMD_EFF_CSUPP, \
+    [NVME_CMD_COPY]         = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC, \
+    [NVME_CMD_COMPARE]      = NVME_CMD_EFF_CSUPP, \
+    [NVME_CMD_IO_MGMT_RECV] = NVME_CMD_EFF_CSUPP, \
+    [NVME_CMD_IO_MGMT_SEND] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC
+
+static const uint32_t nvme_cse_iocs_nvm_default[256] = {
+    NVME_CSE_IOCS_NVM_DEFAULT,
 };
 
-static const uint32_t nvme_cse_iocs_zoned[256] = {
-    [NVME_CMD_FLUSH]        = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-    [NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-    [NVME_CMD_WRITE]        = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-    [NVME_CMD_READ]         = NVME_CMD_EFF_CSUPP,
-    [NVME_CMD_DSM]          = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-    [NVME_CMD_VERIFY]       = NVME_CMD_EFF_CSUPP,
-    [NVME_CMD_COPY]         = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
-    [NVME_CMD_COMPARE]      = NVME_CMD_EFF_CSUPP,
+static const uint32_t nvme_cse_iocs_zoned_default[256] = {
+    NVME_CSE_IOCS_NVM_DEFAULT,
+
     [NVME_CMD_ZONE_APPEND]    = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
     [NVME_CMD_ZONE_MGMT_SEND] = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
     [NVME_CMD_ZONE_MGMT_RECV] = NVME_CMD_EFF_CSUPP,
@@ -4603,6 +4599,61 @@ static uint16_t nvme_io_mgmt_send(NvmeCtrl *n, NvmeRequest *req)
     };
 }
 
+static uint16_t __nvme_io_cmd_nvm(NvmeCtrl *n, NvmeRequest *req)
+{
+    switch (req->cmd.opcode) {
+    case NVME_CMD_WRITE:
+        return nvme_write(n, req);
+    case NVME_CMD_READ:
+        return nvme_read(n, req);
+    case NVME_CMD_COMPARE:
+        return nvme_compare(n, req);
+    case NVME_CMD_WRITE_ZEROES:
+        return nvme_write_zeroes(n, req);
+    case NVME_CMD_DSM:
+        return nvme_dsm(n, req);
+    case NVME_CMD_VERIFY:
+        return nvme_verify(n, req);
+    case NVME_CMD_COPY:
+        return nvme_copy(n, req);
+    case NVME_CMD_IO_MGMT_RECV:
+        return nvme_io_mgmt_recv(n, req);
+    case NVME_CMD_IO_MGMT_SEND:
+        return nvme_io_mgmt_send(n, req);
+    }
+
+    g_assert_not_reached();
+}
+
+static uint16_
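To make the CC.CSS interaction concrete, here is a rough sketch of the
validation this rework enables. It is illustrative only: the macro and
enum names follow include/block/nvme.h as I recall them, and this is not
necessarily the exact helper the patch adds.

    static bool nvme_csi_supported(NvmeCtrl *n, uint8_t csi)
    {
        uint32_t cc = ldl_le_p(&n->bar.cc);

        switch (csi) {
        case NVME_CSI_NVM:
            /* the NVM command set is usable whenever I/O queues exist */
            return NVME_CC_CSS(cc) != NVME_CC_CSS_ADMIN_ONLY;
        case NVME_CSI_ZONED:
            /* zoned needs "all supported I/O command sets" (CSS = 110b) */
            return NVME_CC_CSS(cc) == NVME_CC_CSS_CSI;
        }

        return false;
    }

With namespaces now attached at controller start rather than at realize
time, a check of this shape can consult the CC.CSS value the host
actually programmed.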
[PATCH 4/9] nvme: fix iocs status code values
From: Klaus Jensen

The status codes related to I/O Command Sets are in the wrong group.

Signed-off-by: Klaus Jensen
---
 hw/nvme/ctrl.c       | 4 ++--
 include/block/nvme.h | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index d544789f92ffe6b758ce35cecfc025d87efb9b7e..120a1ca1076c8110d8550a5e75082c6ed4f23e16 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -5623,7 +5623,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req, bool active)
         return nvme_c2h(n, (uint8_t *)&ns->id_ns, sizeof(NvmeIdNs), req);
     }
 
-    return NVME_INVALID_CMD_SET | NVME_DNR;
+    return NVME_INVALID_IOCS | NVME_DNR;
 }
 
 static uint16_t nvme_identify_ctrl_list(NvmeCtrl *n, NvmeRequest *req,
@@ -6589,7 +6589,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
     case NVME_COMMAND_SET_PROFILE:
         if (dw11 & 0x1ff) {
             trace_pci_nvme_err_invalid_iocsci(dw11 & 0x1ff);
-            return NVME_CMD_SET_CMB_REJECTED | NVME_DNR;
+            return NVME_IOCS_COMBINATION_REJECTED | NVME_DNR;
         }
         break;
     case NVME_FDP_MODE:
diff --git a/include/block/nvme.h b/include/block/nvme.h
index a68a07455d0330b8f7cc283da0a5eadbcc140dab..145a0b65933a699504d6d89222f7979a06f615df 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -906,8 +906,6 @@ enum NvmeStatusCodes {
     NVME_SGL_DESCR_TYPE_INVALID = 0x0011,
     NVME_INVALID_USE_OF_CMB     = 0x0012,
     NVME_INVALID_PRP_OFFSET     = 0x0013,
-    NVME_CMD_SET_CMB_REJECTED   = 0x002b,
-    NVME_INVALID_CMD_SET        = 0x002c,
     NVME_FDP_DISABLED           = 0x0029,
     NVME_INVALID_PHID_LIST      = 0x002a,
     NVME_LBA_RANGE              = 0x0080,
@@ -940,6 +938,8 @@ enum NvmeStatusCodes {
     NVME_INVALID_SEC_CTRL_STATE = 0x0120,
     NVME_INVALID_NUM_RESOURCES  = 0x0121,
     NVME_INVALID_RESOURCE_ID    = 0x0122,
+    NVME_IOCS_COMBINATION_REJECTED = 0x012b,
+    NVME_INVALID_IOCS           = 0x012c,
     NVME_CONFLICTING_ATTRS      = 0x0180,
     NVME_INVALID_PROT_INFO      = 0x0181,
     NVME_WRITE_TO_RO            = 0x0182,
-- 
2.45.2
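A decoding note, inferred from the surrounding enum values rather than
stated in the patch: QEMU packs the CQE Status Field as
(Status Code Type << 8) | Status Code, so this fix moves both codes from
SCT 0h (Generic Command Status) to SCT 1h (Command Specific Status),
where the base specification actually defines them:

    0x012b  =  SCT 1h, SC 2Bh  -- I/O Command Set Combination Rejected
    0x012c  =  SCT 1h, SC 2Ch  -- Invalid I/O Command Set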
[PATCH 8/9] hw/nvme: set error status code explicitly for misc commands
From: Klaus Jensen

nvme_aio_err() does not handle Verify, Compare, Copy and other misc
commands, and defaults to setting the error status code to Internal
Device Error. For some of these commands, we know better, so set it
explicitly.

For the commands using the nvme_misc_cb() callback (Copy, Flush, ...),
if no status code has explicitly been set by the lower handlers, default
to Internal Device Error as before.

Signed-off-by: Klaus Jensen
---
 hw/nvme/ctrl.c       | 28 ++++++++++++++++++++++------
 include/block/nvme.h |  1 +
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 5b1bac020f049cc2a2f869b12e1d2a7e13cef316..8192f92227d6509b8d15fde9d9197a59277eb86f 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1765,7 +1765,6 @@ static void nvme_aio_err(NvmeRequest *req, int ret)
     case NVME_CMD_READ:
         status = NVME_UNRECOVERED_READ;
         break;
-    case NVME_CMD_FLUSH:
     case NVME_CMD_WRITE:
     case NVME_CMD_WRITE_ZEROES:
     case NVME_CMD_ZONE_APPEND:
@@ -2151,11 +2150,16 @@ static inline bool nvme_is_write(NvmeRequest *req)
 static void nvme_misc_cb(void *opaque, int ret)
 {
     NvmeRequest *req = opaque;
+    uint16_t cid = nvme_cid(req);
 
-    trace_pci_nvme_misc_cb(nvme_cid(req));
+    trace_pci_nvme_misc_cb(cid);
 
     if (ret) {
-        nvme_aio_err(req, ret);
+        if (!req->status) {
+            req->status = NVME_INTERNAL_DEV_ERROR;
+        }
+
+        trace_pci_nvme_err_aio(cid, strerror(-ret), req->status);
     }
 
     nvme_enqueue_req_completion(nvme_cq(req), req);
@@ -2258,7 +2262,10 @@ static void nvme_verify_cb(void *opaque, int ret)
 
     if (ret) {
         block_acct_failed(stats, acct);
-        nvme_aio_err(req, ret);
+        req->status = NVME_UNRECOVERED_READ;
+
+        trace_pci_nvme_err_aio(nvme_cid(req), strerror(-ret), req->status);
+
         goto out;
     }
 
@@ -2357,7 +2364,10 @@ static void nvme_compare_mdata_cb(void *opaque, int ret)
 
     if (ret) {
         block_acct_failed(stats, acct);
-        nvme_aio_err(req, ret);
+        req->status = NVME_UNRECOVERED_READ;
+
+        trace_pci_nvme_err_aio(nvme_cid(req), strerror(-ret), req->status);
+
         goto out;
     }
 
@@ -2439,7 +2449,10 @@ static void nvme_compare_data_cb(void *opaque, int ret)
 
     if (ret) {
         block_acct_failed(stats, acct);
-        nvme_aio_err(req, ret);
+        req->status = NVME_UNRECOVERED_READ;
+
+        trace_pci_nvme_err_aio(nvme_cid(req), strerror(-ret), req->status);
+
         goto out;
     }
 
@@ -2918,6 +2931,7 @@ static void nvme_copy_out_completed_cb(void *opaque, int ret)
 
     if (ret < 0) {
         iocb->ret = ret;
+        req->status = NVME_WRITE_FAULT;
         goto out;
     } else if (iocb->ret < 0) {
         goto out;
@@ -2982,6 +2996,7 @@ static void nvme_copy_in_completed_cb(void *opaque, int ret)
 
     if (ret < 0) {
         iocb->ret = ret;
+        req->status = NVME_UNRECOVERED_READ;
        goto out;
     } else if (iocb->ret < 0) {
         goto out;
@@ -3504,6 +3519,7 @@ static void nvme_flush_ns_cb(void *opaque, int ret)
 
     if (ret < 0) {
         iocb->ret = ret;
+        iocb->req->status = NVME_WRITE_FAULT;
         goto out;
     } else if (iocb->ret < 0) {
         goto out;
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 66d49b641aa1e89c12103e548320d89995fbbfae..3c8a9ba3c7956c1d475857a1068074338643f77f 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -906,6 +906,7 @@ enum NvmeStatusCodes {
     NVME_SGL_DESCR_TYPE_INVALID = 0x0011,
     NVME_INVALID_USE_OF_CMB     = 0x0012,
     NVME_INVALID_PRP_OFFSET     = 0x0013,
+    NVME_COMMAND_INTERRUPTED    = 0x0021,
     NVME_FDP_DISABLED           = 0x0029,
     NVME_INVALID_PHID_LIST      = 0x002a,
     NVME_LBA_RANGE              = 0x0080,
-- 
2.45.2
[PATCH 2/9] hw/nvme: make oacs dynamic
From: Klaus Jensen

Virtualization Management needs sriov-related parameters. Only report
support for the command when those conditions are true.

Signed-off-by: Klaus Jensen
---
 hw/nvme/ctrl.c       | 25 ++++++++++++++++++++-------
 hw/nvme/nvme.h       |  4 ++++
 include/block/nvme.h |  3 ++-
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index e1d54ee34d2cd073821ac398bc5b4a51cb79e0e9..0e95c07c5314fa33674963ef2cea74c78954e86b 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -266,7 +266,7 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
     [NVME_FDP_EVENTS] = NVME_FEAT_CAP_CHANGE | NVME_FEAT_CAP_NS,
 };
 
-static const uint32_t nvme_cse_acs[256] = {
+static const uint32_t nvme_cse_acs_default[256] = {
     [NVME_ADM_CMD_DELETE_SQ]        = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_CREATE_SQ]        = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_GET_LOG_PAGE]     = NVME_CMD_EFF_CSUPP,
@@ -278,7 +278,6 @@ static const uint32_t nvme_cse_acs_default[256] = {
     [NVME_ADM_CMD_GET_FEATURES]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_ASYNC_EV_REQ]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_NS_ATTACHMENT]    = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
-    [NVME_ADM_CMD_VIRT_MNGMT]       = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_DBBUF_CONFIG]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_FORMAT_NVM]       = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
     [NVME_ADM_CMD_DIRECTIVE_RECV]   = NVME_CMD_EFF_CSUPP,
@@ -5135,7 +5134,7 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint8_t csi, uint32_t buf_len,
         }
     }
 
-    memcpy(log.acs, nvme_cse_acs, sizeof(nvme_cse_acs));
+    memcpy(log.acs, n->cse.acs, sizeof(log.acs));
 
     if (src_iocs) {
         memcpy(log.iocs, src_iocs, sizeof(log.iocs));
@@ -7242,7 +7241,7 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
     trace_pci_nvme_admin_cmd(nvme_cid(req), nvme_sqid(req), req->cmd.opcode,
                              nvme_adm_opc_str(req->cmd.opcode));
 
-    if (!(nvme_cse_acs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
+    if (!(n->cse.acs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
         trace_pci_nvme_err_invalid_admin_opc(req->cmd.opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
     }
@@ -8676,6 +8675,9 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     uint64_t cap = ldq_le_p(&n->bar.cap);
     NvmeSecCtrlEntry *sctrl = nvme_sctrl(n);
     uint32_t ctratt;
+    uint16_t oacs;
+
+    memcpy(n->cse.acs, nvme_cse_acs_default, sizeof(n->cse.acs));
 
     id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
     id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
@@ -8706,9 +8708,18 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
 
     id->mdts = n->params.mdts;
     id->ver = cpu_to_le32(NVME_SPEC_VER);
-    id->oacs =
-        cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | NVME_OACS_DBBUF |
-                    NVME_OACS_DIRECTIVES);
+
+    oacs = NVME_OACS_NMS | NVME_OACS_FORMAT | NVME_OACS_DBBUF |
+        NVME_OACS_DIRECTIVES;
+
+    if (n->params.sriov_max_vfs) {
+        oacs |= NVME_OACS_VMS;
+
+        n->cse.acs[NVME_ADM_CMD_VIRT_MNGMT] = NVME_CMD_EFF_CSUPP;
+    }
+
+    id->oacs = cpu_to_le16(oacs);
+
     id->cntrltype = 0x1;
 
     /*
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 724220691057063a972be2a0ada44d2164266144..191b6c5398d0c4583051a6a9773c677a49caffd6 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -583,6 +583,10 @@ typedef struct NvmeCtrl {
     uint64_t    dbbuf_eis;
     bool        dbbuf_enabled;
 
+    struct {
+        uint32_t acs[256];
+    } cse;
+
     struct {
         MemoryRegion mem;
         uint8_t *buf;
diff --git a/include/block/nvme.h b/include/block/nvme.h
index f4d108841bf595f1176e9cb2c910e230181c67f6..9ebee69369d6bfa6835154a91b2bdaaf7984bf0c 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -1192,8 +1192,9 @@ enum NvmeIdCtrlOacs {
     NVME_OACS_SECURITY      = 1 << 0,
     NVME_OACS_FORMAT        = 1 << 1,
     NVME_OACS_FW            = 1 << 2,
-    NVME_OACS_NS_MGMT       = 1 << 3,
+    NVME_OACS_NMS           = 1 << 3,
     NVME_OACS_DIRECTIVES    = 1 << 5,
+    NVME_OACS_VMS           = 1 << 7,
     NVME_OACS_DBBUF         = 1 << 8,
 };
-- 
2.45.2
[PATCH 7/9] hw/nvme: only set command abort requested when cancelled due to Abort
From: Klaus Jensen

The Command Abort Requested status code should only be set if the
command was explicitly cancelled due to an Abort command. Or, if the
cancellation was due to Submission Queue deletion, set the status code
to Command Aborted due to SQ Deletion.

Signed-off-by: Klaus Jensen
---
 hw/nvme/ctrl.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 8d3f62c40ac14fdc4bdc650e272023558cbbae0f..5b1bac020f049cc2a2f869b12e1d2a7e13cef316 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1777,10 +1777,6 @@ static void nvme_aio_err(NvmeRequest *req, int ret)
         break;
     }
 
-    if (ret == -ECANCELED) {
-        status = NVME_CMD_ABORT_REQ;
-    }
-
     trace_pci_nvme_err_aio(nvme_cid(req), strerror(-ret), status);
 
     error_setg_errno(&local_err, -ret, "aio failed");
@@ -4821,6 +4817,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req)
     while (!QTAILQ_EMPTY(&sq->out_req_list)) {
         r = QTAILQ_FIRST(&sq->out_req_list);
         assert(r->aiocb);
+        r->status = NVME_CMD_ABORT_SQ_DEL;
         blk_aio_cancel(r->aiocb);
     }
 
@@ -6073,6 +6070,7 @@ static uint16_t nvme_abort(NvmeCtrl *n, NvmeRequest *req)
     QTAILQ_FOREACH_SAFE(r, &sq->out_req_list, entry, next) {
         if (r->cqe.cid == cid) {
             if (r->aiocb) {
+                r->status = NVME_CMD_ABORT_REQ;
                 blk_aio_cancel_async(r->aiocb);
             }
             break;
-- 
2.45.2
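For reference (spec values, not part of the patch), both codes live in
the Generic Command Status group:

    NVME_CMD_ABORT_REQ    = 0x0007  /* Command Abort Requested */
    NVME_CMD_ABORT_SQ_DEL = 0x0008  /* Command Aborted due to SQ Deletion */

Assigning r->status at the point of cancellation is what lets the
completion path report the real reason (Abort vs. SQ deletion) instead
of deriving a one-size-fits-all code from -ECANCELED.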
[PATCH 5/9] hw/nvme: be compliant wrt. dsm processing limits
From: Klaus Jensen

The specification states that:

> The controller shall set all three processing limit fields (i.e., the
> DMRL, DMRSL and DMSL fields) to non-zero values or shall clear all
> three processing limit fields to 0h.

So, set the DMRL and DMSL fields in addition to DMRSL.

Signed-off-by: Klaus Jensen
---
 hw/nvme/ctrl.c       | 24 +++++++++++++++---------
 include/block/nvme.h |  2 ++
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 120a1ca1076c8110d8550a5e75082c6ed4f23e16..22a8c400bae332285d015e8b590de159fd7d1b7a 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -5581,7 +5581,9 @@ static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
     switch (c->csi) {
     case NVME_CSI_NVM:
         id_nvm->vsl = n->params.vsl;
+        id_nvm->dmrl = NVME_ID_CTRL_NVM_DMRL_MAX;
         id_nvm->dmrsl = cpu_to_le32(n->dmrsl);
+        id_nvm->dmsl = NVME_ID_CTRL_NVM_DMRL_MAX * n->dmrsl;
         break;
 
     case NVME_CSI_ZONED:
@@ -6638,18 +6640,23 @@ static uint16_t nvme_aer(NvmeCtrl *n, NvmeRequest *req)
     return NVME_NO_COMPLETE;
 }
 
-static void nvme_update_dmrsl(NvmeCtrl *n)
+static void nvme_update_dsm_limits(NvmeCtrl *n, NvmeNamespace *ns)
 {
-    int nsid;
+    if (ns) {
+        n->dmrsl =
+            MIN_NON_ZERO(n->dmrsl, BDRV_REQUEST_MAX_BYTES / nvme_l2b(ns, 1));
 
-    for (nsid = 1; nsid <= NVME_MAX_NAMESPACES; nsid++) {
-        NvmeNamespace *ns = nvme_ns(n, nsid);
+        return;
+    }
+
+    for (uint32_t nsid = 1; nsid <= NVME_MAX_NAMESPACES; nsid++) {
+        ns = nvme_ns(n, nsid);
         if (!ns) {
             continue;
         }
 
-        n->dmrsl = MIN_NON_ZERO(n->dmrsl,
-                                BDRV_REQUEST_MAX_BYTES / nvme_l2b(ns, 1));
+        n->dmrsl =
+            MIN_NON_ZERO(n->dmrsl, BDRV_REQUEST_MAX_BYTES / nvme_l2b(ns, 1));
     }
 }
 
@@ -6737,7 +6744,7 @@ static uint16_t nvme_ns_attachment(NvmeCtrl *n, NvmeRequest *req)
             ctrl->namespaces[nsid] = NULL;
             ns->attached--;
 
-            nvme_update_dmrsl(ctrl);
+            nvme_update_dsm_limits(ctrl, NULL);
 
             break;
 
@@ -8838,8 +8845,7 @@ void nvme_attach_ns(NvmeCtrl *n, NvmeNamespace *ns)
     n->namespaces[nsid] = ns;
     ns->attached++;
 
-    n->dmrsl = MIN_NON_ZERO(n->dmrsl,
-                            BDRV_REQUEST_MAX_BYTES / nvme_l2b(ns, 1));
+    nvme_update_dsm_limits(n, ns);
 }
 
 static void nvme_realize(PCIDevice *pci_dev, Error **errp)
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 145a0b65933a699504d6d89222f7979a06f615df..f3f0317524d129f518698c6797ed37a7ac0ac847 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -1167,6 +1167,8 @@ typedef struct NvmeIdCtrlZoned {
     uint8_t rsvd1[4095];
 } NvmeIdCtrlZoned;
 
+#define NVME_ID_CTRL_NVM_DMRL_MAX 255
+
 typedef struct NvmeIdCtrlNvm {
     uint8_t  vsl;
     uint8_t  wzsl;
-- 
2.45.2
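A gloss on the three fields (meanings per the NVM Command Set
specification; not part of the patch text): DMRL is the maximum number
of ranges in a single Dataset Management command, DMRSL the maximum
number of logical blocks in one range, and DMSL the maximum total number
of logical blocks in one command. That makes the advertised values
naturally consistent:

    dmsl = dmrl * dmrsl    /* 255 ranges x per-range block limit */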
[PATCH 3/9] hw/nvme: add knob for doorbell buffer config support
From: Klaus Jensen

Add a 'dbcs' knob to allow the Doorbell Buffer Config command to be
disabled.

Signed-off-by: Klaus Jensen
---
 hw/nvme/ctrl.c       | 11 ++++++++---
 hw/nvme/nvme.h       |  1 +
 include/block/nvme.h |  2 +-
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 0e95c07c5314fa33674963ef2cea74c78954e86b..d544789f92ffe6b758ce35cecfc025d87efb9b7e 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -278,7 +278,6 @@ static const uint32_t nvme_cse_acs_default[256] = {
     [NVME_ADM_CMD_GET_FEATURES]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_ASYNC_EV_REQ]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_NS_ATTACHMENT]    = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
-    [NVME_ADM_CMD_DBBUF_CONFIG]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_FORMAT_NVM]       = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
     [NVME_ADM_CMD_DIRECTIVE_RECV]   = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_DIRECTIVE_SEND]   = NVME_CMD_EFF_CSUPP,
@@ -8709,8 +8708,13 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     id->mdts = n->params.mdts;
     id->ver = cpu_to_le32(NVME_SPEC_VER);
 
-    oacs = NVME_OACS_NMS | NVME_OACS_FORMAT | NVME_OACS_DBBUF |
-        NVME_OACS_DIRECTIVES;
+    oacs = NVME_OACS_NMS | NVME_OACS_FORMAT | NVME_OACS_DIRECTIVES;
+
+    if (n->params.dbcs) {
+        oacs |= NVME_OACS_DBCS;
+
+        n->cse.acs[NVME_ADM_CMD_DBBUF_CONFIG] = NVME_CMD_EFF_CSUPP;
+    }
 
     if (n->params.sriov_max_vfs) {
         oacs |= NVME_OACS_VMS;
@@ -8960,6 +8964,7 @@ static Property nvme_props[] = {
     DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
     DEFINE_PROP_BOOL("legacy-cmb", NvmeCtrl, params.legacy_cmb, false),
     DEFINE_PROP_BOOL("ioeventfd", NvmeCtrl, params.ioeventfd, false),
+    DEFINE_PROP_BOOL("dbcs", NvmeCtrl, params.dbcs, true),
     DEFINE_PROP_UINT8("zoned.zasl", NvmeCtrl, params.zasl, 0),
     DEFINE_PROP_BOOL("zoned.auto_transition", NvmeCtrl,
                      params.auto_transition_zones, true),
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 191b6c5398d0c4583051a6a9773c677a49caffd6..cb314e91af32a20f47e0a393e2458b7d4bdd03d9 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -539,6 +539,7 @@ typedef struct NvmeParams {
     bool auto_transition_zones;
     bool legacy_cmb;
     bool ioeventfd;
+    bool dbcs;
     uint16_t sriov_max_vfs;
     uint16_t sriov_vq_flexible;
     uint16_t sriov_vi_flexible;
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 9ebee69369d6bfa6835154a91b2bdaaf7984bf0c..a68a07455d0330b8f7cc283da0a5eadbcc140dab 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -1195,7 +1195,7 @@ enum NvmeIdCtrlOacs {
     NVME_OACS_NMS           = 1 << 3,
     NVME_OACS_DIRECTIVES    = 1 << 5,
     NVME_OACS_VMS           = 1 << 7,
-    NVME_OACS_DBBUF         = 1 << 8,
+    NVME_OACS_DBCS          = 1 << 8,
 };
-- 
2.45.2
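Usage note (illustrative invocation, not from the patch): the property
defaults to true, so behavior only changes if a user opts out, e.g. with
-device nvme,dbcs=off. In that case OACS bit 8 stays clear and a
Doorbell Buffer Config admin command is rejected with Invalid Opcode,
since nvme_admin_cmd() consults the per-controller n->cse.acs table
introduced earlier in this series.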
[PATCH 0/9] hw/nvme: refactor/cleanup
Apart from some random small fixes here and there, the major thing here
is cleaning up how we handle command sets. Prior to this series, the
controller would not correctly validate namespace command sets against
CC.CSS. This is fixed here.

The cleanest way of doing this (as far as I could tell) was to make sure
an nvme-subsys device exists (creating it if necessary). This allows us
to "store" the namespaces in the subsystem, using existing
functionality, and attach supported namespaces when the device is
started (instead of when the device is created/realized).

Signed-off-by: Klaus Jensen
---
Klaus Jensen (9):
      hw/nvme: always initialize a subsystem
      hw/nvme: make oacs dynamic
      hw/nvme: add knob for doorbell buffer config support
      nvme: fix iocs status code values
      hw/nvme: be compliant wrt. dsm processing limits
      hw/nvme: rework csi handling
      hw/nvme: only set command abort requested when cancelled due to Abort
      hw/nvme: set error status code explicitly for misc commands
      hw/nvme: remove nvme_aio_err()

 hw/nvme/ctrl.c       | 430 ---
 hw/nvme/ns.c         |  62 ++--
 hw/nvme/nvme.h       |  10 +-
 include/block/nvme.h |  22 ++-
 4 files changed, 276 insertions(+), 248 deletions(-)
---
base-commit: ca80a5d026a280762e0772615f1988db542b3ade
change-id: 20241216-nvme-queue-f4151c5d7507

Best regards,
-- 
Klaus Jensen
[PATCH 9/9] hw/nvme: remove nvme_aio_err()
From: Klaus Jensen

nvme_rw_complete_cb() is the only remaining user of nvme_aio_err(), so
open code the status code setting instead.

Signed-off-by: Klaus Jensen
---
 hw/nvme/ctrl.c | 60 +++++++++++++++++++++++-------------------------------------
 1 file changed, 23 insertions(+), 37 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 8192f92227d6509b8d15fde9d9197a59277eb86f..40f535b03316ed45f6cbb2894fd89f9ce258423e 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1756,42 +1756,6 @@ static uint16_t nvme_check_dulbe(NvmeNamespace *ns, uint64_t slba,
     return NVME_SUCCESS;
 }
 
-static void nvme_aio_err(NvmeRequest *req, int ret)
-{
-    uint16_t status = NVME_SUCCESS;
-    Error *local_err = NULL;
-
-    switch (req->cmd.opcode) {
-    case NVME_CMD_READ:
-        status = NVME_UNRECOVERED_READ;
-        break;
-    case NVME_CMD_WRITE:
-    case NVME_CMD_WRITE_ZEROES:
-    case NVME_CMD_ZONE_APPEND:
-    case NVME_CMD_COPY:
-        status = NVME_WRITE_FAULT;
-        break;
-    default:
-        status = NVME_INTERNAL_DEV_ERROR;
-        break;
-    }
-
-    trace_pci_nvme_err_aio(nvme_cid(req), strerror(-ret), status);
-
-    error_setg_errno(&local_err, -ret, "aio failed");
-    error_report_err(local_err);
-
-    /*
-     * Set the command status code to the first encountered error but allow a
-     * subsequent Internal Device Error to trump it.
-     */
-    if (req->status && status != NVME_INTERNAL_DEV_ERROR) {
-        return;
-    }
-
-    req->status = status;
-}
-
 static inline uint32_t nvme_zone_idx(NvmeNamespace *ns, uint64_t slba)
 {
     return ns->zone_size_log2 > 0 ? slba >> ns->zone_size_log2 :
@@ -2176,8 +2140,30 @@ void nvme_rw_complete_cb(void *opaque, int ret)
     trace_pci_nvme_rw_complete_cb(nvme_cid(req), blk_name(blk));
 
     if (ret) {
+        Error *err = NULL;
+
         block_acct_failed(stats, acct);
-        nvme_aio_err(req, ret);
+
+        switch (req->cmd.opcode) {
+        case NVME_CMD_READ:
+            req->status = NVME_UNRECOVERED_READ;
+            break;
+
+        case NVME_CMD_WRITE:
+        case NVME_CMD_WRITE_ZEROES:
+        case NVME_CMD_ZONE_APPEND:
+            req->status = NVME_WRITE_FAULT;
+            break;
+
+        default:
+            req->status = NVME_INTERNAL_DEV_ERROR;
+            break;
+        }
+
+        trace_pci_nvme_err_aio(nvme_cid(req), strerror(-ret), req->status);
+
+        error_setg_errno(&err, -ret, "aio failed");
+        error_report_err(err);
     } else {
         block_acct_done(stats, acct);
     }
-- 
2.45.2
Re: [PATCH v2 6/6] migration/block: Rewrite disk activation
Peter Xu writes:

> This patch proposes a flag to maintain disk activation status globally.
> It mostly rewrites disk activation management for QEMU, including COLO
> and the QMP command xen_save_devices_state.
>
> Backgrounds
> ===========
>
> We have two problems with disk activations, one resolved, one not.
>
> Problem 1: disk activation recovery (for switchover interruptions)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> When migration is either cancelled or failed during switchover,
> especially after the disks have been inactivated, QEMU needs to
> remember to re-activate the disks again before the VM starts.
>
> It used to be done separately in two paths: one in qmp_migrate_cancel(),
> the other one in the failure path of migration_completion().
>
> It used to be fixed in different commits, all over the place in QEMU.
> These are the relevant changes I saw; I'm not sure if the list is
> complete:
>
> - In 2016, commit fe904ea824 ("migration: regain control of images when
>   migration fails to complete")
>
> - In 2017, commit 1d2acc3162 ("migration: re-active images while
>   migration been canceled after inactive them")
>
> - In 2023, commit 6dab4c93ec ("migration: Attempt disk reactivation in
>   more failure scenarios")
>
> Now that we have a slightly better picture, maybe we can unify the
> reactivation in a single path.
>
> One side benefit of doing so is that we can move the disk operation
> outside the QMP command "migrate_cancel". It's possible that in the
> future we may want to make "migrate_cancel" OOB-compatible, and that
> requires the command to not need the BQL in the first place. This
> change already does that and makes the migrate_cancel command
> lightweight.
>
> Problem 2: disk invalidation on top of invalidated disks
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> This is an unresolved bug in current QEMU; see the link in "Resolves:"
> at the end. It turns out that besides the src switchover phase (problem
> 1 above), QEMU also needs to remember block activation on the
> destination.
>
> Consider two migrations in a row, where the VM was always paused. In
> that scenario, the disks are not activated even after migration
> completed in the 1st round. When the 2nd round starts, if QEMU doesn't
> know the status of the disks, it needs to try to inactivate the disks
> again.
>
> Here the issue is that the block layer API bdrv_inactivate_all() will
> crash QEMU if invoked on already inactive disks for the 2nd migration.
> For details, see the bug link at the end.
>
> Implementation
> ==============
>
> This patch proposes to maintain disk activation with a global flag, so
> we know:
>
> - If we used to inactivate disks for migration, but migration got
>   cancelled, or failed, QEMU will know it should reactivate the disks.
>
> - On the incoming side, if the disks are never activated but then
>   another migration is triggered, QEMU should be able to tell that
>   inactivation is not needed for the 2nd migration.
>
> We used to have disk_inactive, but it only solves the 1st issue, not
> the 2nd. Also, it's done in completely separate paths, so it's
> extremely hard to follow either how the flag changes, or the duration
> for which the flag is valid, and when we will reactivate the disks.
>
> Convert the existing disk_inactive flag into that global flag (also
> invert its naming), and maintain the disk activation status for the
> whole lifecycle of QEMU. That includes the incoming QEMU.
>
> Put both of the error cases of source migration (failure, cancelled)
> together into migration_iteration_finish(), which will be invoked for
> either of the scenarios. So from that part QEMU should behave the same
> as before.
>
> However, with such global maintenance of the disk activation status,
> we not only clean up quite a few temporary paths that tried to maintain
> the disk activation status (e.g. in the postcopy code), but also fix
> the crash for problem 2 in one shot.
>
> For a freshly started QEMU, the flag is initialized to TRUE, showing
> that the QEMU owns the disks by default.
>
> For an incoming migrated QEMU, the flag will be initialized to FALSE
> once and for all, showing that the dest QEMU doesn't own the disks
> until switchover. That is guaranteed by the "once" variable.
>
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2395
> Signed-off-by: Peter Xu

Reviewed-by: Fabiano Rosas

Just a note about errp below and a comment about a pre-existing
condition, but nothing that requires action.

> ---
>  include/migration/misc.h |  4 ++
>  migration/migration.h    |  6 +--
>  migration/block-active.c | 94 ++++++++++++++++++++
>  migration/colo.c         |  2 +-
>  migration/migration.c    | 80 --
>  migration/savevm.c       | 33 ++
>  monitor/qmp-cmds.c       |  8 +---
>  migration/meson.build    |  1 +
>  migration/trace-events   |  3 ++
>  9 files changed, 140 insertions(+), 91 deletions(-)
>  create mode 100644 migration/block
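For readers skimming the thread, a minimal sketch of the single-flag
idea (function and variable names are illustrative, not necessarily the
actual migration/block-active.c API):

    static bool block_active = true;  /* a freshly started QEMU owns its disks */

    bool migration_block_activate(Error **errp)
    {
        Error *local_err = NULL;

        if (block_active) {
            return true;              /* idempotent: nothing to re-activate */
        }

        bdrv_activate_all(&local_err);
        if (local_err) {
            error_propagate(errp, local_err);
            return false;
        }

        block_active = true;
        return true;
    }

    bool migration_block_inactivate(void)
    {
        if (!block_active) {
            /* already handed over; avoids the double-inactivation crash
             * described as problem 2 above */
            return true;
        }

        if (bdrv_inactivate_all()) {
            return false;
        }

        block_active = false;
        return true;
    }

Because both helpers consult the same flag, every caller (switchover
failure, migrate_cancel, qmp_cont, a second incoming migration) gets
idempotent behavior for free.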
Re: libnfs 6.0.0 breaks qemu compilation
On Sat, Dec 14, 2024 at 01:40:45PM +0100, Paolo Bonzini wrote:
> On Fri, Dec 13, 2024, at 20:19, Richard W.M. Jones wrote:
> > On Fri, Dec 13, 2024 at 07:37:10PM +0100, Paolo Bonzini wrote:
> > > Yeah, and I don't think it should be merged, unless libnfs support
> > > is dropped from the QEMU build in rawhide.
> >
> > Sure, if there's no easy fix on the horizon, we can remove libnfs
> > support temporarily.
>
> Can we just keep the old libnfs indefinitely? Are there any killer
> features for dependencies other than QEMU?

QEMU needs fixing to work with both old and new libnfs. Even if Fedora
pinned to the old libnfs, we can be sure that sooner or later the new
libnfs will appear in other distros, and thus we will inevitably need to
deal with this incompatibility.

With regards,
Daniel

-- 
|: https://berrange.com          -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org           -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o- https://www.instagram.com/dberrange :|
Re: [PATCH] vvfat: fix ubsan issue in create_long_filename
On 12/4/24 11:51, Pierrick Bouvier wrote:
> Found with test sbsaref introduced in [1].
> [...]

cc qemu-trivial.