> -----Original Message-----
> From: Klaus Jensen <i...@irrelevant.dk>
> Sent: Monday, September 28, 2020 6:43 AM
> To: Dmitry Fomichev <dmitry.fomic...@wdc.com>
> Cc: Keith Busch <kbu...@kernel.org>; Klaus Jensen <k.jen...@samsung.com>;
> Kevin Wolf <kw...@redhat.com>; Philippe Mathieu-Daudé <phi...@redhat.com>;
> Maxim Levitsky <mlevi...@redhat.com>; Fam Zheng <f...@euphon.net>;
> Niklas Cassel <niklas.cas...@wdc.com>; Damien Le Moal <damien.lem...@wdc.com>;
> qemu-bl...@nongnu.org; qemu-devel@nongnu.org;
> Alistair Francis <alistair.fran...@wdc.com>; Matias Bjorling <matias.bjorl...@wdc.com>
> Subject: Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set
>
> On Sep 28 11:35, Dmitry Fomichev wrote:
> > The emulation code has been changed to advertise NVM Command Set when
> > the "zoned" device property is not set (default) and Zoned Namespace
> > Command Set otherwise.
> >
> > Handlers for three new NVMe commands introduced in the Zoned Namespace
> > Command Set specification are added, namely for Zone Management
> > Receive, Zone Management Send and Zone Append.
> >
> > Device initialization code has been extended to create a proper
> > configuration for zoned operation using device properties.
> >
> > The Read/Write command handler is modified to only allow writes at the
> > write pointer if the namespace is zoned. For the Zone Append command,
> > writes implicitly happen at the write pointer and the starting write
> > pointer value is returned as the result of the command. The Write Zeroes
> > handler is modified to add zoned checks that are identical to those
> > done as a part of the Write flow.
> >
> > The code to support Zone Descriptor Extensions is not included in
> > this commit and ZDES 0 is always reported. A later commit in this
> > series will add ZDE support.
> >
> > This commit doesn't yet include checks for active and open zone
> > limits. It is assumed that there are no limits on either active or
> > open zones.
> >
>
> I think the fill_pattern feature stands separate, so it would be nice to
> extract that into a patch of its own.
>
> > Signed-off-by: Niklas Cassel <niklas.cas...@wdc.com>
> > Signed-off-by: Hans Holmberg <hans.holmb...@wdc.com>
> > Signed-off-by: Ajay Joshi <ajay.jo...@wdc.com>
> > Signed-off-by: Chaitanya Kulkarni <chaitanya.kulka...@wdc.com>
> > Signed-off-by: Matias Bjorling <matias.bjorl...@wdc.com>
> > Signed-off-by: Aravind Ramesh <aravind.ram...@wdc.com>
> > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawas...@wdc.com>
> > Signed-off-by: Adam Manzanares <adam.manzana...@wdc.com>
> > Signed-off-by: Dmitry Fomichev <dmitry.fomic...@wdc.com>
> > ---
> >  block/nvme.c         |   2 +-
> >  hw/block/nvme-ns.c   | 185 ++++++++-
> >  hw/block/nvme-ns.h   |   6 +-
> >  hw/block/nvme.c      | 872 +++++++++++++++++++++++++++++++++++++++++--
> >  include/block/nvme.h |   6 +-
> >  5 files changed, 1033 insertions(+), 38 deletions(-)
> >
> > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > index 04172f083e..daa13546c4 100644
> > --- a/hw/block/nvme-ns.h
> > +++ b/hw/block/nvme-ns.h
> > @@ -38,7 +38,6 @@ typedef struct NvmeZoneList {
> >
> >  typedef struct NvmeNamespaceParams {
> >      uint32_t nsid;
> > -    uint8_t csi;
> >      bool attached;
> >      QemuUUID uuid;
> >
> > @@ -52,6 +51,7 @@ typedef struct NvmeNamespace {
> >      DeviceState parent_obj;
> >      BlockConf blkconf;
> >      int32_t bootindex;
> > +    uint8_t csi;
> >      int64_t size;
> >      NvmeIdNs id_ns;
>
> This should be squashed into the namespace types patch.
>
Yes, thanks.

> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 63ad03d6d6..38e25a4d1f 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -54,6 +54,7 @@
> >  #include "qemu/osdep.h"
> >  #include "qemu/units.h"
> >  #include "qemu/error-report.h"
> > +#include "crypto/random.h"
>
> I think this is not used until the offline/read-only zones injection
> patch, right?
>

Indeed, will move.

> > +static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
> > +                                      bool failed)
> > +{
> > +    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> > +    NvmeZone *zone;
> > +    uint64_t slba, start_wp = req->cqe.result64;
> > +    uint32_t nlb, zone_idx;
> > +    uint8_t zs;
> > +
> > +    if (rw->opcode != NVME_CMD_WRITE &&
> > +        rw->opcode != NVME_CMD_ZONE_APPEND &&
> > +        rw->opcode != NVME_CMD_WRITE_ZEROES) {
> > +        return false;
> > +    }
> > +
> > +    slba = le64_to_cpu(rw->slba);
> > +    nlb = le16_to_cpu(rw->nlb) + 1;
> > +    zone_idx = nvme_zone_idx(ns, slba);
> > +    assert(zone_idx < ns->num_zones);
> > +    zone = &ns->zone_array[zone_idx];
> > +
> > +    if (!failed && zone->w_ptr < start_wp + nlb) {
> > +        /*
> > +         * A preceding queued write to the zone has failed,
> > +         * now this write is not at the WP, fail it too.
> > +         */
> > +        failed = true;
> > +    }
> > +
> > +    if (failed) {
> > +        if (zone->w_ptr > start_wp) {
> > +            zone->w_ptr = start_wp;
> > +        }
>
> It is possible (though unlikely) that you already posted the CQE for
> the write that moved the WP to w_ptr - and now you are reverting it.
> This looks like a recipe for data corruption to me.
>
> Take this example. I use append, because if you have multiple regular
> writes in queue you're screwed anyway.
>
>   w_ptr = 0, d.wp = 0
>   append 1 lba  -> w_ptr = 1, start_wp = 0, issues aio A
>   append 2 lbas -> w_ptr = 3, start_wp = 1, issues aio B
>
>   aio B success -> d.wp = 2 (since you are adding nlb)
>
> Now, I totally do the same. Even though the zone descriptor write
> pointer gets "out of sync", it will be reconciled in the absence of
> failures, and it's fair to define that the host cannot expect a
> consistent view of the write pointer without quiescing I/O.
>
> The problem is if a write then fails:
>
>   aio A fails -> w_ptr > start_wp (3 > 1), so you revert to w_ptr = 1
>
> That looks bad to me. I don't think this is ever reconciled? If
> another append then comes in:
>
>   append 1 lba -> w_ptr = 2, start_wp = 1, issues aio C and overwrites
>   the second append from before
>   aio C success -> d.wp = 3 (but it should be 2)
>

Right, need to sync w_ptr and d.wp here (quick sketch further down).
Good find!

> > @@ -1513,11 +2267,16 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
> >  static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
> >  {
> >      NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
> > +    NvmeIdCtrlZoned id = {};
> >
> >      trace_pci_nvme_identify_ctrl_csi(c->csi);
> >
> >      if (c->csi == NVME_CSI_NVM) {
> >          return nvme_rpt_empty_id_struct(n, req);
> > +    } else if (c->csi == NVME_CSI_ZONED) {
> > +        id.zasl = n->zasl;
>
> I don't think it should overwrite the zasl value specified by the
> user. If the user specified 0, then it should return 0 for zasl here.

Not sure if I get this. The value of n->zasl is calculated based on the
setting given by the user (or by default).
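
And here is the quick sketch of the write pointer reconciliation for
the failed write case discussed above (untested; assuming zone->d.wp is
the zone descriptor write pointer, as in your example):

    if (failed) {
        if (zone->w_ptr > start_wp) {
            /*
             * Roll the internal write pointer back and keep the zone
             * descriptor write pointer in sync with it, so that the
             * next write/append starts at the correct LBA.
             */
            zone->w_ptr = start_wp;
            zone->d.wp = start_wp;
        }
    }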
> > > @@ -2310,16 +3086,28 @@ static int nvme_start_ctrl(NvmeCtrl *n) > > continue; > > } > > ns->params.attached = false; > > - switch (ns->params.csi) { > > + switch (ns->csi) { > > case NVME_CSI_NVM: > > if (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY || > > NVME_CC_CSS(n->bar.cc) == CSS_CSI) { > > ns->params.attached = true; > > } > > break; > > + case NVME_CSI_ZONED: > > + if (NVME_CC_CSS(n->bar.cc) == CSS_CSI) { > > + ns->params.attached = true; > > + } > > + break; > > } > > } > > > > + if (!n->zasl_bs) { > > + assert(n->params.mdts); > > A value of 0 for MDTS is perfectly valid. Ok, need to remove this assert. > > > @@ -2382,10 +3170,11 @@ static void nvme_write_bar(NvmeCtrl *n, > hwaddr offset, uint64_t data, > > case CSS_NVM_ONLY: > > trace_pci_nvme_css_nvm_cset_selected_by_host(data & > > > > 0xffffffff); > > - break; > > + break; > > Spurious misaligned break here. Nice catch! It's misaligned by 4, so checkpatch doesn't complain about it :)