Re: [PATCH 0/3] nvme: Don't add namespaces for locked drives
Hi all, If an NVMe drive is locked with ATA Security, most commands sent to the drive will fail. This includes commands sent by the kernel upon discovery to probe for partitions. The failures happen in such a way that trying to do anything with the drive (e.g. sending an unlock command; unloading the nvme module) is basically impossible with the high default command timeout. This patch adds a check to see if the drive is locked, and if it is, its namespaces are not initialized. It is expected that userspace will send the proper "security send/unlock" command and then reset the controller. Userspace tools are available at [1]. This is my first kernel patch so please let me know if you have any feedback. I intend to also submit a future patch that tracks ATA Security commands sent from userspace and remembers the password so it can be submitted to a locked drive upon pm_resume. (still WIP) Jethro Beekman [1] https://github.com/jethrogb/nvme-ata-security Jethro Beekman (3): nvme: When scanning namespaces, make sure the drive is not locked nvme: Add function for NVMe security receive command nvme: Check if drive is locked using ATA Security Hey Jethro, I think it would make more sense to squash patches 1 and 3 together and have patch 2 come before them: patch 1: nvme: Add function for NVMe security receive command patch 2: nvme: Check if drive is locked using ATA Security when scanning namespaces
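A rough sketch of the shape such a check could take (illustrative only, not the actual patch: the security protocol number follows SAT's ATA Security protocol 0xEF, and the parsing of the returned status is an assumption):
--
/*
 * Hypothetical helper: issue an NVMe Security Receive for the ATA
 * Security protocol and report whether the drive is locked.
 * CDW10/CDW11 layout follows the NVMe spec (SECP in CDW10 bits 31:24,
 * allocation length in CDW11); SPSP/NSSF are left at 0 as an assumption.
 */
static int nvme_ata_security_is_locked(struct nvme_ctrl *ctrl)
{
	struct nvme_command c = { };
	u8 buf[16] = { };
	int ret;

	c.common.opcode = nvme_admin_security_recv;
	c.common.cdw10[0] = cpu_to_le32((u32)0xef << 24);	/* SECP = ATA Security */
	c.common.cdw10[1] = cpu_to_le32(sizeof(buf));		/* allocation length */

	ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, buf, sizeof(buf));
	if (ret)
		return ret;

	/* placeholder: bit position of the LOCKED flag in the payload is assumed */
	return !!(buf[1] & 0x04);
}
--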
Re: [rfc] weirdness in bio_map_user_iov()
Hey Al, What happens if we feed it a 3-element iovec array, one page in each? AFAICS, bio_add_pc_page() is called for each of those pages, even if the previous calls have failed - break is only out of the inner loop. Sure, failure due to exceeded request size means that everything after that one will fail, but what of e.g. /* * If the queue doesn't support SG gaps and adding this * offset would create a gap, disallow it. */ if (bvec_gap_to_prev(q, prev, offset)) return 0; in there? Won't we risk having the first and the third pages added, with the second one quietly skipped? Jens, looks like it had come from you (by way of jejb). Am I missing something subtle here? So AFAICT, 'gappy' iovecs will never reach bio_map_user_iov() because it is checked before in blk_rq_map_user_iov(): if (map_data) copy = true; else if (iov_iter_alignment(iter) & align) copy = true; else if (queue_virt_boundary(q)) copy = queue_virt_boundary(q) & iov_iter_gap_alignment(iter); if iov_iter_gap_alignment(iter) detects gaps it will use a nice aligned bounce via bio_copy_user_iov().
Re: [PATCH 11/13] nvme: switch to use pci_alloc_irq_vectors
On 14/09/16 07:18, Christoph Hellwig wrote: Use the new helper to automatically select the right interrupt type, as well as to use the automatic interrupt affinity assignment. Patch title and the change description are a little short IMO to describe what is going on here (need the blk-mq side too). I also think it would be better to split this into 2 patches, but that's really not a must... +static int nvme_pci_map_queues(struct blk_mq_tag_set *set) +{ + struct nvme_dev *dev = set->driver_data; + + return blk_mq_pci_map_queues(set, to_pci_dev(dev->dev)); +} + Question: is using pci_alloc_irq_vectors() required for supplying blk-mq with the device affinity mask? If I do this completely-untested [1] what will happen? [1]: -- diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 8d2875b4c56d..76693d406efe 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -1518,6 +1518,14 @@ static void nvme_rdma_complete_rq(struct request *rq) blk_mq_end_request(rq, error); } +static int nvme_rdma_map_queues(struct blk_mq_tag_set *set) +{ + struct nvme_rdma_ctrl *ctrl = set->driver_data; + struct device *dev = ctrl->device->dev.dma_device; + + return blk_mq_pci_map_queues(set, to_pci_dev(dev)); +} + static struct blk_mq_ops nvme_rdma_mq_ops = { .queue_rq = nvme_rdma_queue_rq, .complete = nvme_rdma_complete_rq, @@ -1528,6 +1536,7 @@ static struct blk_mq_ops nvme_rdma_mq_ops = { .init_hctx = nvme_rdma_init_hctx, .poll = nvme_rdma_poll, .timeout= nvme_rdma_timeout, + .map_queues = nvme_rdma_map_queues, }; static struct blk_mq_ops nvme_rdma_admin_mq_ops = { --
Re: [PATCH] nvme: Fix nvme_get/set_features() with a NULL result pointer
Looks fine, Reviewed-by: Sagi Grimberg
Re: [PATCH] nvme: add hostid token to fabric options
Reviewed-by: Sagi Grimberg
Re: [PATCH v2 08/15] nvmet: make config_item_type const
Acked-by: Sagi Grimberg
Re: [PATCH V3] nvme: fix multiple ctrl removal scheduling
Hi Rakesh, this looks reasonable, but we'll need to also adapt the non-PCI drivers to the new state machine. I can give this a spin. At that point we probably want to move nvme_reset into common code somehow. Hi Guys, sorry for barging in late, I've been way too busy with internal stuff lately... I think that adding a new state should (a) be done with a careful understanding that it's absolutely needed and (b) not complicate the state machine. I honestly think that adding a new state that says "we scheduled a reset" to address a synchronization issue is not what we should do. 1. I think that state NVME_CTRL_RESETTING semantically means that the reset flow has been scheduled, and the state transition atomicity suffices for synchronization. So nvme_reset should change the state and, if that succeeded, schedule the reset_work instead of changing the state inside reset_work (like we do in fabrics, see the sketch below). At this point we should lose the WARN_ON. 2. I personally think that nvme_probe shouldn't necessarily trigger a controller reset; if we can split reset into a couple of useful routines we can reuse them in nvme_probe. The reason is that for reset we need to address various conditions (errors, ongoing traffic etc...) that are not relevant at all for probe. Not sure if anyone agrees with me on this one. I started to experiment with trying to move some of this to nvme core[1] (rdma and loop) but I have yet to fully convert pci, which is a bit more complicated. [1] git://git.infradead.org/users/sagi/linux.git nvme-central-reset-delete-err
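To make point (1) concrete, here is a minimal sketch of what that would look like for pci, assuming the nvme_dev/nvme_workq names of that time (not a final patch):
--
static int nvme_reset(struct nvme_dev *dev)
{
	/* the state transition atomicity is the synchronization point */
	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
		return -EBUSY;
	if (!queue_work(nvme_workq, &dev->reset_work))
		return -EBUSY;
	return 0;
}
--
With that, nvme_reset_work no longer needs to change the state itself and the WARN_ON at its start goes away.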
Re: [PATCH V2] nvme: fix nvme_remove going to uninterruptible sleep for ever
/* +* Avoid configuration and syncing commands if controller is already +* being removed and queues have been killed. +*/ + if (ctrl->state == NVME_CTRL_DELETING || ctrl->state == NVME_CTRL_DEAD) + return; + Hey Rakesh, Christoph, Given that the issue is for sync command submission during controller removal, I'm wondering if we should perhaps move this check to __nvme_submit_sync_cmd? AFAICT user-space can just as easily trigger set_features in the same condition, which would trigger the same hang, couldn't it?
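Something along these lines is what I have in mind; a sketch only, and how the ctrl is reached from the sync-command path is deliberately left open here:
--
/* hypothetical helper, for illustration only */
static bool nvme_ctrl_is_dying(struct nvme_ctrl *ctrl)
{
	return ctrl->state == NVME_CTRL_DELETING ||
	       ctrl->state == NVME_CTRL_DEAD;
}

	/* early in the sync submission path, wherever the ctrl is reachable: */
	if (nvme_ctrl_is_dying(ctrl))
		return -ENODEV;
--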
Re: [PATCH V2 1/1] nvme: fix multiple ctrl removal scheduling
Allowing multiple resets can result in multiple controller removals as well, if different conditions inside nvme_reset_work fail, which might deadlock on device_release_driver. This patch makes sure that the work queue item (reset_work) is added only if the controller state != NVME_CTRL_RESETTING, and that is achieved by moving the state change outside nvme_reset_work into nvme_reset and removing the old work_busy call. The state change is always synchronized using the controller spinlock. So, the reason the state is changed when the work is running rather than at queueing time is the window when the state may be set to NVME_CTRL_DELETING, and we don't want the reset work to proceed in that case. What do you think about adding a new state, like NVME_CTRL_SCHED_RESET, then leaving the NVME_CTRL_RESETTING state change as-is? OK, just got to this one. Instead of adding yet another state, how about making controller delete cancel the reset_work (cancel_work_sync)?
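A sketch of that alternative (the placement in the removal path is illustrative only):
--
	/* in the delete/remove path, before tearing the controller down */
	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
	cancel_work_sync(&dev->reset_work);	/* no reset can be running past this point */
--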
Re: [PATCH v3] nvme/pci: remap BAR0 to cover admin CQ doorbell for large stride
Looks good to me, Reviewed-by: Sagi Grimberg
Re: [PATCH] nvme: host: core: fix NULL pointer dereference in nvme_pr_command
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH] IB/iser: fix a comment typo
On 10/4/2015 12:00 PM, Geliang Tang wrote: Just fix a typo in the code comment. Signed-off-by: Geliang Tang Acked-by: Sagi Grimberg
Re: [PATCH] IB/core: avoid 32-bit warning
On 10/7/2015 3:29 PM, Arnd Bergmann wrote: The INIT_UDATA() macro requires a pointer or unsigned long argument for both input and output buffer, and all callers had a cast from when the code was merged until a recent restructuring, so now we get core/uverbs_cmd.c: In function 'ib_uverbs_create_cq': core/uverbs_cmd.c:1481:66: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] This makes the code behave as before by adding back the cast to unsigned long. Signed-off-by: Arnd Bergmann Fixes: 565197dd8fb1 ("IB/core: Extend ib_uverbs_create_cq") diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index be4cb9f04be3..88b3b78340f2 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -1478,7 +1478,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file, if (copy_from_user(&cmd, buf, sizeof(cmd))) return -EFAULT; - INIT_UDATA(&ucore, buf, cmd.response, sizeof(cmd), sizeof(resp)); + INIT_UDATA(&ucore, buf, (unsigned long)cmd.response, sizeof(cmd), sizeof(resp)); Would it make sense to cast inside INIT_UDATA() and not have callers worry about it?
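If we went that way, one possible shape would be the following sketch, based on the macro's existing fields and not a tested change:
--
#define INIT_UDATA(udata, ibuf, obuf, ilen, olen)			\
	do {								\
		(udata)->inbuf	= (const void __user *)(unsigned long)(ibuf); \
		(udata)->outbuf	= (void __user *)(unsigned long)(obuf);	\
		(udata)->inlen	= (ilen);				\
		(udata)->outlen	= (olen);				\
	} while (0)
--
Callers could then pass cmd.response (a u64) directly and the cast would live in one place.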
Re: [PATCH] iscsi-target: return -ENOMEM instead of -1 in case of failed kmalloc()
On 10/19/2015 11:18 PM, Luis de Bethencourt wrote: Smatch complains about returning hard coded error codes, silence this warning. drivers/target/iscsi/iscsi_target_parameters.c:211 iscsi_create_default_params() warn: returning -1 instead of -ENOMEM is sloppy Signed-off-by: Luis de Bethencourt Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH v4 1/1] nvme: trace: add disk name to tracepoints
On 06/25/2018 10:08 AM, Johannes Thumshirn wrote: On Tue, Jun 19, 2018 at 05:09:27PM +0300, Sagi Grimberg wrote: We are going to need it for traffic based keep alive to update that we saw a completion to extend the kato. But I suggest you simply keep a ctrl reference in struct nvme_request instead so you don't need to pass it to nvme_complete_req (that's what I did for traffic based keep alive). Do you have a patch for this around? IIRC I started this (as Christoph also suggested it) but it turned out to be quite a big refactoring work. How about the below? patch #1 is what you are looking for, patch #2 is a slightly modified version that applies on #1. Let me know what you think... [1]: -- nvme: cache struct nvme_ctrl reference to struct nvme_request We will need to reference the controller in the setup and completion time for tracing and future traffic based keep alive support. Signed-off-by: Sagi Grimberg diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index e8cdb5409725..f53416619905 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -390,6 +390,7 @@ static inline void nvme_clear_nvme_request(struct request *req) if (!(req->rq_flags & RQF_DONTPREP)) { nvme_req(req)->retries = 0; nvme_req(req)->flags = 0; + nvme_req(req)->ctrl = NULL; req->rq_flags |= RQF_DONTPREP; } } @@ -622,8 +623,8 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns, return 0; } -blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req, - struct nvme_command *cmd) +blk_status_t nvme_setup_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns, + struct request *req, struct nvme_command *cmd) { blk_status_t ret = BLK_STS_OK; @@ -652,6 +653,7 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req, } cmd->common.command_id = req->tag; + nvme_req(req)->ctrl = ctrl; if (ns) trace_nvme_setup_nvm_cmd(req->q->id, cmd); else diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c index b528a2f5826c..99f683ed079e 100644 --- a/drivers/nvme/host/fc.c +++ b/drivers/nvme/host/fc.c @@ -2274,7 +2274,7 @@ nvme_fc_queue_rq(struct blk_mq_hw_ctx *hctx, !nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready)) return nvmf_fail_nonready_command(rq); - ret = nvme_setup_cmd(ns, rq, sqe); + ret = nvme_setup_cmd(&ctrl->ctrl, ns, rq, sqe); if (ret) return ret; diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 231807cbc849..e4a2145f3c9a 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -102,6 +102,7 @@ struct nvme_request { u8 retries; u8 flags; u16 status; + struct nvme_ctrl*ctrl; }; /* @@ -419,8 +420,8 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl); #define NVME_QID_ANY -1 struct request *nvme_alloc_request(struct request_queue *q, struct nvme_command *cmd, blk_mq_req_flags_t flags, int qid); -blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req, - struct nvme_command *cmd); +blk_status_t nvme_setup_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns, + struct request *req, struct nvme_command *cmd); int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd, void *buf, unsigned bufflen); int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd, diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index fc33804662e7..377e08c70666 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -877,7 +877,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx, if (unlikely(nvmeq->cq_vector < 0)) return BLK_STS_IOERR; - ret = nvme_setup_cmd(ns, req, &cmnd); 
+ ret = nvme_setup_cmd(&dev->ctrl, ns, req, &cmnd); if (ret) return ret; diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index f621920af823..4de8017da484 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -1691,7 +1691,7 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx, ib_dma_sync_single_for_cpu(dev, sqe->dma, sizeof(struct nvme_command), DMA_TO_DEVICE); - ret = nvme_setup_cmd(ns, rq, c); + ret = nvme_setup_cmd(&queue->ctrl->ctrl, ns, rq, c); if (ret) return ret; diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c index d8d91f04bd7e..888bd3fefc4d 100644 --- a/drivers/nvme/target/loop.c +++ b/drivers/nvme/target/loop.c @@ -164,7 +164,7 @@ static blk_status_t nvme_lo
Re: [PATCH v5 1/2] nvme: add tracepoint for nvme_setup_cmd
Looks good Johannes, Reviewed-by: Sagi Grimberg
Re: [PATCH v5 2/2] nvme: add tracepoint for nvme_complete_rq
Reviewed-by: Sagi Grimberg
Re: [PATCH RESENT] nvme-pci: introduce RECONNECTING state to mark initializing procedure
This looks fine to me, but we need Keith to ack on this.
Re: [RFC] NVMe Configuration using sysctl
Hi, Hi Oza, we are configuring interrupt coalescing for NVMe, but right now it uses a module param, so the same interrupt coalescing settings get applied to all the NVMe controllers connected to different RCs. Ideally it should be configurable via sysctl. If at all, I would place this in nvme-cli (via ioctl) instead of sysctl. E.g. sysctl should provide an interface to change per-CPU IO queue pairs, interrupt coalescing settings etc.. My personal feeling is that per-CPU granularity is a lot to take in for the user, and also can yield some unexpected performance characteristics. But I might be wrong here.. Please suggest whether we could have/implement a sysctl interface for NVMe? I have asked this before, but interrupt coalescing has very little merit without being able to be adaptive. Net drivers maintain online stats and schedule interrupt coalescing modifications. Should work in theory, but having said that, interrupt coalescing as a whole is essentially unusable in nvme since the coalescing time limit is in units of 100us increments...
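For reference, here is roughly what the suggested nvme-cli/ioctl route boils down to: a userspace sketch that sets the Interrupt Coalescing feature (FID 0x08) through the admin passthrough ioctl. The device path and the threshold/time values are examples only; the cdw11 encoding (threshold in bits 7:0, aggregation time in 100us units in bits 15:8) follows the NVMe spec:
--
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	struct nvme_admin_cmd cmd;
	int fd = open("/dev/nvme0", O_RDWR);	/* controller character device */

	if (fd < 0)
		return 1;

	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode = 0x09;		/* Set Features */
	cmd.cdw10  = 0x08;		/* FID: Interrupt Coalescing */
	cmd.cdw11  = (10 << 8) | 8;	/* ~1ms aggregation time, threshold of 8 */

	if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
		perror("set-features");
		return 1;
	}
	return 0;
}
--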
Re: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem
Hey Keith, Thanks for the fix. It looks like we still have a problem, though. Commands submitted with the "shutdown_lock" held need to be able to make forward progress without relying on a completion, but this one could block indefinitely. Can you explain to me why is the shutdown_lock needed to synchronize nvme_dev_disable? More concretely, how is nvme_dev_disable different from other places where we rely on the ctrl state to serialize stuff? The only reason I see would be to protect against completion-after-abort scenario but I think the block layer should protect against it (checks if the request timeout timer fired).
Re: [Suspected-Phishing]Re: [PATCH V3 1/2] nvme: split resetting state into reset_prepate and resetting
After Sagi's "nvme-rdma: fix concurrent reset and reconnect", the rdma ctrl state is changed to the RECONNECTING state after some clearing and shutdown work, followed by the initializing procedure, whether via the reset work path or the error recovery path. The fc reset work also does the same thing. So if we define RESET_PREPARE to cover the scheduling gap plus the disable and clear work, and RESETTING to cover the initializing procedure, then RECONNECTING is very similar to RESETTING. Maybe we could do it like this: In nvme fc/rdma - set state to RESET_PREPARE, queue reset_work/err_work - clear/shutdown work, set state to RECONNECTING Should be fine. In nvme pci - set state to RESET_PREPARE, queue reset_work - clear/shutdown work, set state to RESETTING - initialization, set state to LIVE Given that we split the reset state and we have a clear symmetry between the transports, do we want to maybe come up with a uniform set of states that is coherent across all transports? Maybe we rename them to NVME_CTRL_SHUTTING_DOWN and NVME_CTRL_ESTABLISHING? I'm open to better names...
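To summarize the proposed flow (a sketch using the names discussed above):
--
/*
 * fabrics (rdma/fc):
 *   LIVE -> RESET_PREPARE   (queue reset_work / err_work)
 *        -> teardown/shutdown of queues
 *        -> RECONNECTING    (re-establish the association)
 *        -> LIVE
 *
 * pci:
 *   LIVE -> RESET_PREPARE   (queue reset_work)
 *        -> disable/teardown
 *        -> RESETTING       (re-initialization)
 *        -> LIVE
 */
--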
Re: [Suspected-Phishing]Re: [PATCH V3 1/2] nvme: split resetting state into reset_prepate and resetting
see V4 discussion.. :) I didn't see any v4
Re: [PATCH] nvme-pci: calculate iod and avg_seg_size just before use them
The calculation of iod and avg_seg_size may be meaningless if nvme_pci_use_sgls() returns before using them, so calculate them just before use. The compiler will do the right thing here, but I see what you mean. I think Christoph has some SGL improvements coming, though, that will obviate this. I think that if it's not coming for 4.16, it should be easy enough to take it in. FWIW, Reviewed-by: Sagi Grimberg
Re: [PATCH V3 1/2] nvme: split resetting state into reset_prepate and resetting
Currently, the ctrl->state will be changed to NVME_CTRL_RESETTING before queue the reset work. This is not so strict. There could be a big gap before the reset_work callback is invoked. In addition, there is some disable work in the reset_work callback, strictly speaking, not part of reset work, and could lead to some confusion. This patch splits the NVME_CTRL_RESETTING into NVME_CTRL_RESET_PREPARE and NVME_CTRL_RESETTING. Before queue the reset work, changes state to NVME_CTRL_RESET_PREPARE, after disable work completes, changes state to NVME_CTRL_RESETTING. What guarantees that nvme_timeout will do the right thing? If it is relying on the fact that nvme-pci sets the state to RESETTING only _after_ quiescing the requests queues it needs to be documented as its a real delicate dependency. Suggested-by: Christoph Hellwig Signed-off-by: Jianchao Wang --- drivers/nvme/host/core.c | 17 +++-- drivers/nvme/host/fc.c | 2 ++ drivers/nvme/host/nvme.h | 1 + drivers/nvme/host/pci.c| 28 ++-- drivers/nvme/host/rdma.c | 8 drivers/nvme/target/loop.c | 5 + 6 files changed, 53 insertions(+), 8 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 1e46e60..106a437 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -87,7 +87,7 @@ static __le32 nvme_get_log_dw10(u8 lid, size_t size) int nvme_reset_ctrl(struct nvme_ctrl *ctrl) { - if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING)) + if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESET_PREPARE)) return -EBUSY; if (!queue_work(nvme_wq, &ctrl->reset_work)) return -EBUSY; @@ -243,7 +243,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl, break; } break; - case NVME_CTRL_RESETTING: + case NVME_CTRL_RESET_PREPARE: switch (old_state) { case NVME_CTRL_NEW: case NVME_CTRL_LIVE: @@ -253,10 +253,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl, break; } break; + + case NVME_CTRL_RESETTING: + switch (old_state) { + case NVME_CTRL_RESET_PREPARE: + changed = true; + /* FALLTHRU */ + default: + break; + } + break; case NVME_CTRL_RECONNECTING: switch (old_state) { case NVME_CTRL_LIVE: case NVME_CTRL_RESETTING: + case NVME_CTRL_RESET_PREPARE: changed = true; /* FALLTHRU */ default: @@ -267,6 +278,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl, switch (old_state) { case NVME_CTRL_LIVE: case NVME_CTRL_RESETTING: + case NVME_CTRL_RESET_PREPARE: case NVME_CTRL_RECONNECTING: changed = true; /* FALLTHRU */ @@ -2603,6 +2615,7 @@ static ssize_t nvme_sysfs_show_state(struct device *dev, [NVME_CTRL_NEW] = "new", [NVME_CTRL_LIVE]= "live", [NVME_CTRL_RESETTING] = "resetting", + [NVME_CTRL_RESET_PREPARE] = "reset-prepare", [NVME_CTRL_RECONNECTING]= "reconnecting", [NVME_CTRL_DELETING]= "deleting", [NVME_CTRL_DEAD]= "dead", diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c index 794e66e..516c1ea 100644 --- a/drivers/nvme/host/fc.c +++ b/drivers/nvme/host/fc.c @@ -547,6 +547,7 @@ nvme_fc_resume_controller(struct nvme_fc_ctrl *ctrl) break; case NVME_CTRL_RESETTING: + case NVME_CTRL_RESET_PREPARE: /* * Controller is already in the process of terminating the * association. No need to do anything further. The reconnect @@ -790,6 +791,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl) break; case NVME_CTRL_RESETTING: + case NVME_CTRL_RESET_PREPARE: /* * Controller is already in the process of terminating the * association. No need to do anything further. The reconnect James, can you take a look at this? 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index f5800c3..e477c35 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -1141,8 +1141,13 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 csts) bool nssro = dev->subsystem && (csts & NVME_CSTS_NSSRO); /* If there is a reset ongoing, we shouldn't reset again. */ - if (dev->ctrl.state == NVME_CTRL_RESETTING) + switch (dev->ctrl.state) { + case NVME_CTRL_RESETTING: + case
Re: [PATCH V3 2/2] nvme-pci: fix the timeout case when reset is ongoing
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index e477c35..0530432 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -1212,19 +1212,26 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) } /* -* Shutdown immediately if controller times out while starting. The -* reset work will see the pci device disabled when it gets the forced -* cancellation error. All outstanding requests are completed on -* shutdown, so we return BLK_EH_HANDLED. +* There could be two kinds of expired reqs when reset is ongoing. +* - Outstanding IO or admin requests from previous work before the +*nvme_reset_work invokes nvme_dev_disable. Handle them as the +*nvme_cancel_request. +* - Outstanding admin requests from the initializing procedure. +*Set NVME_REQ_CANCELLED flag on them, then nvme_reset_work will +*see the error, then disable the device and remove the ctrl. */ - if (dev->ctrl.state == NVME_CTRL_RESETTING) { - dev_warn(dev->ctrl.device, -"I/O %d QID %d timeout, disable controller\n", -req->tag, nvmeq->qid); - nvme_dev_disable(dev, false); + switch (dev->ctrl.state) { + case NVME_CTRL_RESET_PREPARE: + nvme_req(req)->status = NVME_SC_ABORT_REQ; + return BLK_EH_HANDLED; + case NVME_CTRL_RESETTING: + WARN_ON_ONCE(nvmeq->qid); nvme_req(req)->flags |= NVME_REQ_CANCELLED; return BLK_EH_HANDLED; + default: + break; } + Indentation is off...
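For reference, the same hunk with conventional kernel indentation (content unchanged from the diff above, diff markers dropped):
--
	switch (dev->ctrl.state) {
	case NVME_CTRL_RESET_PREPARE:
		nvme_req(req)->status = NVME_SC_ABORT_REQ;
		return BLK_EH_HANDLED;
	case NVME_CTRL_RESETTING:
		WARN_ON_ONCE(nvmeq->qid);
		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
		return BLK_EH_HANDLED;
	default:
		break;
	}
--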
Re: [PATCH] drviers/nvme/target: fix max dup length for kstrndup
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
- if (nvmeq->sq_cmds_io) - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd)); - else - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd)); + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd)); Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC the _toio() variant enforces alignment, does the copy with 4 byte stores, and has a full barrier after the copy. In comparison our regular memcpy() does none of those things and may use unaligned and vector load/stores. For normal (cacheable) memory that is perfectly fine, but they can cause alignment faults when targeted at MMIO (cache-inhibited) memory. I think in this particular case it might be ok since we know SQEs are aligned to 64 byte boundaries and the copy is too small to use our vectorised memcpy(). I'll assume we don't need explicit ordering between writes of SQEs since the existing code doesn't seem to care unless the doorbell is being rung, so you're probably fine there too. That said, I still think this is a little bit sketchy and at the very least you should add a comment explaining what's going on when the CMB is being used. If someone more familiar with the NVMe driver could chime in I would appreciate it. I may not be understanding the concern, but I'll give it a shot. You're right, the start of any SQE is always 64-byte aligned, so that should satisfy alignment requirements. The order when writing multiple/successive SQEs in a submission queue does matter, and this is currently serialized through the q_lock. The order in which the bytes of a single SQE are written doesn't really matter as long as the entire SQE is written into the CMB prior to writing that SQ's doorbell register. The doorbell register is written immediately after copying a command entry into the submission queue (ignore "shadow buffer" features), so doorbell writes to submitted commands are 1:1. If CMB SQE and DB order is not enforced with the memcpy, then we do need a barrier after the SQE's memcpy and before the doorbell's writel. Keith, while we're on this, regardless of cmb, is SQE memcpy and DB update ordering always guaranteed? If you look at mlx4 (rdma device driver), which works exactly the same way as nvme, you will find: -- qp->sq.head += nreq; /* * Make sure that descriptors are written before * doorbell record. */ wmb(); writel(qp->doorbell_qpn, to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL); /* * Make sure doorbells don't leak out of SQ spinlock * and reach the HCA out of order. */ mmiowb(); -- That seems to explicitly make sure to place a barrier before updating the doorbell. So as I see it, either ordering is guaranteed and the above code is redundant, or nvme needs to do the same. Thoughts?
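To make the question concrete, this is roughly what the nvme submission path would look like with an explicit barrier, mirroring the mlx4 snippet above. It is a sketch, not a claim that the barrier is required; if writel() already orders prior normal-memory stores on the architectures of interest, the wmb() would be redundant:
--
	/* copy the SQE into the (possibly CMB-resident) submission queue */
	memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
	if (++tail == nvmeq->q_depth)
		tail = 0;
	wmb();	/* make the SQE visible to the device before ringing the doorbell */
	writel(tail, nvmeq->q_db);
--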
Re: [PATCH V4 2/5] nvme: add helper interface to flush in-flight requests
On 03/08/2018 08:19 AM, Jianchao Wang wrote: Currently, we use nvme_cancel_request to complete the request forcedly. This has following defects: - It is not safe to race with the normal completion path. blk_mq_complete_request is ok to race with timeout path, but not with itself. - Cannot ensure all the requests have been handled. The timeout path may grab some expired requests, blk_mq_complete_request cannot touch them. add two helper interface to flush in-flight requests more safely. - nvme_abort_requests_sync use nvme_abort_req to timeout all the in-flight requests and wait until timeout work and irq completion path completes. More details please refer to the comment of this interface. - nvme_flush_aborted_requests complete the requests 'aborted' by nvme_abort_requests_sync. It will be invoked after the controller is disabled/shutdown. Signed-off-by: Jianchao Wang --- drivers/nvme/host/core.c | 96 drivers/nvme/host/nvme.h | 4 +- 2 files changed, 99 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 7b8df47..e26759b 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3567,6 +3567,102 @@ void nvme_start_queues(struct nvme_ctrl *ctrl) } EXPORT_SYMBOL_GPL(nvme_start_queues); +static void nvme_abort_req(struct request *req, void *data, bool reserved) +{ + if (!blk_mq_request_started(req)) + return; + + dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device, + "Abort I/O %d", req->tag); + + /* The timeout path need identify this flag and return +* BLK_EH_NOT_HANDLED, then the request will not be completed. +* we will defer the completion after the controller is disabled or +* shutdown. +*/ + set_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags); + blk_abort_request(req); +} + +/* + * This function will ensure all the in-flight requests on the + * controller to be handled by timeout path or irq completion path. + * It has to pair with nvme_flush_aborted_requests which will be + * invoked after the controller has been disabled/shutdown and + * complete the requests 'aborted' by nvme_abort_req. + * + * Note, the driver layer will not be quiescent before disable + * controller, because the requests aborted by blk_abort_request + * are still active and the irq will fire at any time, but it can + * not enter into completion path, because the request has been + * timed out. + */ +void nvme_abort_requests_sync(struct nvme_ctrl *ctrl) +{ + struct nvme_ns *ns; + + blk_mq_tagset_busy_iter(ctrl->tagset, nvme_abort_req, ctrl); + blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_abort_req, ctrl); + /* +* ensure the timeout_work is queued, thus needn't to sync +* the timer +*/ + kblockd_schedule_work(&ctrl->admin_q->timeout_work); + + down_read(&ctrl->namespaces_rwsem); + + list_for_each_entry(ns, &ctrl->namespaces, list) + kblockd_schedule_work(&ns->queue->timeout_work); + + list_for_each_entry(ns, &ctrl->namespaces, list) + flush_work(&ns->queue->timeout_work); + + up_read(&ctrl->namespaces_rwsem); + /* This will ensure all the nvme irq completion path have exited */ + synchronize_sched(); +} +EXPORT_SYMBOL_GPL(nvme_abort_requests_sync); + +static void nvme_comp_req(struct request *req, void *data, bool reserved) Not a very good name... 
+{ + struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data; + + if (!test_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags)) + return; + + WARN_ON(!blk_mq_request_started(req)); + + if (ctrl->tagset && ctrl->tagset->ops->complete) { What happens when this called on the admin tagset when the controller does not have an io tagset? + clear_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags); + /* +* We set the status to NVME_SC_ABORT_REQ, then ioq request +* will be requeued and adminq one will be failed. +*/ + nvme_req(req)->status = NVME_SC_ABORT_REQ; + /* +* For ioq request, blk_mq_requeue_request should be better +* here. But the nvme code will still setup the cmd even if +* the RQF_DONTPREP is set. We have to use .complete to free +* the cmd and then requeue it. IMO, its better to fix nvme to not setup the command if RQF_DONTPREP is on (other than the things it must setup). +* +* For adminq request, invoking .complete directly will miss +* blk_mq_sched_completed_request, but this is ok because we +* won't have io scheduler for adminq. +*/ + ctrl->tagset->ops->complete(req); I don't think that directly calling .complete
Re: [PATCH V10 09/19] block: introduce bio_bvecs()
Not sure I understand the 'blocking' problem in this case. We can build a bvec table from this req, and send them all in send(), I would like to avoid growing bvec tables and keep everything preallocated. Plus, a bvec_iter operates on a bvec which means we'll need a table there as well... Not liking it so far... can this way avoid your blocking issue? You may see this example in branch 'rq->bio != rq->biotail' of lo_rw_aio(). This is exactly an example of not ignoring the bios... If this way is what you need, I think you are right, even we may introduce the following helpers: rq_for_each_bvec() rq_bvecs() I'm not sure how this helps me either. Unless we can set a bvec_iter to span bvecs or have an abstract bio crossing when we re-initialize the bvec_iter I don't see how I can ignore bios completely... So looks nvme-tcp host driver might be the 2nd driver which benefits from multi-page bvec directly. The multi-page bvec V11 has passed my tests and addressed almost all the comments during review on V10. I removed bio_vecs() in V11, but it won't be big deal, we can introduce them anytime when there is the requirement. multipage-bvecs and nvme-tcp are going to conflict, so it would be good to coordinate on this. I think that nvme-tcp host needs some adjustments as setting a bvec_iter. I'm under the impression that the change is rather small and self-contained, but I'm not sure I have the full picture here.
Re: [PATCH] nvme-rdma: complete requests from ->timeout
This does not hold, at least for the NVMe RDMA host driver. An example scenario is when the RDMA connection is gone while the controller is being deleted. In this case, the nvmf_reg_write32() sending the shutdown admin command from delete_work could hang forever if the command is not completed by the timeout handler. If the queue is gone, this means that the queue has already been flushed and any commands that were inflight have completed with a flush error completion... Can you describe the scenario that caused this hang? When did the queue become "gone" and when did the shutdown command execute?
Re: [PATCH] nvme-rdma: complete requests from ->timeout
Could you please take a look at this bug and code review? We are seeing more instances of this bug and found that reconnect_work could hang as well, as can be seen from the stacktrace below. Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma] Call Trace: __schedule+0x2ab/0x880 schedule+0x36/0x80 schedule_timeout+0x161/0x300 ? __next_timer_interrupt+0xe0/0xe0 io_schedule_timeout+0x1e/0x50 wait_for_completion_io_timeout+0x130/0x1a0 ? wake_up_q+0x80/0x80 blk_execute_rq+0x6e/0xa0 __nvme_submit_sync_cmd+0x6e/0xe0 nvmf_connect_admin_queue+0x128/0x190 [nvme_fabrics] ? wait_for_completion_interruptible_timeout+0x157/0x1b0 nvme_rdma_start_queue+0x5e/0x90 [nvme_rdma] nvme_rdma_setup_ctrl+0x1b4/0x730 [nvme_rdma] nvme_rdma_reconnect_ctrl_work+0x27/0x70 [nvme_rdma] process_one_work+0x179/0x390 worker_thread+0x4f/0x3e0 kthread+0x105/0x140 ? max_active_store+0x80/0x80 ? kthread_bind+0x20/0x20 This bug is reproduced by setting the MTU of the RoCE interface to 568 for testing while running I/O traffic. I think that with the latest changes from Keith we can no longer rely on blk-mq to barrier racing completions. We will probably need to barrier ourselves in nvme-rdma...
[PATCH v5 10/13] nvme-tcp: Add protocol header
From: Sagi Grimberg Signed-off-by: Sagi Grimberg --- include/linux/nvme-tcp.h | 189 +++ include/linux/nvme.h | 1 + 2 files changed, 190 insertions(+) create mode 100644 include/linux/nvme-tcp.h diff --git a/include/linux/nvme-tcp.h b/include/linux/nvme-tcp.h new file mode 100644 index ..03d87c0550a9 --- /dev/null +++ b/include/linux/nvme-tcp.h @@ -0,0 +1,189 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * NVMe over Fabrics TCP protocol header. + * Copyright (c) 2018 Lightbits Labs. All rights reserved. + */ + +#ifndef _LINUX_NVME_TCP_H +#define _LINUX_NVME_TCP_H + +#include + +#define NVME_TCP_DISC_PORT 8009 +#define NVME_TCP_ADMIN_CCSZSZ_8K +#define NVME_TCP_DIGEST_LENGTH 4 + +enum nvme_tcp_pfv { + NVME_TCP_PFV_1_0 = 0x0, +}; + +enum nvme_tcp_fatal_error_status { + NVME_TCP_FES_INVALID_PDU_HDR= 0x01, + NVME_TCP_FES_PDU_SEQ_ERR= 0x02, + NVME_TCP_FES_HDR_DIGEST_ERR = 0x03, + NVME_TCP_FES_DATA_OUT_OF_RANGE = 0x04, + NVME_TCP_FES_R2T_LIMIT_EXCEEDED = 0x05, + NVME_TCP_FES_DATA_LIMIT_EXCEEDED= 0x05, + NVME_TCP_FES_UNSUPPORTED_PARAM = 0x06, +}; + +enum nvme_tcp_digest_option { + NVME_TCP_HDR_DIGEST_ENABLE = (1 << 0), + NVME_TCP_DATA_DIGEST_ENABLE = (1 << 1), +}; + +enum nvme_tcp_pdu_type { + nvme_tcp_icreq = 0x0, + nvme_tcp_icresp = 0x1, + nvme_tcp_h2c_term = 0x2, + nvme_tcp_c2h_term = 0x3, + nvme_tcp_cmd= 0x4, + nvme_tcp_rsp= 0x5, + nvme_tcp_h2c_data = 0x6, + nvme_tcp_c2h_data = 0x7, + nvme_tcp_r2t= 0x9, +}; + +enum nvme_tcp_pdu_flags { + NVME_TCP_F_HDGST= (1 << 0), + NVME_TCP_F_DDGST= (1 << 1), + NVME_TCP_F_DATA_LAST= (1 << 2), + NVME_TCP_F_DATA_SUCCESS = (1 << 3), +}; + +/** + * struct nvme_tcp_hdr - nvme tcp pdu common header + * + * @type: pdu type + * @flags: pdu specific flags + * @hlen: pdu header length + * @pdo: pdu data offset + * @plen: pdu wire byte length + */ +struct nvme_tcp_hdr { + __u8type; + __u8flags; + __u8hlen; + __u8pdo; + __le32 plen; +}; + +/** + * struct nvme_tcp_icreq_pdu - nvme tcp initialize connection request pdu + * + * @hdr: pdu generic header + * @pfv: pdu version format + * @hpda: host pdu data alignment (dwords, 0's based) + * @digest:digest types enabled + * @maxr2t:maximum r2ts per request supported + */ +struct nvme_tcp_icreq_pdu { + struct nvme_tcp_hdr hdr; + __le16 pfv; + __u8hpda; + __u8digest; + __le32 maxr2t; + __u8rsvd2[112]; +}; + +/** + * struct nvme_tcp_icresp_pdu - nvme tcp initialize connection response pdu + * + * @hdr: pdu common header + * @pfv: pdu version format + * @cpda: controller pdu data alignment (dowrds, 0's based) + * @digest:digest types enabled + * @maxdata: maximum data capsules per r2t supported + */ +struct nvme_tcp_icresp_pdu { + struct nvme_tcp_hdr hdr; + __le16 pfv; + __u8cpda; + __u8digest; + __le32 maxdata; + __u8rsvd[112]; +}; + +/** + * struct nvme_tcp_term_pdu - nvme tcp terminate connection pdu + * + * @hdr: pdu common header + * @fes: fatal error status + * @fei: fatal error information + */ +struct nvme_tcp_term_pdu { + struct nvme_tcp_hdr hdr; + __le16 fes; + __le32 fei; + __u8rsvd[8]; +}; + +/** + * struct nvme_tcp_cmd_pdu - nvme tcp command capsule pdu + * + * @hdr: pdu common header + * @cmd: nvme command + */ +struct nvme_tcp_cmd_pdu { + struct nvme_tcp_hdr hdr; + struct nvme_command cmd; +}; + +/** + * struct nvme_tcp_rsp_pdu - nvme tcp response capsule pdu + * + * @hdr: pdu common header + * @hdr: nvme-tcp generic header + * @cqe: nvme completion queue entry + */ +struct nvme_tcp_rsp_pdu { + struct nvme_tcp_hdr hdr; + struct nvme_completion cqe; +}; + +/** + * struct nvme_tcp_r2t_pdu - nvme tcp ready-to-transfer 
pdu + * + * @hdr: pdu common header + * @command_id:nvme command identifier which this relates to + * @ttag: transfer tag (controller generated) + * @r2t_offset:offset from the start of the command data + * @r2t_length:length the host is allowed to send + */ +struct nvme_tcp_r2t_pdu { +
[PATCH v5 03/13] iov_iter: pass void csum pointer to csum_and_copy_to_iter
From: Sagi Grimberg The single caller to csum_and_copy_to_iter is skb_copy_and_csum_datagram and we are trying to unite its logic with skb_copy_datagram_iter by passing a callback to the copy function that we want to apply. Thus, we need to make the checksum pointer private to the function. Acked-by: David S. Miller Signed-off-by: Sagi Grimberg --- include/linux/uio.h | 2 +- lib/iov_iter.c | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/include/linux/uio.h b/include/linux/uio.h index 55ce99ddb912..41d1f8d3313d 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -266,7 +266,7 @@ static inline void iov_iter_reexpand(struct iov_iter *i, size_t count) { i->count = count; } -size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum, struct iov_iter *i); +size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump, struct iov_iter *i); size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i); bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i); diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 7ebccb5c1637..db93531ca3e3 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -1432,10 +1432,11 @@ bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum, } EXPORT_SYMBOL(csum_and_copy_from_iter_full); -size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum, +size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump, struct iov_iter *i) { const char *from = addr; + __wsum *csum = csump; __wsum sum, next; size_t off = 0; sum = *csum; -- 2.17.1
[PATCH v5 02/13] datagram: open-code copy_page_to_iter
From: Sagi Grimberg This will be useful to consolidate skb_copy_and_hash_datagram_iter and skb_copy_and_csum_datagram to a single code path. Acked-by: David S. Miller Signed-off-by: Sagi Grimberg --- net/core/datagram.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/net/core/datagram.c b/net/core/datagram.c index 57f3a6fcfc1e..abe642181b64 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -445,11 +445,14 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset, end = start + skb_frag_size(frag); if ((copy = end - offset) > 0) { + struct page *page = skb_frag_page(frag); + u8 *vaddr = kmap(page); + if (copy > len) copy = len; - n = copy_page_to_iter(skb_frag_page(frag), - frag->page_offset + offset - - start, copy, to); + n = copy_to_iter(vaddr + frag->page_offset + +offset - start, copy, to); + kunmap(page); offset += n; if (n != copy) goto short_copy; -- 2.17.1
Re: [PATCH v4 1/1] nvme: trace: add disk name to tracepoints
Looks good (both #1, if we want to use the two-patch version, and #2). I have no idea what I was trying to do when I tried it. You want to submit or should I?
Re: [PATCH v4 1/1] nvme: trace: add disk name to tracepoints
You want to submit or should I? Whatever is easier for you. You're the owner, so you should go ahead with it. Thanks!
Re: [PATCH] nvme: trace: add disk name to tracepoints
Not related to your patch, but I did notice that the req->q->id isn't really useful here since that's not the hardware context identifier. That's just some ida-assigned software identifier. For the admin command completion trace, it's actually a little confusing to see the qid in the trace. It was actually requested by Martin so we can easily see which request got dispatched/completed on which request queue. It would be good in the future to display the hwqid, but we'll need some work before we can do that. Maybe change the display to request_queue_id=%d?
Re: [PATCH v4 1/2] nvme: cache struct nvme_ctrl reference to struct nvme_request
@@ -652,6 +653,7 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req, } cmd->common.command_id = req->tag; + nvme_req(req)->ctrl = ctrl; if (ns) trace_nvme_setup_nvm_cmd(req->q->id, cmd); else I don't think we need to do this per-io. The request coming from the controller's tagset, so we can do this just once at .init_request time, right? Agreed, but as it's Sagi's patch I'd like to hear his opinion on this before changing it. That's a good idea! seems to work... Johannes, can you respin with this version? Please CC james and keith on it too. -- nvme: cache struct nvme_ctrl reference to struct nvme_request We will need to reference the controller in the setup and completion time for tracing and future traffic based keep alive support. Signed-off-by: Sagi Grimberg diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c index 41d45a1b5c62..9cc33752539a 100644 --- a/drivers/nvme/host/fc.c +++ b/drivers/nvme/host/fc.c @@ -1737,6 +1737,7 @@ nvme_fc_init_request(struct blk_mq_tag_set *set, struct request *rq, int queue_idx = (set == &ctrl->tag_set) ? hctx_idx + 1 : 0; struct nvme_fc_queue *queue = &ctrl->queues[queue_idx]; + nvme_req(rq)->ctrl = &ctrl->ctrl; return __nvme_fc_init_request(ctrl, queue, op, rq, queue->rqcnt++); } diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 0c4a33df3b2f..f2249387b60d 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -102,6 +102,7 @@ struct nvme_request { u8 retries; u8 flags; u16 status; + struct nvme_ctrl*ctrl; }; /* diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index ba943f211687..d2bd60d8315a 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -418,6 +418,7 @@ static int nvme_init_request(struct blk_mq_tag_set *set, struct request *req, BUG_ON(!nvmeq); iod->nvmeq = nvmeq; + nvme_req(req)->ctrl = &dev->ctrl; return 0; } diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 20db67d53ffa..e4087caf8754 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -292,6 +292,7 @@ static int nvme_rdma_init_request(struct blk_mq_tag_set *set, return ret; req->queue = queue; + nvme_req(rq)->ctrl = &ctrl->ctrl; return 0; } diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c index d8d91f04bd7e..af7fbf4132b0 100644 --- a/drivers/nvme/target/loop.c +++ b/drivers/nvme/target/loop.c @@ -227,6 +227,7 @@ static int nvme_loop_init_request(struct blk_mq_tag_set *set, { struct nvme_loop_ctrl *ctrl = set->driver_data; + nvme_req(req)->ctrl = &ctrl->ctrl; return nvme_loop_init_iod(ctrl, blk_mq_rq_to_pdu(req), (set == &ctrl->tag_set) ? hctx_idx + 1 : 0); } --
Re: [PATCH] nvme: trace: add disk name to tracepoints
Thinking more on this, not using the hw qid really limits the utility out of using these trace events: We may not be able to match a completion to the submission without it since cmdid alone isn't enough to match up the two events. Here's an updated proposal and actually tested. I was also able to combine admin and io submissions. This looks good! One thing that is missing is the controller id when we have multiple controllers/subsystems in the host like: TP_printk("nvme%d: qid=%d, nsid=%u, cmdid=%u, flags=0x%x, meta=0x%llx, cmd=(%s %s)", But Johannes can add it I guess.
Re: [PATCH v4 1/1] nvme: trace: add disk name to tracepoints
On 06/11/2018 09:46 PM, Johannes Thumshirn wrote: Add disk name to tracepoints so we can better distinguish between individual disks in the trace output. Signed-off-by: Johannes Thumshirn --- I'm not entirely sure if adding the ctrl pointers to nvme_complete_rq() and nvme_setup_cmd() is a good idea, it's the fast path after all. We are going to need it for traffic-based keep alive to note that we saw a completion and extend the kato. But I suggest you simply keep a ctrl reference in struct nvme_request instead so you don't need to pass it to nvme_complete_req (that's what I did for traffic-based keep alive).
Re: [PATCH 0/3] Provide more fine grained control over multipathing
Hi Folks, I'm sorry to chime in super late on this, but a lot has been going on for me lately which got me off the grid. So I'll try to provide my input hopefully without starting any more flames.. This patch series aims to provide a more fine grained control over nvme's native multipathing, by allowing it to be switched on and off on a per-subsystem basis instead of a big global switch. No. The only reason we even allowed to turn multipathing off is because you complained about installer issues. The path forward clearly is native multipathing and there will be no additional support for the use cases of not using it. We all basically knew this would be your position. But at this year's LSF we pretty quickly reached consensus that we do in fact need this. Except for yourself, Sagi and afaik Martin George: all on the cc were in attendance and agreed. Correction, I wasn't able to attend LSF this year (unfortunately). And since then we've exchanged mails to refine and test Johannes' implementation. You've isolated yourself on this issue. Please just accept that we all have a pretty solid command of what is needed to properly provide commercial support for NVMe multipath. The ability to switch between "native" and "other" multipath absolutely does _not_ imply anything about the winning disposition of native vs other. It is purely about providing commercial flexibility to use whatever solution makes sense for a given environment. The default _is_ native NVMe multipath. It is on userspace solutions for "other" multipath (e.g. multipathd) to allow user's to whitelist an NVMe subsystem to be switched to "other". Hopefully this clarifies things, thanks. Mike, I understand what you're saying, but I also agree with hch on the simple fact that this is a burden on linux nvme (although less passionate about it than hch). Beyond that, this is going to get much worse when we support "dispersed namespaces" which is a submitted TPAR in the NVMe TWG. "dispersed namespaces" makes NVMe namespaces share-able over different subsystems so changing the personality on a per-subsystem basis is just asking for trouble. Moreover, I also wanted to point out that fabrics array vendors are building products that rely on standard nvme multipathing (and probably multipathing over dispersed namespaces as well), and keeping a knob that will keep nvme users with dm-multipath will probably not help them educate their customers as well... So there is another angle to this. Don't get me wrong, I do support your cause, and I think nvme should try to help, I just think that subsystem granularity is not the correct approach going forward. As I said, I've been off the grid, can you remind me why global knob is not sufficient? This might sound stupid to you, but can't users that desperately must keep using dm-multipath (for its mature toolset or what-not) just stack it on multipath nvme device? (I might be completely off on this so feel free to correct my ignorance).
Re: [PATCH] nvme: trace: add disk name to tracepoints
Add disk name to tracepoints so we can better destinguish between individual disks in the trace output. Looks useful, Reviewed-by: Sagi Grimberg
Re: [PATCH -next] nvmet: fix a typo in nvmet_file_ns_enable()
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/3] nvme: provide a way to disable nvme mpath per subsystem
@@ -246,3 +246,27 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head) blk_cleanup_queue(head->disk->queue); put_disk(head->disk); } + +int nvme_mpath_change_personality(struct nvme_subsystem *subsys) +{ + struct nvme_ctrl *ctrl; + int ret = 0; + +restart: + mutex_lock(&subsys->lock); + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) { + if (!list_empty(&ctrl->namespaces)) { + mutex_unlock(&subsys->lock); + nvme_remove_namespaces(ctrl); This looks completely broken. Any of these namespaces can have an active handle on it.
Re: [PATCH 0/3] Provide more fine grained control over multipathing
Wouldn't expect you guys to nurture this 'mpath_personality' knob. SO when features like "dispersed namespaces" land a negative check would need to be added in the code to prevent switching from "native". And once something like "dispersed namespaces" lands we'd then have to see about a more sophisticated switch that operates at a different granularity. Could also be that switching one subsystem that is part of "dispersed namespaces" would then cascade to all other associated subsystems? Not that dissimilar from the 3rd patch in this series that allows a 'device' switch to be done in terms of the subsystem. Which I think is broken by allowing to change this personality on the fly. Anyway, I don't know the end from the beginning on something you just told me about ;) But we're all in this together. And can take it as it comes. I agree but this will be exposed to user-space and we will need to live with it for a long long time... I'm merely trying to bridge the gap from old dm-multipath while native NVMe multipath gets its legs. In time I really do have aspirations to contribute more to NVMe multipathing. I think Christoph's NVMe multipath implementation of bio-based device ontop on NVMe core's blk-mq device(s) is very clever and effective (blk_steal_bios() hack and all). That's great. Don't get me wrong, I do support your cause, and I think nvme should try to help, I just think that subsystem granularity is not the correct approach going forward. I understand there will be limits to this 'mpath_personality' knob's utility and it'll need to evolve over time. But the burden of making more advanced NVMe multipath features accessible outside of native NVMe isn't intended to be on any of the NVMe maintainers (other than maybe remembering to disallow the switch where it makes sense in the future). I would expect that any "advanced multipath features" would be properly brought up with the NVMe TWG as a ratified standard and find its way to nvme. So I don't think this particularly is a valid argument. As I said, I've been off the grid, can you remind me why global knob is not sufficient? Because once nvme_core.multipath=N is set: native NVMe multipath is then not accessible from the same host. The goal of this patchset is to give users choice. But not limit them to _only_ using dm-multipath if they just have some legacy needs. Tough to be convincing with hypotheticals but I could imagine a very obvious usecase for native NVMe multipathing be PCI-based embedded NVMe "fabrics" (especially if/when the numa-based path selector lands). But the same host with PCI NVMe could be connected to a FC network that has historically always been managed via dm-multipath.. but say that FC-based infrastructure gets updated to use NVMe (to leverage a wider NVMe investment, whatever?) -- but maybe admins would still prefer to use dm-multipath for the NVMe over FC. You are referring to an array exposing media via nvmf and scsi simultaneously? I'm not sure that there is a clean definition of how that is supposed to work (ANA/ALUA, reservations, etc..) This might sound stupid to you, but can't users that desperately must keep using dm-multipath (for its mature toolset or what-not) just stack it on multipath nvme device? (I might be completely off on this so feel free to correct my ignorance). We could certainly pursue adding multipath-tools support for native NVMe multipathing. Not opposed to it (even if just reporting topology and state). 
But given the extensive lengths NVMe multipath goes to hide devices we'd need some way to piercing through the opaque nvme device that native NVMe multipath exposes. But that really is a tangent relative to this patchset. Since that kind of visibility would also benefit the nvme cli... otherwise how are users to even be able to trust but verify native NVMe multipathing did what it expected it to? Can you explain what is missing for multipath-tools to resolve topology? nvme list-subsys is doing just that, doesn't it? It lists subsys-ctrl topology but that is sort of the important information as controllers are the real paths.
Re: [PATCH 0/3] Provide more fine grained control over multipathing
Moreover, I also wanted to point out that fabrics array vendors are building products that rely on standard nvme multipathing (and probably multipathing over dispersed namespaces as well), and keeping a knob that will keep nvme users with dm-multipath will probably not help them educate their customers as well... So there is another angle to this. Noticed I didn't respond directly to this aspect. As I explained in various replies to this thread: The users/admins would be the ones who would decide to use dm-multipath. It wouldn't be something that'd be imposed by default. If anything, the all-or-nothing nvme_core.multipath=N would pose a much more serious concern for these array vendors that do have designs to specifically leverage native NVMe multipath. Because if users were to get into the habit of setting that on the kernel commandline they'd literally _never_ be able to leverage native NVMe multipathing. We can also add multipath.conf docs (man page, etc) that caution admins to consult their array vendors about whether using dm-multipath is to be avoided, etc. Again, this is opt-in, so on a upstream Linux kernel level the default of enabling native NVMe multipath stands (provided CONFIG_NVME_MULTIPATH is configured). Not seeing why there is so much angst and concern about offering this flexibility via opt-in but I'm also glad we're having this discussion to have our eyes wide open. I think that the concern is valid and should not be dismissed. And at times flexibility is a real source of pain, both to users and developers. The choice is there, no one is forbidden to use multipath. I'm just still not sure exactly why the subsystem granularity is an absolute must other than a volume exposed as a nvmf namespace and scsi lun (how would dm-multipath detect this is the same device btw?)
Re: [PATCHv2 1/5] nvme: fix lockdep warning in nvme_mpath_clear_current_path
Reviewed-by: Sagi Grimberg
Re: [PATCH 2/5] nvme: don't hold nvmf_transports_rwsem for more than transport lookups
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 7ae732a77fe8..febf82639b40 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -957,16 +957,17 @@ nvmf_create_ctrl(struct device *dev, const char *buf, size_t count) down_read(&nvmf_transports_rwsem); ops = nvmf_lookup_transport(opts); + up_read(&nvmf_transports_rwsem); And what protects us from the transport getting unregistered right here by another thread waiting to acquire nvmf_transports_rwsem? I think having the module_get inside as well would protect against it.
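Something like this is what I mean; a sketch that assumes the ops->module field and the function's existing error unwinding:
--
	down_read(&nvmf_transports_rwsem);
	ops = nvmf_lookup_transport(opts);
	if (ops && !try_module_get(ops->module))
		ops = NULL;	/* transport is being unregistered/unloaded */
	up_read(&nvmf_transports_rwsem);

	if (!ops) {
		ret = -EINVAL;
		goto out_free_opts;
	}
--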
Re: [PATCH 3/5] nvme: call nvmf_create_ctrl before checking for duplicate assignment
In general it seems like fc loop needs to offload any I/O to a workqueue just like nvme-loop does, but even then I can't see how that is going to cause an issue in this area. Hmm I'll be looking into it. FWIW, I agree it should do that as well.
Re: [PATCH 4/5] nvmet: use atomic allocations when allocating fc requests
diff --git a/drivers/nvme/target/fcloop.c b/drivers/nvme/target/fcloop.c
index 34712def81b1..d2209c60f95f 100644
--- a/drivers/nvme/target/fcloop.c
+++ b/drivers/nvme/target/fcloop.c
@@ -509,7 +509,7 @@ fcloop_fcp_req(struct nvme_fc_local_port *localport,
 	if (!rport->targetport)
 		return -ECONNREFUSED;
 
-	tfcp_req = kzalloc(sizeof(*tfcp_req), GFP_KERNEL);
+	tfcp_req = kzalloc(sizeof(*tfcp_req), GFP_ATOMIC);
 	if (!tfcp_req)
 		return -ENOMEM;

Question: why isn't tfcp_req embedded in fcpreq? Don't they have the same lifetime?
Re: [PATCH 5/5] nvmet: fcloop: use irqsave spinlocks
Lockdep complains about inconsistent hardirq states when using nvme-fcloop. So use the irqsave variant of spin_locks for the nvmefc_fcp_req's reqlock. Is this because of the nvmet bdev can trigger rx path in interrupt context? Don't understand exactly what is violated. Here's the lockdep report: [ 11.138417] [ 11.139085] WARNING: inconsistent lock state [ 11.139207] nvmet: creating controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:046394e7-bff7-4403-9502-1816800efaa0. [ 11.139727] 4.17.0-rc5+ #883 Not tainted [ 11.139732] [ 11.11] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage. [ 11.145341] swapper/8/0 [HC1[1]:SC0[0]:HE0:SE1] takes: [ 11.146108] (ptrval) (&(&tfcp_req->reqlock)->rlock){?.+.}, at: fcloop_fcp_op+0x20/0x1d0 [nvme_fcloop] [ 11.148633] {HARDIRQ-ON-W} state was registered at: [ 11.149380] _raw_spin_lock+0x34/0x70 [ 11.149940] fcloop_fcp_recv_work+0x17/0x90 [nvme_fcloop] [ 11.150746] process_one_work+0x1d8/0x5c0 [ 11.151346] worker_thread+0x45/0x3e0 [ 11.151886] kthread+0x101/0x140 [ 11.152370] ret_from_fork+0x3a/0x50 [ 11.152903] irq event stamp: 3 [ 11.153402] hardirqs last enabled at (36663): [] default_idle+0x13/0x180 [ 11.154601] hardirqs last disabled at (36664): [] interrupt_entry+0xc4/0xe0 [ 11.155817] softirqs last enabled at (3): [] irq_enter+0x59/0x60 [ 11.156952] softirqs last disabled at (36665): [] irq_enter+0x3e/0x60 [ 11.158092] [ 11.158092] other info that might help us debug this: [ 11.159436] Possible unsafe locking scenario: [ 11.159436] [ 11.160299]CPU0 [ 11.160663] [ 11.161026] lock(&(&tfcp_req->reqlock)->rlock); [ 11.161709] [ 11.162091] lock(&(&tfcp_req->reqlock)->rlock); [ 11.163148] [ 11.163148] *** DEADLOCK *** [ 11.163148] [ 11.164007] no locks held by swapper/8/0. [ 11.164596] [ 11.164596] stack backtrace: [ 11.165238] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.17.0-rc5+ #883 [ 11.166180] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 [ 11.167673] Call Trace: [ 11.168037] [ 11.168349] dump_stack+0x78/0xb3 [ 11.168864] print_usage_bug+0x1ed/0x1fe [ 11.169440] ? check_usage_backwards+0x110/0x110 [ 11.170111] mark_lock+0x263/0x2a0 [ 11.170995] __lock_acquire+0x675/0x1870 [ 11.171865] ? __lock_acquire+0x48d/0x1870 [ 11.172460] lock_acquire+0xd4/0x220 [ 11.172981] ? fcloop_fcp_op+0x20/0x1d0 [nvme_fcloop] [ 11.173709] _raw_spin_lock+0x34/0x70 [ 11.174243] ? 
fcloop_fcp_op+0x20/0x1d0 [nvme_fcloop] [ 11.174978] fcloop_fcp_op+0x20/0x1d0 [nvme_fcloop] [ 11.175673] nvmet_fc_transfer_fcp_data+0x9b/0x130 [nvmet_fc] [ 11.176507] nvmet_req_complete+0x10/0x110 [nvmet] [ 11.177210] nvmet_bio_done+0x23/0x40 [nvmet] [ 11.177837] blk_update_request+0xab/0x3b0 [ 11.178434] blk_mq_end_request+0x13/0x60 [ 11.179033] flush_smp_call_function_queue+0x58/0x150 [ 11.179755] smp_call_function_single_interrupt+0x49/0x260 [ 11.180531] call_function_single_interrupt+0xf/0x20 [ 11.181236] [ 11.181542] RIP: 0010:native_safe_halt+0x2/0x10 [ 11.182186] RSP: 0018:a481006cbec0 EFLAGS: 0206 ORIG_RAX: ff04 [ 11.183265] RAX: 9f54f8b86200 RBX: 9f54f8b86200 RCX: [ 11.184264] RDX: 9f54f8b86200 RSI: 0001 RDI: 9f54f8b86200 [ 11.185270] RBP: 0008 R08: R09: [ 11.186271] R10: 0001 R11: R12: [ 11.187280] R13: R14: 9f54f8b86200 R15: 9f54f8b86200 [ 11.188280] default_idle+0x18/0x180 [ 11.188806] do_idle+0x176/0x260 [ 11.189273] cpu_startup_entry+0x5a/0x60 [ 11.189832] start_secondary+0x18f/0x1b0 Signed-off-by: Johannes Thumshirn Cc: James Smart --- drivers/nvme/target/fcloop.c | 52 1 file changed, 29 insertions(+), 23 deletions(-) diff --git a/drivers/nvme/target/fcloop.c b/drivers/nvme/target/fcloop.c index d2209c60f95f..f4a3700201c1 100644 --- a/drivers/nvme/target/fcloop.c +++ b/drivers/nvme/target/fcloop.c @@ -406,9 +406,10 @@ fcloop_fcp_recv_work(struct work_struct *work) container_of(work, struct fcloop_fcpreq, fcp_rcv_work); struct nvmefc_fcp_req *fcpreq = tfcp_req->fcpreq; int ret = 0; + unsigned long flags; bool aborted = false; - spin_lock(&tfcp_req->reqlock); + spin_lock_irqsave(&tfcp_req->reqlock, flags); switch (tfcp_req->inistate) { case INI_IO_START: tfcp_req->inistate = INI_IO_ACTIVE; @@ -417,11 +418,11 @@ fcloop_fcp_recv_work(struct work_struct *work) aborted = true; break; default: - spin_unlock(&tfcp_req->reqlock); +
Re: [PATCH -next] nvmet: fix error return code in nvmet_file_ns_enable()
Reviewed-by: Sagi Grimberg
Re: [PATCH 0/3] Provide more fine grained control over multipathing
I'm aware that most everything in multipath.conf is SCSI/FC specific. That isn't the point. dm-multipath and multipathd are an existing framework for managing multipath storage. It could be made to work with NVMe. But yes, it would not be easy. Especially not with the native NVMe multipath crew being so damn hostile.

The resistance is not a hostile act. Please try and keep the discussion technical. But I don't think the burden of allowing multipathd/DM to inject themselves into the path transition state machine has any benefit whatsoever to the user. It's only complicating things and therefore we'd be doing people a disservice rather than a favor.

This notion that only native NVMe multipath can be successful is utter bullshit. And the mere fact that I've gotten such a reaction from a select few speaks to some serious control issues. Imagine if XFS developers just one day imposed that it is the _only_ filesystem that can be used on persistent memory. Just please dial it back.. seriously tiresome.

Mike, you make a fair point on multipath-tools being more mature compared to NVMe multipathing. But this is not the discussion at all (at least not from my perspective). There was not a single use-case that gave a clear-cut justification for a per-subsystem personality switch (other than some far-fetched imaginary scenarios). It is not unusual for the kernel community to not accept things that have little to no use, especially when it involves exposing a userspace ABI. As for now, all I see is a disclaimer saying that it'd need to be nurtured over time as the NVMe spec evolves. Can you (or others) please try and articulate why a "fine grained" multipathing is an absolute must? At the moment, I just don't understand. Also, I get your point that exposing state/stats information to userspace is needed. That's a fair comment.
Re: [PATCH v2] Bugfix for handling of shadow doorbell buffer.
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 17a0190bd88f..4452f8553301 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -306,6 +306,14 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, u32 *dbbuf_db,
 		old_value = *dbbuf_db;
 		*dbbuf_db = value;
 
+		/*
+		 * Ensure that the doorbell is updated before reading
+		 * the EventIdx from memory. NVMe controller should have
+		 * similar ordering guarantees - update EventIdx before
+		 * reading doorbell.
+		 */
+		mb();
+
 		if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
 			return false;
 	}

Reviewed-by: Sagi Grimberg
Re: [PATCH] scatterlist: Update size type to support greater then 4GB size.
struct nvme_sgl_desc {
	__le64	addr;
-	__le32	length;
+	__le64	length;
	__u8	rsvd[3];
	__u8	type;
};

In what world does changing a wire protocol for this make sense? Please get rid of this hunk; NVMe will never cross the 32-bit sg element size.
Re: [PATCH] scatterlist: Update size type to support greater then 4GB size.
struct nvme_sgl_desc {
	__le64	addr;
-	__le32	length;
+	__le64	length;
	__u8	rsvd[3];
	__u8	type;
};

Isn't this a device- or protocol-defined data structure? You can't just change it like this. You're correct, we can't... [Replied before seeing this issue was already highlighted] The positive side is that it can safely be removed without affecting the rest of the patch...
Re: [PATCH v4 2/2] trace nvme submit queue status
@@ -899,6 +900,10 @@ static inline void nvme_handle_cqe(struct nvme_queue *nvmeq, u16 idx)
 	}
 
 	req = blk_mq_tag_to_rq(*nvmeq->tags, cqe->command_id);
+	trace_nvme_sq(req->rq_disk,
+		nvmeq->qid,
+		le16_to_cpu(cqe->sq_head),
+		nvmeq->sq_tail);

Why the newline escapes? Why not break at the 80-character border? Other than that, looks fine,

Reviewed-by: Sagi Grimberg
Re: [PATCH v4 1/2] export trace.c helper functions to other modules
Reviewed-by: Sagi Grimberg
Re: [PATCH][next] nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
Reviewed-by: Sagi Grimberg
Re: [PATCH] nvme-rdma: complete requests from ->timeout
I cannot reproduce the bug with the patch; in my failure scenarios, it seems that completing the request on errors in nvme_rdma_send_done causes __nvme_submit_sync_cmd to be unblocked. Also, I think this is safe from double completions. However, it seems that the nvme_rdma_timeout code is still not free from the double completion problem. So, it looks promising to me if you could separate out the nvme_rdma_wr_error handling code as a new patch.

Guys, can you please send proper patches so we can review properly?
Re: [PATCH] nvme-core: don't initlialize ctrl->cntlid twice
Reviewed-by: Sagi Grimberg
Re: [PATCH 06/16] blk-mq: add 'type' attribute to the sysfs hctx directory
It can be useful for a user to verify what type a given hardware queue is, so expose this information in sysfs. I assume the user is expected to understand the meaning of the enumeration they see when accessing this field, correct? Would be nice to output some meaningful string instead, but I'm not yet sure how to do that... Other than that, Reviewed-by: Sagi Grimberg
Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
I also commented over there regarding this specific problem. I think my wording was inaccurate, perhaps. It's not so much direct recursion, but a dependency chain connecting the handler_mutex class back to itself. This is how lockdep finds problems; it doesn't consider actual *instances* but only *classes*. As Sagi mentioned on the other thread, in this specific instance you happen to know out-of-band that this is safe, (somehow) the code ensures that you can't actually recursively connect back to yourself. Perhaps there's some kind of layering involved, I don't know. (*) As I said over there, this type of thing can be solved with mutex_lock_nested(), though it's possible that in this case it should really be flush_workqueue_nested() or flush_work_nested() or so. I guess that could make this specific complaint go away... johannes (*) note that just checking "obj != self" might not be enough; depending on how you pick candidates for "obj" you might recurse down another level and end up back at yourself, just two levels down, and actually have a dependency chain leading back to yourself. It's not a simple check.

Any removal invocation is serialized under a dedicated workqueue, and in order to be queued there, the connection needs to first be connected, which comes later (after the flush takes place).
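For illustration only (not the actual rdma_cm code, and the function/field names here are assumptions): the mutex_lock_nested() style annotation Johannes mentions above is how a caller tells lockdep that taking a second lock of the same class is intentional, provided the caller can guarantee it is a different instance:

--
static void listen_handler(struct rdma_id_private *listen_id,
			   struct rdma_id_private *conn_id)
{
	/*
	 * Both IDs have a handler_mutex of the same lock class.  We know
	 * conn_id is a different instance than listen_id, so annotate the
	 * inner acquisition with SINGLE_DEPTH_NESTING to keep lockdep quiet.
	 */
	mutex_lock(&listen_id->handler_mutex);
	mutex_lock_nested(&conn_id->handler_mutex, SINGLE_DEPTH_NESTING);

	/* ... deliver the event to the new connection ... */

	mutex_unlock(&conn_id->handler_mutex);
	mutex_unlock(&listen_id->handler_mutex);
}
--

The hypothetical flush_workqueue_nested()/flush_work_nested() variants would express the same idea for the flush dependency rather than for a lock acquisition.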
Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"
I must also say that I'm disappointed you'd try to do things this way. I'd be (have been?) willing to actually help you understand the problem and add the annotations, but rather than answer my question ("where do I find the right git tree"!) you just send a revert patch. Sorry that I had not yet provided that information. You should have received this information through another e-mail thread. See also http://lists.infradead.org/pipermail/linux-nvme/2018-October/020493.html. To do that, you have to understand what recursion is valid (I'm guessing there's some sort of layering involved), and I'm far from understanding anything about the code that triggered this report. I don't think there is any kind of recursion involved in the NVMe code that triggered the lockdep complaint. Sagi, please correct me if I got this wrong. I commented on the original thread. I'm not sure it qualifies as a recursion, but in that use-case, when priv->handler_mutex is taken it is possible that other priv->handler_mutex instances are taken but are guaranteed not to belong to that priv...
Re: [PATCH net-next v12 04/14] mm: page_frag: add '_va' suffix to page_frag API
Regardless of the API discussion, The nvme-tcp bits look straight-forward: Acked-by: Sagi Grimberg
Re: [PATCH] nvme-tcp: Check if request has started before processing it
blk_mq_tag_to_rq() will always return a request if the command_id is in the valid range. Check if the request has been started. If we blindly process the request we might double complete a request, which can be fatal.

How did you get to this one? Did the controller send a completion for a completed/bogus request? If that is the case, then that must mean it's possible the driver could have started the command id just before the bogus completion check. Data corruption, right?

Yes, which is why I don't think this check is very useful..

I actually view that as a valid protection against spoofed frames. Without it it's easy to crash the machine by injecting fake completions with random command ids.

And this doesn't help because the command can have been easily reused and started... What is this protecting against? Note that none of the other transports checks that, why should tcp?
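For context, a rough sketch of the check being debated; this is simplified, and the helper names are approximated from the nvme-tcp completion path rather than quoted from the actual patch:

--
	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
	if (!rq || !blk_mq_request_started(rq)) {
		/*
		 * The controller sent a completion for a command id that is
		 * not outstanding; drop it rather than completing a request
		 * that was never started (or was already completed).
		 */
		dev_err(queue->ctrl->ctrl.device,
			"got bad cqe.command_id %#x on queue %d\n",
			cqe->command_id, nvme_tcp_queue_id(queue));
		return -EINVAL;
	}
--

The objection above is that the window remains: the request could have been completed and restarted with the same command id just before the check, so the check narrows but does not close the race.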
Re: [PATCH] nvme-tcp: Check if request has started before processing it
blk_mq_tag_to_rq() will always return a request if the command_id is in the valid range. Check if the request has been started. If we blindly process the request we might double complete a request, which can be fatal.

How did you get to this one? Did the controller send a completion for a completed/bogus request? If that is the case, then that must mean it's possible the driver could have started the command id just before the bogus completion check. Data corruption, right?

'during TCP LIF toggles and aggr relocates' testing the host crashes. TBH, I do not really know what is happening or what the test does. Still trying to figure out what's going on.

Well, I think we should probably figure out why that is happening first. I was just very surprised how much the code trusts the other side to behave correctly. What do pci/rdma/fc do differently? What does scsi do here?

Yes, which is why I don't think this check is very useful..

I actually view that as a valid protection against spoofed frames. Without it it's easy to crash the machine by injecting fake completions with random command ids.

In this test scenario it's not even a spoofed frame; maybe just a confused controller.

Maybe... I am still not sure how this patch helps here though...
Re: [target:nvme_of 3/3] drivers/target/nvme_of/nvme_of_configfs.c:25:31: sparse: symbol 'nvme_of_fabric_configfs' was not declared. Should it be static?
On 1/7/2015 12:44 AM, Nicholas A. Bellinger wrote: On Wed, 2015-01-07 at 03:35 +0800, kbuild test robot wrote: tree: git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git nvme_of head: 40d8c11927282d59855d645b35798edd97828da5 commit: 40d8c11927282d59855d645b35798edd97828da5 [3/3] nvme_of: Initial skeleton commit reproduce: # apt-get install sparse git checkout 40d8c11927282d59855d645b35798edd97828da5 make ARCH=x86_64 allmodconfig make C=1 CF=-D__CHECK_ENDIAN__ sparse warnings: (new ones prefixed by >>) drivers/target/nvme_of/nvme_of_configfs.c:25:31: sparse: symbol 'nvme_of_fabric_configfs' was not declared. Should it be static? Please review and possibly fold the followup patch. Fixed. Thanks Fengguang!

Hey Nic,

Hope all is well. So this skeleton is interesting to me. As this is a fabric module I assume this would be the NVMEoFabrics target mode driver for upstream which would be able to talk to an NVMEoFabrics initiator (which by the way makes LIO something more than a SCSI target - will this be accepted?). I'm currently participating in the nvmexpress working group defining the standard of the protocol, specifically in the context of RDMA (naturally). I'm interested in taking an active role in this project. It is important to build this layered from day one - separating the fabric logic (RDMA or FCoE) from the core layer. Moreover, I know that Mellanox has some plans on accelerating this area in future devices and we are currently looking into ways to do that. I want to see that the SW will be able to support that.

So, Thoughts? Sagi.
Re: [target:nvme_of 3/3] drivers/target/nvme_of/nvme_of_configfs.c:25:31: sparse: symbol 'nvme_of_fabric_configfs' was not declared. Should it be static?
>> On Wed, 2015-01-07 at 10:57 +0200, Sagi Grimberg wrote: >>> On 1/7/2015 12:44 AM, Nicholas A. Bellinger wrote: >>>> On Wed, 2015-01-07 at 03:35 +0800, kbuild test robot wrote: >>>> tree: >>>> git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git >>>> nvme_of >>>> head: 40d8c11927282d59855d645b35798edd97828da5 >>>> commit: 40d8c11927282d59855d645b35798edd97828da5 [3/3] nvme_of: Initial >>>> skeleton commit >>>> reproduce: >>>> # apt-get install sparse >>>> git checkout 40d8c11927282d59855d645b35798edd97828da5 >>>> make ARCH=x86_64 allmodconfig >>>> make C=1 CF=-D__CHECK_ENDIAN__ >>>> >>>> >>>> sparse warnings: (new ones prefixed by >>) >>>> >>>>>> drivers/target/nvme_of/nvme_of_configfs.c:25:31: sparse: symbol >>>>>> 'nvme_of_fabric_configfs' was not declared. Should it be static? >>>> >>>> Please review and possibly fold the followup patch. >>> >>> Fixed. Thanks Fengguang! >> >> Hey Nic, >> >> Hope all is well. >> >> So this skeleton is interesting to me. As this is a fabric module I >> assume this would be the NVMEoFabrics target mode driver for upstream >> which would be able to talk to NVMEoFabrics initiator (which by >> the way makes LIO something more than a SCSI target - will this be >> accepted?). > > That is currently the plan. > > It's still unclear how this will work wrt to existing target code, but > given the amount of logic that NVMe is borrowing from SCSI on the > command set side (PR, ALUA, VAAI, DIF), I'm sure there is an opportunity > for code sharing between NVMe and SCSI based fabrics. Right, but I assume there will be a different method to provision this stuff given that the target exposes NVME SQs. > >> I'm currently participating in nvmexpress working group defining the >> standard of protocol, specifically in the context of RDMA (naturally). > > Glad to hear that your involved. ;) Should say passively involved, at the moment I'm just following this stuff. > >> I'm interested in taking an active role in this project, It is >> important to build this layered from day one - separating the fabric >> logic (RDMA or FCoE) from the core layer. >> >> Moreover, I know that Mellanox has some plans on accelerating this >> area in future devices and we are currently looking into ways to >> do that. I want to see that the SW will be able to support that. >> >> So, Thoughts? > > FYI, Kiran and Dave (CC'ed) from Intel will be driving upstream NVME-OF > target development. > > I'll be primarily focused on figuring out how NVMe and SCSI can play > nicely in target-core, and helping out on the NVMe-OF side where > necessary. Is there something open that I can start playing around with? What is the status here? I imagine that the initiator is developed as well... Cheers, Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/2] target: Add TFO->complete_irq queue_work bypass
On 6/4/2015 10:06 AM, Nicholas A. Bellinger wrote: On Wed, 2015-06-03 at 14:57 +0200, Christoph Hellwig wrote: This makes lockdep very unhappy, rightly so. If you execute one end_io function inside another you basically nest every possible lock taken in the I/O completion path. Also adding more work to the hardirq path generally isn't a smart idea. Can you explain what issues you were seeing and how much this helps? Note that the workqueue usage in the target core so far is fairly basic, so there should be some low-hanging fruit.

So I've been using tcm_loop + RAMDISK backends for prototyping, but this patch is intended for vhost-scsi so it can avoid the unnecessary queue_work() context switch within target_complete_cmd() for all backend driver types. This is because vhost_work_queue() is just updating vhost_dev->work_list and immediately wake_up_process() into a different vhost_worker() process context. For heavy small block workloads into fast IBLOCK backends, avoiding this extra context switch should be a nice efficiency win.

I can see that, did you get a chance to measure the expected latency improvement?

Also, AFAIK RDMA fabrics are allowed to do ib_post_send() response callbacks directly from IRQ context as well.

This is correct in general, ib_post_send is not allowed to schedule. isert/srpt might benefit here in latency, but it would require the drivers to pre-allocate the sgls (ib_sge's) and use a worst-case approach (or use GFP_ATOMIC allocations - I'm not sure which is better...)
Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
Note that we originally allocated irqs this way, and Keith changed it a while ago for good reasons. So I'd really like to see good reasons for moving away from this, and some heuristics to figure out which way to use. E.g. if the device supports more irqs than I/O queues your scheme might always be fine.

I still don't understand what this buys us in practice. Seems redundant to allocate another vector without any (even marginal) difference.
Re: [PATCH v2 06/10] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
On 03/01/2018 01:40 AM, Logan Gunthorpe wrote: In order to use PCI P2P memory pci_p2pmem_[un]map_sg() functions must be called to map the correct DMA address. To do this, we add a flags variable and the RDMA_RW_CTX_FLAG_PCI_P2P flag. When the flag is specified use the appropriate map function. Signed-off-by: Logan Gunthorpe --- drivers/infiniband/core/rw.c| 21 + drivers/infiniband/ulp/isert/ib_isert.c | 5 +++-- drivers/infiniband/ulp/srpt/ib_srpt.c | 7 --- drivers/nvme/target/rdma.c | 6 +++--- include/rdma/rw.h | 7 +-- net/sunrpc/xprtrdma/svc_rdma_rw.c | 6 +++--- 6 files changed, 35 insertions(+), 17 deletions(-) diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c index c8963e91f92a..775a9f8b15a6 100644 --- a/drivers/infiniband/core/rw.c +++ b/drivers/infiniband/core/rw.c @@ -12,6 +12,7 @@ */ #include #include +#include #include #include @@ -269,18 +270,24 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp, * @remote_addr:remote address to read/write (relative to @rkey) * @rkey: remote key to operate on * @dir: %DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ + * @flags: any of the RDMA_RW_CTX_FLAG_* flags * * Returns the number of WQEs that will be needed on the workqueue if * successful, or a negative error code. */ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num, struct scatterlist *sg, u32 sg_cnt, u32 sg_offset, - u64 remote_addr, u32 rkey, enum dma_data_direction dir) + u64 remote_addr, u32 rkey, enum dma_data_direction dir, + unsigned int flags) { struct ib_device *dev = qp->pd->device; int ret; - ret = ib_dma_map_sg(dev, sg, sg_cnt, dir); + if (flags & RDMA_RW_CTX_FLAG_PCI_P2PDMA) + ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir); + else + ret = ib_dma_map_sg(dev, sg, sg_cnt, dir); + Why not use is_pci_p2pdma_page(sg) instead of a flag? It would be so much cleaner...
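A sketch of that suggestion, purely illustrative; it assumes all entries of a given scatterlist are of the same kind, which is what would make a per-page check viable in place of the explicit flag:

--
	/*
	 * Hypothetical alternative to the RDMA_RW_CTX_FLAG_PCI_P2PDMA flag:
	 * infer P2P from the pages themselves.  Assumes the scatterlist is
	 * homogeneous, i.e. either all P2P pages or all regular pages.
	 */
	if (is_pci_p2pdma_page(sg_page(sg)))
		ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
	else
		ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
--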
Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
Hi Everyone,

Hi Logan,

Here's v2 of our series to introduce P2P based copy offload to NVMe fabrics. This version has been rebased onto v4.16-rc3 which already includes Christoph's devpagemap work the previous version was based off as well as a couple of the cleanup patches that were in v1. Additionally, we've made the following changes based on feedback:

* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well as a bunch of cleanup and spelling fixes he pointed out in the last series.

* To address Alex's ACS concerns, we change to a simpler method of just disabling ACS behind switches for any kernel that has CONFIG_PCI_P2PDMA.

* We also reject using devices that employ 'dma_virt_ops' which should fairly simply handle Jason's concerns that this work might break with the HFI, QIB and rxe drivers that use the virtual ops to implement their own special DMA operations.

That's good, but what would happen for these devices? Simply fail the mapping, causing the ULP to fail its RDMA operation? I would think that we need a capability flag for devices that support it.
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
We create a configfs attribute in each nvme-fabrics target port to enable p2p memory use. When enabled, the port will only then use the p2p memory if a p2p memory device can be found which is behind the same switch as the RDMA port and all the block devices in use. If the user enabled it an no devices are found, then the system will silently fall back on using regular memory. If appropriate, that port will allocate memory for the RDMA buffers for queues from the p2pmem device falling back to system memory should anything fail. Nice. Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would save an extra PCI transfer as the NVME card could just take the data out of it's own memory. However, at this time, cards with CMB buffers don't seem to be available. Can you describe what would be the plan to have it when these devices do come along? I'd say that p2p_dev needs to become a nvmet_ns reference and not from nvmet_ctrl. Then, when cmb capable devices come along, the ns can prefer to use its own cmb instead of locating a p2p_dev device? +static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl, + struct nvmet_ns *ns) +{ + int ret; + + if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) { + pr_err("peer-to-peer DMA is not supported by %s\n", + ns->device_path); + return -EINVAL; I'd say that just skip it instead of failing it, in theory one can connect nvme devices via p2p mem and expose other devices in the same subsystem. The message would be pr_debug to reduce chattiness. + } + + ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns)); + if (ret) + pr_err("failed to add peer-to-peer DMA client %s: %d\n", + ns->device_path, ret); + + return ret; +} + int nvmet_ns_enable(struct nvmet_ns *ns) { struct nvmet_subsys *subsys = ns->subsys; @@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns) if (ret) goto out_blkdev_put; + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) { + if (ctrl->p2p_dev) { + ret = nvmet_p2pdma_add_client(ctrl, ns); + if (ret) + goto out_remove_clients; Is this really a fatal failure given that we fall-back to main memory? Why not continue with main memory (and warn at best)? +/* + * If allow_p2pmem is set, we will try to use P2P memory for the SGL lists for + * Ι/O commands. This requires the PCI p2p device to be compatible with the + * backing device for every namespace on this controller. + */ +static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl, struct nvmet_req *req) +{ + struct nvmet_ns *ns; + int ret; + + if (!req->port->allow_p2pmem || !req->p2p_client) + return; + + mutex_lock(&ctrl->subsys->lock); + + ret = pci_p2pdma_add_client(&ctrl->p2p_clients, req->p2p_client); + if (ret) { + pr_err("failed adding peer-to-peer DMA client %s: %d\n", + dev_name(req->p2p_client), ret); + goto free_devices; + } + + list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link) { + ret = nvmet_p2pdma_add_client(ctrl, ns); + if (ret) + goto free_devices; + } + + ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients); This is the first p2p_dev found right? What happens if I have more than a single p2p device? In theory I'd have more p2p memory I can use. Have you considered making pci_p2pmem_find return the least used suitable device? 
+ if (!ctrl->p2p_dev) { + pr_info("no supported peer-to-peer memory devices found\n"); + goto free_devices; + } + mutex_unlock(&ctrl->subsys->lock); + + pr_info("using peer-to-peer memory on %s\n", pci_name(ctrl->p2p_dev)); + return; + +free_devices: + pci_p2pdma_client_list_free(&ctrl->p2p_clients); + mutex_unlock(&ctrl->subsys->lock); +} + +static void nvmet_release_p2pmem(struct nvmet_ctrl *ctrl) +{ + if (!ctrl->p2p_dev) + return; + + mutex_lock(&ctrl->subsys->lock); + + pci_p2pdma_client_list_free(&ctrl->p2p_clients); + pci_dev_put(ctrl->p2p_dev); + ctrl->p2p_dev = NULL; + + mutex_unlock(&ctrl->subsys->lock); +} + u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp) { @@ -800,6 +890,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work); INIT_LIST_HEAD(&ctrl->async_events); + INIT_LIST_HEAD(&ctrl->p2p_clients); memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE); memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE); @@ -855,6 +946,7 @@ u16 nvmet_alloc_c
Re: [PATCH v2 09/10] nvme-pci: Add a quirk for a pseudo CMB
Looks fine, Reviewed-by: Sagi Grimberg
Re: [PATCH v2 08/10] nvme-pci: Add support for P2P memory in requests
For P2P requests we must use the pci_p2pmem_[un]map_sg() functions instead of the dma_map_sg functions. With that, we can then indicate PCI_P2P support in the request queue. For this, we create an NVME_F_PCI_P2P flag which tells the core to set QUEUE_FLAG_PCI_P2P in the request queue.

This looks fine to me, Reviewed-by: Sagi Grimberg

Any plans to add the capability to nvme-rdma? Should be straightforward... In theory, the use-case would be an RDMA backend fabric behind it. Shouldn't be hard to test either...
Re: [PATCH v2 05/10] block: Introduce PCI P2P flags for request and request queue
Looks fine, Reviewed-by: Sagi Grimberg
Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
On 01/03/18 04:03 AM, Sagi Grimberg wrote: Can you describe what would be the plan to have it when these devices do come along? I'd say that p2p_dev needs to become an nvmet_ns reference and not from nvmet_ctrl. Then, when cmb capable devices come along, the ns can prefer to use its own cmb instead of locating a p2p_dev device?

The patchset already supports CMB drives. That's essentially what patch 7 is for. We change the nvme-pci driver to use the p2pmem code to register and manage the CMB memory. After that, it is simply available to the nvmet code. We have already been using this with a couple prototype CMB drives.

The comment was to your statement: "Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would save an extra PCI transfer as the NVME card could just take the data out of its own memory. However, at this time, cards with CMB buffers don't seem to be available." Maybe it's a left-over which confused me... Anyways, my question still holds. If I rack several of these nvme drives, ideally we would use _their_ cmbs for I/O that is directed to these namespaces. This is why I was suggesting that p2p_dev should live in nvmet_ns and not in nvmet_ctrl as a single p2p_dev used by all namespaces. nvmet_ns seems like a more natural place to host p2p_dev with cmb in mind.

+static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl, + struct nvmet_ns *ns) +{ + int ret; + + if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) { + pr_err("peer-to-peer DMA is not supported by %s\n", + ns->device_path); + return -EINVAL;

I'd say that just skip it instead of failing it, in theory one can connect nvme devices via p2p mem and expose other devices in the same subsystem. The message would be pr_debug to reduce chattiness.

No, it's actually a bit more complicated than you make it. There are a couple of cases: 1) The user adds a namespace but there hasn't been a connection and no p2pmem has been selected. In this case the add_client function is never called and the user can add whatever namespace they like. 2) When the first connect happens, nvmet_setup_p2pmem() will call add_client() for each namespace and rdma device. If any of them fail then it does not use a P2P device and falls back, as you'd like, to regular memory. 3) When the user adds a namespace after a port is in use and a compatible P2P device has been found. In this case, if the user tries to add a namespace that is not compatible with the P2P device in use then it fails adding the new namespace. The only alternative here is to tear everything down, ensure there are no P2P enabled buffers open and start using regular memory again... That is very difficult.

Wouldn't it all be simpler if the p2p_dev resolution would be private to the namespace? So as it stands, when adding a namespace, all the namespaces in a subsystem must comply with using p2p? Seems a little bit harsh if it's not absolutely needed. Would be nice to export a subsystem between two ports (on two HCAs, across NUMA nodes) where the home node (primary path) would use p2p and failover would use host memory... Can you help me understand why this is absolutely not feasible?

I also disagree that these messages should be pr_debug. If a user is trying to use P2P memory and they enable it but the system doesn't choose to use their memory they must know why that is so they can make the necessary adjustments. If the system doesn't at least print a dmesg they are in the dark.

I was arguing that the user might have intentionally wanted to use p2p where possible but still expose other namespaces over host memory.
For this user, the messages are spam. I guess info/warn could also suffice (assuming we allow it, if we fail it then no point of the debug level discussion). + } + + ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns)); + if (ret) + pr_err("failed to add peer-to-peer DMA client %s: %d\n", + ns->device_path, ret); + + return ret; +} + int nvmet_ns_enable(struct nvmet_ns *ns) { struct nvmet_subsys *subsys = ns->subsys; @@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns) if (ret) goto out_blkdev_put; + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) { + if (ctrl->p2p_dev) { + ret = nvmet_p2pdma_add_client(ctrl, ns); + if (ret) + goto out_remove_clients; Is this really a fatal failure given that we fall-back to main memory? Why not continue with main memory (and warn at best)? See above. It's fatal because we are already using p2p memory and we can't easily tear that all down when a user adds a new namespace. In my mind, I/O to this namespace would use host memory and be done with it. I guess I still don't understand why this is not pos
Re: [PATCH v3] nvmet: fix nvmet_execute_write_zeroes function
Reviewed-by: Sagi Grimberg
Re: [PATCH] nvme-multipath: implement active-active round-robin path selector
For PCIe devices the right policy is not a round robin but to use the pcie device closer to the node. I did a prototype for that long ago and the concept can work. Can you look into that and also make that policy used automatically for PCIe devices? I think that active/active makes sense for fabrics (link throughput aggregation) but also for dual-ported pci devices (given that this is a real use-case). I agree that the default can be a home-node path selection.
Re: [PATCH] nvme-multipath: implement active-active round-robin path selector
@@ -85,7 +121,14 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
 	int srcu_idx;
 
 	srcu_idx = srcu_read_lock(&head->srcu);
-	ns = nvme_find_path(head);
+	switch (head->mpath_policy) {
+	case NVME_MPATH_ROUND_ROBIN:
+		ns = nvme_find_path_rr(head);
+		break;
+	case NVME_MPATH_ACTIVE_STANDBY:
+	default:
+		ns = nvme_find_path(head);
+	}

If we grow multiple path selectors, it would be more elegant to use a callout mechanism; see the sketch below.
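A minimal sketch of such a callout, purely illustrative; the selector table, the 'selector' field and nvme_find_path_rr() are assumptions from the patch under review, not existing upstream code:

--
struct nvme_path_selector {
	const char *name;
	struct nvme_ns *(*select)(struct nvme_ns_head *head);
};

static const struct nvme_path_selector nvme_path_selectors[] = {
	{ .name = "active-standby",	.select = nvme_find_path },
	{ .name = "round-robin",	.select = nvme_find_path_rr },
};

	/* in nvme_ns_head_make_request(), instead of a switch statement: */
	ns = head->selector->select(head);
--

Adding another policy then only means adding a table entry instead of touching the I/O path.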
Re: [PATCH 2/4] nvmet-rdma: Use new SGL alloc/free helper for requests
@@ -565,24 +565,24 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp, { struct rdma_cm_id *cm_id = rsp->queue->cm_id; u64 addr = le64_to_cpu(sgl->addr); - u32 len = get_unaligned_le24(sgl->length); u32 key = get_unaligned_le32(sgl->key); int ret; + rsp->req.transfer_len = get_unaligned_le24(sgl->length); + IIRC, this might result in nvmet-rdma executing data-transfer even for failed requests in some error cases. I'm not sure this is the only case, but have you tested what happens in error cases? See nvmet_rdma_need_data_in()/nvmet_rdma_need_data_out() which look at transfer_len. /* no data command? */ - if (!len) + if (!rsp->req.transfer_len) return 0; - rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt); - if (!rsp->req.sg) - return NVME_SC_INTERNAL; + ret = nvmet_req_alloc_sgl(&rsp->req, &rsp->queue->nvme_sq); + if (ret < 0) + goto error_out; ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num, rsp->req.sg, rsp->req.sg_cnt, 0, addr, key, nvmet_data_dir(&rsp->req)); if (ret < 0) - return NVME_SC_INTERNAL; - rsp->req.transfer_len += len; + goto error_out; rsp->n_rdma += ret; if (invalidate) { @@ -591,6 +591,10 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp, } return 0; + +error_out: + rsp->req.transfer_len = 0; + return NVME_SC_INTERNAL; } static u16 nvmet_rdma_map_sgl(struct nvmet_rdma_rsp *rsp)
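For reference, the helpers mentioned above gate the data transfer on transfer_len roughly as follows; this is paraphrased from memory and slightly simplified, so treat field names as approximate:

--
static inline bool nvmet_rdma_need_data_in(struct nvmet_rdma_rsp *rsp)
{
	/* host-to-controller data that was not carried inline with the command */
	return nvme_is_write(rsp->req.cmd) && rsp->req.transfer_len &&
		!(rsp->flags & NVMET_RDMA_REQ_INLINE_DATA);
}

static inline bool nvmet_rdma_need_data_out(struct nvmet_rdma_rsp *rsp)
{
	/* controller-to-host data, only for commands that completed successfully */
	return !nvme_is_write(rsp->req.cmd) && rsp->req.transfer_len &&
		!rsp->req.rsp->status;
}
--

The concern is that setting transfer_len before the SGL is validated means these helpers may report that a failed request still needs a data phase.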
Re: [PATCH 3/4] nvmet-fc: Don't use the count returned by the dma_map_sg call
Reviewed-by: Sagi Grimberg
Re: [PATCH 0/3] Provide more fine grained control over multipathing
[so much for putting out flames... :/]

This projecting onto me that I've not been keeping the conversation technical is in itself hostile. Sure I get frustrated and lash out (as I'm _sure_ you'll feel in this reply).

You're right, I do feel this is lashing out. And I don't appreciate it. Please stop it. We're not going to make progress otherwise.

Can you (or others) please try and articulate why a "fine grained" multipathing is an absolute must? At the moment, I just don't understand.

Already made the point multiple times in this thread [3][4][5][1]. Hint: it is about the users who have long-standing expertise and automation built around dm-multipath and multipath-tools. BUT those same users may need/want to simultaneously use native NVMe multipath on the same host. Dismissing this point or acting like I haven't articulated it just illustrates to me continuing this conversation is not going to be fruitful.

The vast majority of the points are about the fact that people still need to be able to use multipath-tools, which they still can today. Personally, I question the existence of this user base you are referring to which would want to maintain both dm-multipath and nvme personalities at the same time on the same host. But I do want us to make progress, so I will have to take this need as a given.

I agree with Christoph that changing personality on the fly is going to be painful. This opt-in will need to be one-shot at connect time. For that, we will probably need to also expose an argument in nvme-cli too. Changing the mpath personality will need to involve disconnecting the controller and connecting again with the argument toggled. I think this is the only sane way to do this.

Another path we can make progress in is user visibility. We have topology in place and you mentioned primary path (which we could probably add). What else do you need for multipath-tools to support nvme?
Re: [PATCH v2] nvme: trace: add disk name to tracepoints
Add disk name to tracepoints so we can better destinguish between individual disks in the trace output. Signed-off-by: Johannes Thumshirn Reviewed-by: Sagi Grimberg Nit: s/destinguish/distinguish/g Christoph, can you fix it up when applying or you want me to do it?
Re: [PATCH 0/3] Provide more fine grained control over multipathing
We plan to implement all the fancy NVMe standards like ANA, but it seems that there is still a requirement to let the host side choose policies about how to use paths (round-robin vs least queue depth for example). Even in the modern SCSI world with VPD pages and ALUA, there are still knobs that are needed. Maybe NVMe will be different and we can find defaults that work in all cases but I have to admit I'm skeptical... The sensible thing to do in nvme is to use different paths for different queues. Huh? different paths == different controllers so this sentence can't be right... you mean that a path selector will select a controller based on the home node of the local rdma device connecting to it and the running cpu right?
Re: [PATCH v2 0/2] Adding per-controller timeout support to nvme
As different nvme controllers are connected via different fabrics, some require different timeout settings than others. This series implements per-controller timeouts in the nvme subsystem which can be set via sysfs.

How much of a real issue is this? block io_timeout defaults to 30 seconds, which is considered a universal eternity for pretty much any nvme fabric. Moreover, io_timeout is mutable already on a per-namespace level. This leaves the admin_timeout which goes beyond this to 60 seconds... Can you describe what exactly you are trying to solve?
Re: [PATCH 1/3] drivers: nvme: target: core: fix build break
Reviewed-by: Sagi Grimberg
Re: [PATCH v2 0/2] Adding per-controller timeout support to nvme
As different nvme controllers are connected via different fabrics, some require different timeout settings than others. This series implements per-controller timeouts in the nvme subsystem which can be set via sysfs.

How much of a real issue is this? block io_timeout defaults to 30 seconds, which is considered a universal eternity for pretty much any nvme fabric. Moreover, io_timeout is mutable already on a per-namespace level. This leaves the admin_timeout which goes beyond this to 60 seconds... Can you describe what exactly you are trying to solve?

I think they must have an nvme target that is backed by slow media (i.e. non-SSD). If that's the case, I think it may be a better option if the target advertises relatively shallow queue depths and/or lower MDTS that better aligns to the backing storage capabilities.

It isn't that the media is slow; the max timeout is based on the SLA for certain classes of "fabric" outages. Linux copes *really* badly with I/O errors, and if we can make the timeout last long enough to cover the switch restart worst case, then users are a lot happier.

Well, what is usually done to handle fabric outages is having multiple paths to the storage device, not sure if that is applicable for you or not... What do you mean by "Linux copes *really* badly with I/O errors"? What can be done better?
Re: [PATCH] nvmet: disable direct I/O when unavailable
Frankly, we have a ton of testing related special cases in the kernel. This one is a) simple and small, only 10 LoC, b) far away from the fast path or any other place where it could have any impact on legitimate users and c) it prints an informal message showing you what happened. Sorry but this is a https://xkcd.com/386/ moment. How about we just fail it with a proper error message and let the user set buffered_io instead of trying to get smart here...
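A rough sketch of that alternative, based on the existing buffered_io attribute; the exact open path and message wording are illustrative, not a quote of the eventual patch:

--
	flags = O_RDWR | O_LARGEFILE;
	if (!ns->buffered_io)
		flags |= O_DIRECT;

	ns->file = filp_open(ns->device_path, flags, 0);
	if (IS_ERR(ns->file)) {
		pr_err("failed to open file %s: (%ld); if the backing "
		       "filesystem does not support O_DIRECT, set buffered_io=1\n",
		       ns->device_path, PTR_ERR(ns->file));
		return PTR_ERR(ns->file);
	}
--

The user then makes the choice explicitly instead of the target silently switching I/O modes.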
Re: [PATCH] RDMA/mlx4: Spread completion vectors for proxy CQs
I was thinking of the stuff in core/cq.c - but it also doesn't have automatic comp_vector balancing. It is the logical place to put something like that though.. An API to manage a bundle of CPU affine CQ's is probably what most ULPs really need.. (it makes little sense to create a unique CQ for every QP)

ULPs behave way differently. E.g. RDS creates one tx and one rx CQ per QP. As I wrote earlier, we do not have any modify_cq() that changes the comp_vector (EQ association). We can balance #CQ associated with the EQs, but we do not know their behaviour. So, assume 2 completion EQs, and four CQs. CQa and CQb are associated with the first EQ, the two others with the second EQ. That's the "best" we can do. But, if CQa and CQb are the only ones generating events, we will have all interrupt processing on a single CPU. But if we now could modify CQa.comp_vector to be that of the second EQ, we could achieve balance. But not sure if the drivers are able to do this at all.

alloc_bundle()

You mean alloc a bunch of CQs? How do you know their #cqes and cq_context? Håkon

get_cqn_for_flow(bundle) alloc_qp() destroy_qp() put_cqn_for_flow(bundle) destroy_bundle();

Let the core code balance the cqn's and allocate (shared) CQ resources. Jason

I sent a simple patchset back in the day for it [1], IIRC there was some resistance to having multiple ULPs implicitly share the same completion queues:

[1]: -- RDMA/core: Add implicit per-device completion queue pools Allow a ULP to ask the core to implicitly assign a completion queue to a queue-pair based on a least-used search on a per-device cq pool. The device CQ pools grow in a lazy fashion with every QP creation. In addition, expose an affinity hint for a queue pair creation. If passed, the core will attempt to attach a CQ with a completion vector that is directed to the cpu core as the affinity hint provided. Signed-off-by: Sagi Grimberg --

That one added implicit QP create flags:

--
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index bdb1279a415b..56d42e753eb4 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1098,11 +1098,22 @@ enum ib_qp_create_flags {
 	IB_QP_CREATE_SCATTER_FCS	= 1 << 8,
 	IB_QP_CREATE_CVLAN_STRIPPING	= 1 << 9,
 	IB_QP_CREATE_SOURCE_QPN		= 1 << 10,
+
+	/* only used by the core, not passed to low-level drivers */
+	IB_QP_CREATE_ASSIGN_CQS		= 1 << 24,
+	IB_QP_CREATE_AFFINITY_HINT	= 1 << 25,
+
--

Then I modified it to add an ib_cq_pool that a ULP can allocate privately and then get/put CQs from/to.

[2]: -- IB/core: Add a simple CQ pool API Using CQ pools is useful especially for target/server modes. The server/target implementation will usually serve multiple clients and will usually have an array of completion queues allocated for that. In addition, usually the server/target implementation will use a least-used scheme to select a completion vector for each completion queue in order to achieve better parallelism. Having the server/target rdma queue-pairs share completion queues as much as possible is desirable as it allows for better completion aggregation. One downside of this approach is that some entries of the completion queues might never be used in case the queue-pair sizes are not fixed. This simple CQ pool API allows for both optimizations and exposes a simple API to alloc/free a completion queue pool and get/put from the pool. The pool starts by allocating a caller-defined batch of CQs, and grows in batches in a lazy fashion.
Signed-off-by: Sagi Grimberg -- That one had the CQ pool API: -- +struct ib_cq_pool *ib_alloc_cq_pool(struct ib_device *device, int nr_cqe, + int nr_cqs, enum ib_poll_context poll_ctx); +void ib_free_cq_pool(struct ib_cq_pool *pool); +void ib_cq_pool_put(struct ib_cq *cq, unsigned int nents); +struct ib_cq *ib_cq_pool_get(struct ib_cq_pool *pool, unsigned int nents); -- I can try to revive this if this becomes interesting again to anyone.. Thoughts?
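For the sake of discussion, usage with that proposed (never-merged) pool API would have looked roughly like this; the names follow the RFC above and are not an existing upstream interface:

--
	/* at device/target setup: one pool of CPU-spread CQs per device */
	pool = ib_alloc_cq_pool(ibdev, 4096 /* nr_cqe */, num_online_cpus(),
				IB_POLL_WORKQUEUE);

	/* per queue-pair: borrow the least-used CQ with room for this QP */
	cq = ib_cq_pool_get(pool, queue_size);
	init_attr.send_cq = cq;
	init_attr.recv_cq = cq;
	/* ... create the QP ... */

	/* on queue teardown: return the CQ entries to the pool */
	ib_cq_pool_put(cq, queue_size);

	/* at device teardown */
	ib_free_cq_pool(pool);
--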
Re: [PATCH] nvme-tcp: Check if request has started before processing it
blk_mq_tag_to_rq() will always return a request if the command_id is in the valid range. Check if the request has been started. If we blindly process the request we might double complete a request which can be fatal. How did you get to this one? did the controller send a completion for a completed/bogus request?
Re: [PATCH] nvme-tcp: Check if request has started before processing it
blk_mq_tag_to_rq() will always return a request if the command_id is in the valid range. Check if the request has been started. If we blindly process the request we might double complete a request, which can be fatal.

How did you get to this one? Did the controller send a completion for a completed/bogus request? If that is the case, then that must mean it's possible the driver could have started the command id just before the bogus completion check. Data corruption, right?

Yes, which is why I don't think this check is very useful..
Re: [PATCH] nvme: convert sysfs sprintf/snprintf family to sysfs_emit
Reviewed-by: Sagi Grimberg
Re: [PATCH v2] nvme-multipath: Early exit if no path is available
You can't see exactly where it dies but I followed the assembly to nvme_round_robin_path(). Maybe it's not the initial nvme_next_ns(head, old) which returns NULL but nvme_next_ns() is returning NULL eventually (list_next_or_null_rcu()).

So there is another bug causing nvme_next_ns to misbehave. I reviewed the code around head->list and head->current_path, and I found 2 bugs that may cause this: First, I already sent the patch, see: https://lore.kernel.org/linux-nvme/20210128033351.22116-1-lengc...@huawei.com/ Second, in nvme_ns_remove, list_del_rcu is before nvme_mpath_clear_current_path. This may cause "old" to be deleted from the "head" while "old" is still being used. I'm not sure there's any other consideration here, I will check it and try to fix it.

The reason why we first remove from head->list and only then clear current_path is because the other way around there is no way to guarantee that the ns won't be assigned as current_path again (because it is in head->list). nvme_ns_remove fences the rest of the ns deletion by synchronizing the srcu, such that the current_path clearance is guaranteed to be visible.
Re: [PATCH v2] nvme-multipath: Early exit if no path is available
You can't see exactly where it dies but I followed the assembly to nvme_round_robin_path(). Maybe it's not the initial nvme_next_ns(head, old) which returns NULL but nvme_next_ns() is returning NULL eventually (list_next_or_null_rcu()).

So there is another bug causing nvme_next_ns to misbehave. I reviewed the code around head->list and head->current_path, and I found 2 bugs that may cause this: First, I already sent the patch, see: https://lore.kernel.org/linux-nvme/20210128033351.22116-1-lengc...@huawei.com/ Second, in nvme_ns_remove, list_del_rcu is before nvme_mpath_clear_current_path. This may cause "old" to be deleted from the "head" while "old" is still being used. I'm not sure there's any other consideration here, I will check it and try to fix it.

The reason why we first remove from head->list and only then clear current_path is because the other way around there is no way to guarantee that the ns won't be assigned as current_path again (because it is in head->list).

ok, I see.

nvme_ns_remove fences the rest of the ns deletion by synchronizing the srcu, such that the current_path clearance is guaranteed to be visible.

The list will be like this: head->next = ns1; ns1->next = head; old->next = ns1; This may cause an infinite loop in nvme_round_robin_path: for (ns = nvme_next_ns(head, old); ns != old; ns = nvme_next_ns(head, ns)) The ns will always be ns1, and then an infinite loop.

Who is being removed? I'm not following
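For reference, the traversal being discussed looks roughly like this (paraphrased; head->list holds the sibling namespaces, and the helper wraps around to the first entry when it reaches the end):

--
static struct nvme_ns *nvme_next_ns(struct nvme_ns_head *head,
				    struct nvme_ns *ns)
{
	ns = list_next_or_null_rcu(&head->list, &ns->siblings,
				   struct nvme_ns, siblings);
	if (ns)
		return ns;
	/* wrap around: start again from the first sibling */
	return list_first_or_null_rcu(&head->list, struct nvme_ns, siblings);
}
--

If "old" has already been unlinked from head->list but is still passed in, the wrap-around can keep returning the same remaining sibling, which is the infinite-loop scenario described above.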
Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143
Greeting, FYI, we noticed the following commit (built with gcc-9): commit: c804af2c1d3152c0cf877eeb50d60c2d49ac0cf0 ("IB/srpt: use new shared CQ mechanism") https://git.kernel.org/cgit/linux/kernel/git/rdma/rdma.git for-next in testcase: blktests with following parameters: test: srp-group1 ucode: 0x21 on test machine: 4 threads Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz with 4G memory caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace): If you fix the issue, kindly add following tag Reported-by: kernel test robot user :notice: [ 44.688140] 2020-08-01 16:10:22 ./check srp/001 srp/002 srp/003 srp/004 srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 srp/013 srp/015 user :notice: [ 44.706657] srp/001 (Create and remove LUNs) user :notice: [ 44.718405] srp/001 (Create and remove LUNs) [passed] user :notice: [ 44.729902] runtime ... 1.972s user :notice: [ 99.038748] IPMI BMC is not supported on this machine, skip bmc-watchdog setup! user :notice: [ 3699.039790] Sat Aug 1 17:11:22 UTC 2020 detected soft_timeout user :notice: [ 3699.060341] kill 960 /usr/bin/time -v -o /tmp/lkp/blktests.time /lkp/lkp/src/tests/blktests Yamin and Max, can you take a look at this? The SRP tests from the blktests repository pass reliably with kernel version v5.7 and before. With label next-20200731 from linux-next however that test triggers the following hang: I will look into it. FWIW, I ran into this as well with nvme-rdma, but it also reproduces when I revert the shared CQ patch from nvme-rdma. Another data point is that my tests passes with siw.
Re: [PATCH] nvmet-passthru: Reject commands with non-sgl flags set
Reviewed-by: Sagi Grimberg
Re: [PATCH] nvmet: fix opps in nvmet_execute_passthru_cmd()
This patch adds a check in nvmet_execute_passthru_cmd() to prevent the following oops :- Why LKML and not linux-nvme?
Re: [PATCH] nvmet: fix opps in nvmet_execute_passthru_cmd()
This patch adds a check in nvmet_execute_passthru_cmd() to prevent the following oops :- Why LKML and not linux-nvme? My bad (+linux-nvme). It doesn't have the patch. Can you resend?