Re: [PATCH 0/3] nvme: Don't add namespaces for locked drives

2016-06-19 Thread Sagi Grimberg



Hi all,

If an NVMe drive is locked with ATA Security, most commands sent to the drive
will fail. This includes commands sent by the kernel upon discovery to probe
for partitions. The failures happen in such a way that trying to do anything
with the drive (e.g. sending an unlock command, or unloading the nvme module) is
basically impossible given the high default command timeout.

This patch adds a check to see if the drive is locked, and if it is, its
namespaces are not initialized. It is expected that userspace will send the
proper "security send/unlock" command and then reset the controller. Userspace
tools are available at [1].
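
For illustration only, a rough sketch of the idea (nvme_ata_security_locked()
is a hypothetical stand-in for the check the series adds via Security Receive):
--
/* Hypothetical sketch: gate namespace initialization on the lock state. */
static void nvme_scan_namespaces(struct nvme_ctrl *ctrl)
{
	if (nvme_ata_security_locked(ctrl))
		return;	/* userspace will unlock and reset the controller */

	/* ... usual namespace probing and partition scanning ... */
}
--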

This is my first kernel patch so please let me know if you have any feedback.

I intend to also submit a future patch that tracks ATA Security commands sent
from userspace and remembers the password so it can be submitted to a locked
drive upon pm_resume. (still WIP)

Jethro Beekman

[1] https://github.com/jethrogb/nvme-ata-security

Jethro Beekman (3):
   nvme: When scanning namespaces, make sure the drive is not locked
   nvme: Add function for NVMe security receive command
   nvme: Check if drive is locked using ATA Security


Hey Jethro,

I think it would make better sense to squash patches 1,3 together and
have patch 2 come before them:

patch 1: nvme: Add function for NVMe security receive command
patch 2: nvme: Check if drive is locked using ATA Security when scanning namespaces


Re: [rfc] weirdness in bio_map_user_iov()

2016-09-23 Thread Sagi Grimberg

Hey Al,


What happens if we feed it a 3-element iovec array, one page in each?
AFAICS, bio_add_pc_page() is called for each of those pages, even if
the previous calls have failed - break is only out of the inner loop.

Sure, failure due to exceeded request size means that everything after
that one will fail, but what of e.g.
/*
 * If the queue doesn't support SG gaps and adding this
 * offset would create a gap, disallow it.
 */
if (bvec_gap_to_prev(q, prev, offset))
return 0;
in there?  Won't we risk having the first and the third pages added, with
the second one quietly skipped?  Jens, looks like it had come from you
(by way of jejb).  Am I missing something subtle here?


So AFAICT, 'gappy' iovecs will never reach bio_map_user_iov() because the
gap condition is checked earlier, in blk_rq_map_user_iov():

	if (map_data)
		copy = true;
	else if (iov_iter_alignment(iter) & align)
		copy = true;
	else if (queue_virt_boundary(q))
		copy = queue_virt_boundary(q) & iov_iter_gap_alignment(iter);


If iov_iter_gap_alignment(iter) detects gaps, the request takes a nicely
aligned bounce-buffer path via bio_copy_user_iov().


Re: [PATCH 11/13] nvme: switch to use pci_alloc_irq_vectors

2016-09-23 Thread Sagi Grimberg



On 14/09/16 07:18, Christoph Hellwig wrote:

Use the new helper to automatically select the right interrupt type, as
well as to use the automatic interrupt affinity assignment.


Patch title and the change description are a little short IMO to
describe what is going on here (need the blk-mq side too).

I'd also think it would be better to split this to 2 patches but
really not a must...


+static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
+{
+   struct nvme_dev *dev = set->driver_data;
+
+   return blk_mq_pci_map_queues(set, to_pci_dev(dev->dev));
+}
+


Question: is using pci_alloc_irq_vectors() obligated for
supplying blk-mq with the device affinity mask?

If I do this completely-untested [1] what will happen?

[1]:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 8d2875b4c56d..76693d406efe 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1518,6 +1518,14 @@ static void nvme_rdma_complete_rq(struct request *rq)
blk_mq_end_request(rq, error);
 }

+static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
+{
+   struct nvme_rdma_ctrl *ctrl = set->driver_data;
+   struct device *dev = ctrl->device->dev.dma_device;
+
+   return blk_mq_pci_map_queues(set, to_pci_dev(dev));
+}
+
 static struct blk_mq_ops nvme_rdma_mq_ops = {
.queue_rq   = nvme_rdma_queue_rq,
.complete   = nvme_rdma_complete_rq,
@@ -1528,6 +1536,7 @@ static struct blk_mq_ops nvme_rdma_mq_ops = {
.init_hctx  = nvme_rdma_init_hctx,
.poll   = nvme_rdma_poll,
.timeout= nvme_rdma_timeout,
+   .map_queues = nvme_rdma_map_queues,
 };

 static struct blk_mq_ops nvme_rdma_admin_mq_ops = {
--


Re: [PATCH] nvme: Fix nvme_get/set_features() with a NULL result pointer

2016-08-24 Thread Sagi Grimberg

Looks fine,

Reviewed-by: Sagi Grimberg 


Re: [PATCH] nvme: add hostid token to fabric options

2017-06-20 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH v2 08/15] nvmet: make config_item_type const

2017-10-17 Thread Sagi Grimberg

Acked-by: Sagi Grimberg 


Re: [PATCH V3] nvme: fix multiple ctrl removal scheduling

2017-05-30 Thread Sagi Grimberg

Hi Rakesh,

this looks reasonable, but we'll need to also adapt the non-PCI
drivers to the new state machine.  I can give this a spin.

At that point we probably want to move nvme_reset into common
code somehow.


Hi Guys, sorry for barging in late, I've been way too busy with
internal stuff lately...

I think that adding a new state should (a) be added with careful
understanding that its absolutely needed and (b) does not complicate
the state machine.

I honestly think that adding a new state that says "we scheduled a
reset" to address a synchronization issue is not what we should do.

1. I think that state NVME_CTRL_RESETTING semantically means that
the reset flow has been scheduled and the state transition atomicity
suffices for synchronization. So nvme_reset should change the state
and, if that succeeded, schedule the reset_work instead of changing the
state inside reset_work (like we do in fabrics). At this point we should
lose the WARN_ON (a minimal sketch of this follows after point 2).

2. I personally think that nvme_probe shouldn't necessarily trigger
controller reset, if we can split reset to a couple of useful routines
we can reuse them in nvme_probe. The reason is that for reset we need
to address various conditions (errors, ongoing traffic etc...) that
are not relevant at all for probe. Not sure if anyone agrees with me
on this one.
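
A minimal sketch of point 1 above (illustrative only; the workqueue and
reset_work names follow what the nvme core already uses):
--
/* Sketch: transition to RESETTING first, and only then queue the reset
 * work, so the state change itself provides the serialization. */
int nvme_reset(struct nvme_ctrl *ctrl)
{
	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
		return -EBUSY;
	if (!queue_work(nvme_wq, &ctrl->reset_work))
		return -EBUSY;
	return 0;
}
--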



I started to experiment with trying to move some of this to nvme core [1]
(rdma and loop) but have yet to fully convert pci, which is a bit more
complicated.

[1] git://git.infradead.org/users/sagi/linux.git 
nvme-central-reset-delete-err


Re: [PATCH V2] nvme: fix nvme_remove going to uninterruptible sleep for ever

2017-05-30 Thread Sagi Grimberg



/*
+* Avoid configuration and syncing commands if controller is already
+* being removed and queues have been killed.
+*/
+   if (ctrl->state == NVME_CTRL_DELETING || ctrl->state == NVME_CTRL_DEAD)
+   return;
+


Hey Rakesh, Christoph,

Given that the issue is for sync command submission during controller
removal, I'm wondering if we should perhaps move this check to
__nvme_submit_sync_cmd?

AFAICT user-space can just as easily trigger set_features under the same
conditions, which would trigger the hang, couldn't it?
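
Purely as an illustration of the suggestion (assuming a controller reference
is reachable from the submission path; the exact plumbing is not shown):
--
/* Hypothetical sketch: fail sync commands early once the controller is
 * being removed, instead of checking only in this one caller. */
static bool nvme_ctrl_is_going_away(struct nvme_ctrl *ctrl)
{
	return ctrl->state == NVME_CTRL_DELETING ||
	       ctrl->state == NVME_CTRL_DEAD;
}

/* ...and near the top of __nvme_submit_sync_cmd():
 *	if (nvme_ctrl_is_going_away(ctrl))
 *		return -ENODEV;
 */
--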


Re: [PATCH V2 1/1] nvme: fix multiple ctrl removal scheduling

2017-05-30 Thread Sagi Grimberg



Allowing multiple resets can result in multiple controller removal as
well if different conditions inside nvme_reset_work fail and which
might deadlock on device_release_driver.

This patch makes sure that the work queue item (reset_work) is added only
if the controller state != NVME_CTRL_RESETTING. That is achieved by
moving the state change out of nvme_reset_work into nvme_reset and
removing the old work_busy call.  The state change is always synchronized
using the controller spinlock.


So, the reason the state is changed when the work is running rather than at
queueing time is the window in which the state may be set to NVME_CTRL_DELETING,
and we don't want the reset work to proceed in that case.

What do you think about adding a new state, like NVME_CTRL_SCHED_RESET,
then leaving the NVME_CTRL_RESETTING state change as-is?


OK, just got to this one.

Instead of adding yet another state, how about making controller delete
cancel the reset_work (cancel_work_sync)?
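
A minimal sketch of that alternative (illustrative; the delete-path function
name is assumed):
--
/* Hypothetical sketch: make controller deletion cancel a queued or running
 * reset instead of adding a new state to fence it off. */
static void nvme_del_ctrl(struct nvme_ctrl *ctrl)
{
	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING))
		return;
	cancel_work_sync(&ctrl->reset_work);	/* wait out any pending reset */
	/* ... proceed with teardown ... */
}
--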


Re: [PATCH v3] nvme/pci: remap BAR0 to cover admin CQ doorbell for large stride

2017-05-30 Thread Sagi Grimberg

Looks good to me,

Reviewed-by: Sagi Grimberg 


Re: [PATCH] nvme: host: core: fix NULL pointer dereference in nvme_pr_command

2017-11-13 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH] IB/iser: fix a comment typo

2015-10-05 Thread Sagi Grimberg

On 10/4/2015 12:00 PM, Geliang Tang wrote:

Just fix a typo in the code comment.

Signed-off-by: Geliang Tang 


Acked-by: Sagi Grimberg 


Re: [PATCH] IB/core: avoid 32-bit warning

2015-10-07 Thread Sagi Grimberg

On 10/7/2015 3:29 PM, Arnd Bergmann wrote:

The INIT_UDATA() macro requires a pointer or unsigned long argument for
both input and output buffer, and all callers had a cast in place from when
the code was merged until a recent restructuring, so now we get

core/uverbs_cmd.c: In function 'ib_uverbs_create_cq':
core/uverbs_cmd.c:1481:66: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]

This makes the code behave as before by adding back the cast to
unsigned long.

Signed-off-by: Arnd Bergmann 
Fixes: 565197dd8fb1 ("IB/core: Extend ib_uverbs_create_cq")

diff --git a/drivers/infiniband/core/uverbs_cmd.c 
b/drivers/infiniband/core/uverbs_cmd.c
index be4cb9f04be3..88b3b78340f2 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1478,7 +1478,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file,
if (copy_from_user(&cmd, buf, sizeof(cmd)))
return -EFAULT;

-   INIT_UDATA(&ucore, buf, cmd.response, sizeof(cmd), sizeof(resp));
+   INIT_UDATA(&ucore, buf, (unsigned long)cmd.response, sizeof(cmd), sizeof(resp));


Would it make sense to cast inside INIT_UDATA() and not have callers
worry about it?
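
Something along these lines (a sketch only; the actual macro body in uverbs
may differ):
--
/* Hypothetical sketch: hide the integer-to-pointer conversion inside the
 * macro so callers can pass u64 response addresses directly. */
#define INIT_UDATA(udata, ibuf, obuf, ilen, olen)				\
	do {									\
		(udata)->inbuf	= (const void __user *)(unsigned long)(ibuf);	\
		(udata)->outbuf	= (void __user *)(unsigned long)(obuf);		\
		(udata)->inlen	= (ilen);					\
		(udata)->outlen	= (olen);					\
	} while (0)
--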


Re: [PATCH] iscsi-target: return -ENOMEM instead of -1 in case of failed kmalloc()

2015-10-20 Thread Sagi Grimberg

On 10/19/2015 11:18 PM, Luis de Bethencourt wrote:

Smatch complains about returning hard coded error codes, silence this
warning.

drivers/target/iscsi/iscsi_target_parameters.c:211 iscsi_create_default_params() warn: returning -1 instead of -ENOMEM is sloppy

Signed-off-by: Luis de Bethencourt 


Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH v4 1/1] nvme: trace: add disk name to tracepoints

2018-06-25 Thread Sagi Grimberg




On 06/25/2018 10:08 AM, Johannes Thumshirn wrote:

On Tue, Jun 19, 2018 at 05:09:27PM +0300, Sagi Grimberg wrote:

We are going to need it for traffic based keep alive to update
that we saw a completion to extend the kato.

But I suggest you simply keep a ctrl reference in struct nvme_request
instead so you don't need to pass it to nvme_complete_req (that's what
I did for traffic based keep alive).


Do you have a patch for this around? IIRC I started this (as Christoph
also suggested it) but it turned out to be quite a big refactoring
work.


How about the below? Patch #1 is what you are looking for, patch #2 is
a slightly modified version that applies on top of #1. Let me know what you
think...

[1]:
--
nvme: cache struct nvme_ctrl reference to struct nvme_request

We will need to reference the controller in the setup and
completion time for tracing and future traffic based keep alive
support.

Signed-off-by: Sagi Grimberg 

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e8cdb5409725..f53416619905 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -390,6 +390,7 @@ static inline void nvme_clear_nvme_request(struct 
request *req)

if (!(req->rq_flags & RQF_DONTPREP)) {
nvme_req(req)->retries = 0;
nvme_req(req)->flags = 0;
+   nvme_req(req)->ctrl = NULL;
req->rq_flags |= RQF_DONTPREP;
}
 }
@@ -622,8 +623,8 @@ static inline blk_status_t nvme_setup_rw(struct 
nvme_ns *ns,

return 0;
 }

-blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
-   struct nvme_command *cmd)
+blk_status_t nvme_setup_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+   struct request *req, struct nvme_command *cmd)
 {
blk_status_t ret = BLK_STS_OK;

@@ -652,6 +653,7 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, 
struct request *req,

}

cmd->common.command_id = req->tag;
+   nvme_req(req)->ctrl = ctrl;
if (ns)
trace_nvme_setup_nvm_cmd(req->q->id, cmd);
else
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index b528a2f5826c..99f683ed079e 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2274,7 +2274,7 @@ nvme_fc_queue_rq(struct blk_mq_hw_ctx *hctx,
!nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
return nvmf_fail_nonready_command(rq);

-   ret = nvme_setup_cmd(ns, rq, sqe);
+   ret = nvme_setup_cmd(&ctrl->ctrl, ns, rq, sqe);
if (ret)
return ret;

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 231807cbc849..e4a2145f3c9a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -102,6 +102,7 @@ struct nvme_request {
u8  retries;
u8  flags;
u16 status;
+   struct nvme_ctrl*ctrl;
 };

 /*
@@ -419,8 +420,8 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl);
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
struct nvme_command *cmd, blk_mq_req_flags_t flags, int 
qid);

-blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
-   struct nvme_command *cmd);
+blk_status_t nvme_setup_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+   struct request *req, struct nvme_command *cmd);
 int nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command 
*cmd,

void *buf, unsigned bufflen);
 int __nvme_submit_sync_cmd(struct request_queue *q, struct 
nvme_command *cmd,

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fc33804662e7..377e08c70666 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -877,7 +877,7 @@ static blk_status_t nvme_queue_rq(struct 
blk_mq_hw_ctx *hctx,

if (unlikely(nvmeq->cq_vector < 0))
return BLK_STS_IOERR;

-   ret = nvme_setup_cmd(ns, req, &cmnd);
+   ret = nvme_setup_cmd(&dev->ctrl, ns, req, &cmnd);
if (ret)
return ret;

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index f621920af823..4de8017da484 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1691,7 +1691,7 @@ static blk_status_t nvme_rdma_queue_rq(struct 
blk_mq_hw_ctx *hctx,

ib_dma_sync_single_for_cpu(dev, sqe->dma,
sizeof(struct nvme_command), DMA_TO_DEVICE);

-   ret = nvme_setup_cmd(ns, rq, c);
+   ret = nvme_setup_cmd(&queue->ctrl->ctrl, ns, rq, c);
if (ret)
return ret;

diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index d8d91f04bd7e..888bd3fefc4d 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -164,7 +164,7 @@ static blk_status_t nvme_lo

Re: [PATCH v5 1/2] nvme: add tracepoint for nvme_setup_cmd

2018-01-23 Thread Sagi Grimberg

Looks good Johannes,

Reviewed-by: Sagi Grimberg 


Re: [PATCH v5 2/2] nvme: add tracepoint for nvme_complete_rq

2018-01-23 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH RESENT] nvme-pci: introduce RECONNECTING state to mark initializing procedure

2018-01-23 Thread Sagi Grimberg

This looks fine to me, but we need Keith to ack on this.


Re: [RFC] NVMe Configuraiton using sysctl

2017-05-15 Thread Sagi Grimberg



Hi,


Hi Oza,


we are configuring interrupt coalescing for NVMe, but right now it uses a
module param, so the same interrupt coalescing settings get applied to all
the NVMe devices connected to different RCs.

Ideally it should be configurable via sysctl.


If at all, I would place this in nvme-cli (via ioctl) instead of
sysctl.
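
As a concrete illustration of the nvme-cli/ioctl route (not part of any patch
here): interrupt coalescing is NVMe Set Features FID 0x08, whose cdw11 carries
the aggregation threshold in bits 7:0 and the aggregation time, in 100us
increments, in bits 15:8. A minimal helper to build that value:
--
#include <stdint.h>

/* Sketch: build cdw11 for Set Features FID 0x08 (Interrupt Coalescing).
 * The value could then be passed via the admin ioctl, e.g.
 * "nvme set-feature /dev/nvme0 -f 0x08 -v <value>" (assumed syntax). */
static uint32_t nvme_irq_coalescing_cdw11(uint8_t time_100us, uint8_t threshold)
{
	return ((uint32_t)time_100us << 8) | threshold;
}
--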


For e.g.
sysctl should provide an interface to change
per-CPU IO queue pairs, interrupt coalescing settings, etc.


My personal feeling is that percpu granularity is a lot to take in for
the user, and also can yield some unexpected performance
characteristics. But I might be wrong here..


Please suggest whether we could have/implement a sysctl module for NVMe?


I have asked this before, but interrupt coalescing has very little
merit without being able to be adaptive. net drivers maintain online
stats and schedule interrupt coalescing modifications.

Should work in theory, but having said that, interrupt coalescing as a
whole is essentially unusable in nvme since the coalescing time limit
is in units of 100us increments...


Re: [PATCH] nvme-pci: use NOWAIT flag for nvme_set_host_mem

2018-01-29 Thread Sagi Grimberg

Hey Keith,


Thanks for the fix. It looks like we still have a problem, though.
Commands submitted with the "shutdown_lock" held need to be able to make
forward progress without relying on a completion, but this one could
block indefinitely.


Can you explain to me why the shutdown_lock is needed to synchronize
nvme_dev_disable? More concretely, how is nvme_dev_disable different
from other places where we rely on the ctrl state to serialize stuff?

The only reason I see would be to protect against a completion-after-abort
scenario, but I think the block layer should protect against it (it checks
whether the request timeout timer fired).


Re: [Suspected-Phishing]Re: [PATCH V3 1/2] nvme: split resetting state into reset_prepate and resetting

2018-01-17 Thread Sagi Grimberg



After Sagi's "nvme-rdma: fix concurrent reset and reconnect", the rdma ctrl state
is changed to RECONNECTING after the clearing and shutdown work and before the
initializing procedure, in both the reset work path and the error recovery path.
The fc reset work also does the same thing.
So if we define RESET_PREPARE to include the scheduling gap and the disable and
clear work, and RESETTING to include the initializing procedure, then
RECONNECTING is very similar to RESETTING.

Maybe we could do like this;
In nvme fc/rdma
- set state to RESET_PREPARE, queue reset_work/err_work
- clear/shutdown works, set state to RECONNECTING


Should be fine.


In nvme pci
- set state to RESET_PREPARE, queue reset_work
- clear/shutdown works, set state to RESETTING
- initialization, set state to LIVE


Given that we split reset state and we have a clear symmetry between
the transports, do we want to maybe come up with a unique state that is
coherent across all transports?

Maybe we rename them to NVME_CTRL_SHUTTING_DOWN and
NVME_CTRL_ESTABLISHING? I'm open to better names...
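
In other words, something like this (a sketch of the naming suggestion only,
not a proposed patch):
--
/* Hypothetical transport-agnostic state split */
enum nvme_ctrl_state {
	NVME_CTRL_NEW,
	NVME_CTRL_LIVE,
	NVME_CTRL_SHUTTING_DOWN,	/* teardown: quiesce, disable/clear queues */
	NVME_CTRL_ESTABLISHING,		/* re-initialization after teardown */
	NVME_CTRL_DELETING,
	NVME_CTRL_DEAD,
};
--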


Re: [Suspected-Phishing]Re: [PATCH V3 1/2] nvme: split resetting state into reset_prepate and resetting

2018-01-17 Thread Sagi Grimberg

see V4 discussion.. :)


I didn't see any v4


Re: [PATCH] nvme-pci: calculate iod and avg_seg_size just before use them

2018-01-14 Thread Sagi Grimberg



The calculation of iod and avg_seg_size may be meaningless if
nvme_pci_use_sgls returns before using them. So calculate them
just before they are used.


The compiler will do the right thing here, but I see what you mean. I
think Christoph has some SGL improvements coming, though, that will
obviate this.


I think that if it's not coming for 4.16, it should be easy enough to
take it in.

FWIW,

Reviewed-by: Sagi Grimberg 


Re: [PATCH V3 1/2] nvme: split resetting state into reset_prepate and resetting

2018-01-14 Thread Sagi Grimberg



Currently, the ctrl->state is changed to NVME_CTRL_RESETTING
before queueing the reset work. This is not strict enough. There could be
a big gap before the reset_work callback is invoked. In addition,
there is some disable work in the reset_work callback which is, strictly
speaking, not part of the reset work and could lead to some confusion.

This patch splits NVME_CTRL_RESETTING into NVME_CTRL_RESET_PREPARE
and NVME_CTRL_RESETTING. Before queueing the reset work, the state changes
to NVME_CTRL_RESET_PREPARE; after the disable work completes, the state changes
to NVME_CTRL_RESETTING.


What guarantees that nvme_timeout will do the right thing? If it
is relying on the fact that nvme-pci sets the state to RESETTING
only _after_ quiescing the request queues, that needs to be documented,
as it's a really delicate dependency.





Suggested-by: Christoph Hellwig 
Signed-off-by: Jianchao Wang 
---
  drivers/nvme/host/core.c   | 17 +++--
  drivers/nvme/host/fc.c |  2 ++
  drivers/nvme/host/nvme.h   |  1 +
  drivers/nvme/host/pci.c| 28 ++--
  drivers/nvme/host/rdma.c   |  8 
  drivers/nvme/target/loop.c |  5 +
  6 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1e46e60..106a437 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -87,7 +87,7 @@ static __le32 nvme_get_log_dw10(u8 lid, size_t size)
  
  int nvme_reset_ctrl(struct nvme_ctrl *ctrl)

  {
-   if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+   if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESET_PREPARE))
return -EBUSY;
if (!queue_work(nvme_wq, &ctrl->reset_work))
return -EBUSY;
@@ -243,7 +243,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
break;
}
break;
-   case NVME_CTRL_RESETTING:
+   case NVME_CTRL_RESET_PREPARE:
switch (old_state) {
case NVME_CTRL_NEW:
case NVME_CTRL_LIVE:
@@ -253,10 +253,21 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
break;
}
break;
+
+   case NVME_CTRL_RESETTING:
+   switch (old_state) {
+   case NVME_CTRL_RESET_PREPARE:
+   changed = true;
+   /* FALLTHRU */
+   default:
+   break;
+   }
+   break;
case NVME_CTRL_RECONNECTING:
switch (old_state) {
case NVME_CTRL_LIVE:
case NVME_CTRL_RESETTING:
+   case NVME_CTRL_RESET_PREPARE:
changed = true;
/* FALLTHRU */
default:
@@ -267,6 +278,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
switch (old_state) {
case NVME_CTRL_LIVE:
case NVME_CTRL_RESETTING:
+   case NVME_CTRL_RESET_PREPARE:
case NVME_CTRL_RECONNECTING:
changed = true;
/* FALLTHRU */
@@ -2603,6 +2615,7 @@ static ssize_t nvme_sysfs_show_state(struct device *dev,
[NVME_CTRL_NEW] = "new",
[NVME_CTRL_LIVE]= "live",
[NVME_CTRL_RESETTING]   = "resetting",
+   [NVME_CTRL_RESET_PREPARE]   = "reset-prepare",
[NVME_CTRL_RECONNECTING]= "reconnecting",
[NVME_CTRL_DELETING]= "deleting",
[NVME_CTRL_DEAD]= "dead",
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 794e66e..516c1ea 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -547,6 +547,7 @@ nvme_fc_resume_controller(struct nvme_fc_ctrl *ctrl)
break;
  
  	case NVME_CTRL_RESETTING:

+   case NVME_CTRL_RESET_PREPARE:
/*
 * Controller is already in the process of terminating the
 * association. No need to do anything further. The reconnect
@@ -790,6 +791,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
break;
  
  	case NVME_CTRL_RESETTING:

+   case NVME_CTRL_RESET_PREPARE:
/*
 * Controller is already in the process of terminating the
 * association.  No need to do anything further. The reconnect


James, can you take a look at this?


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f5800c3..e477c35 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1141,8 +1141,13 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 
csts)
bool nssro = dev->subsystem && (csts & NVME_CSTS_NSSRO);
  
  	/* If there is a reset ongoing, we shouldn't reset again. */

-   if (dev->ctrl.state == NVME_CTRL_RESETTING)
+   switch (dev->ctrl.state) {
+   case NVME_CTRL_RESETTING:
+   case

Re: [PATCH V3 2/2] nvme-pci: fix the timeout case when reset is ongoing

2018-01-14 Thread Sagi Grimberg

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e477c35..0530432 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1212,19 +1212,26 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
}
  
  	/*

-* Shutdown immediately if controller times out while starting. The
-* reset work will see the pci device disabled when it gets the forced
-* cancellation error. All outstanding requests are completed on
-* shutdown, so we return BLK_EH_HANDLED.
+* There could be two kinds of expired reqs when reset is ongoing.
+*  - Outstanding IO or admin requests from previous work before the
+*nvme_reset_work invokes nvme_dev_disable. Handle them as the
+*nvme_cancel_request.
+*  - Outstanding admin requests from the initializing procedure.
+*Set NVME_REQ_CANCELLED flag on them, then nvme_reset_work will
+*see the error, then disable the device and remove the ctrl.
 */
-   if (dev->ctrl.state == NVME_CTRL_RESETTING) {
-   dev_warn(dev->ctrl.device,
-"I/O %d QID %d timeout, disable controller\n",
-req->tag, nvmeq->qid);
-   nvme_dev_disable(dev, false);
+   switch (dev->ctrl.state) {
+   case NVME_CTRL_RESET_PREPARE:
+   nvme_req(req)->status = NVME_SC_ABORT_REQ;
+   return BLK_EH_HANDLED;
+   case NVME_CTRL_RESETTING:
+   WARN_ON_ONCE(nvmeq->qid);
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
return BLK_EH_HANDLED;
+   default:
+   break;
}
+


Indentation is off...


Re: [PATCH] drviers/nvme/target: fix max dup length for kstrndup

2018-01-14 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-03-05 Thread Sagi Grimberg



-   if (nvmeq->sq_cmds_io)
-   memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
-   else
-   memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
+   memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));


Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
the _toio() variant enforces alignment, does the copy with 4 byte
stores, and has a full barrier after the copy. In comparison our
regular memcpy() does none of those things and may use unaligned and
vector load/stores. For normal (cacheable) memory that is perfectly
fine, but they can cause alignment faults when targeted at MMIO
(cache-inhibited) memory.

I think in this particular case it might be ok since we know SQEs are
aligned to 64 byte boundaries and the copy is too small to use our
vectorised memcpy(). I'll assume we don't need explicit ordering
between writes of SQEs since the existing code doesn't seem to care
unless the doorbell is being rung, so you're probably fine there too.
That said, I still think this is a little bit sketchy and at the very
least you should add a comment explaining what's going on when the CMB
is being used. If someone more familiar with the NVMe driver could
chime in I would appreciate it.


I may not be understanding the concern, but I'll give it a shot.

You're right, the start of any SQE is always 64-byte aligned, so that
should satisfy alignment requirements.

The order when writing multiple/successive SQEs in a submission queue
does matter, and this is currently serialized through the q_lock.

The order in which the bytes of a single SQE is written doesn't really
matter as long as the entire SQE is written into the CMB prior to writing
that SQ's doorbell register.

The doorbell register is written immediately after copying a command
entry into the submission queue (ignore "shadow buffer" features),
so the doorbells written to commands submitted is 1:1.

If a CMB SQE and DB order is not enforced with the memcpy, then we do
need a barrier after the SQE's memcpy and before the doorbell's writel.


Keith, while we're on this, regardless of the CMB, is SQE memcpy and DB
update ordering always guaranteed?


If you look at mlx4 (an RDMA device driver), which works exactly the same
way as nvme, you will find:
--
qp->sq.head += nreq;

/*
 * Make sure that descriptors are written before
 * doorbell record.
 */
wmb();

writel(qp->doorbell_qpn,
   to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL);


/*
 * Make sure doorbells don't leak out of SQ spinlock
 * and reach the HCA out of order.
 */
mmiowb();
--

That seems to explicitly make sure to place a barrier before updating
the doorbell. So as I see it, either ordering is guaranteed and the
above code is redundant, or nvme needs to do the same.

Thoughts?
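
For reference, the nvme-pci equivalent of the mlx4 snippet would look roughly
like this (a sketch; whether the explicit barrier is actually required is
exactly the open question here):
--
/* Sketch: copy the SQE, then make sure it is visible before the doorbell
 * write, mirroring mlx4's wmb() before writel(). */
static void nvme_submit_cmd_sketch(struct nvme_queue *nvmeq,
				   struct nvme_command *cmd, u16 tail)
{
	memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
	if (++tail == nvmeq->q_depth)
		tail = 0;
	wmb();		/* order SQE write vs. doorbell, if not already implied */
	writel(tail, nvmeq->q_db);
}
--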


Re: [PATCH V4 2/5] nvme: add helper interface to flush in-flight requests

2018-03-08 Thread Sagi Grimberg



On 03/08/2018 08:19 AM, Jianchao Wang wrote:

Currently, we use nvme_cancel_request to complete the request
forcibly. This has the following defects:
  - It is not safe to race with the normal completion path.
blk_mq_complete_request is ok to race with the timeout path,
but not with itself.
  - It cannot ensure all the requests have been handled. The timeout
path may grab some expired requests, which blk_mq_complete_request
cannot touch.

Add two helper interfaces to flush in-flight requests more safely.
- nvme_abort_requests_sync
use nvme_abort_req to timeout all the in-flight requests and wait
until timeout work and irq completion path completes. More details
please refer to the comment of this interface.
- nvme_flush_aborted_requests
complete the requests 'aborted' by nvme_abort_requests_sync. It will
be invoked after the controller is disabled/shutdown.

Signed-off-by: Jianchao Wang 
---
  drivers/nvme/host/core.c | 96 
  drivers/nvme/host/nvme.h |  4 +-
  2 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7b8df47..e26759b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3567,6 +3567,102 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
  }
  EXPORT_SYMBOL_GPL(nvme_start_queues);
  
+static void nvme_abort_req(struct request *req, void *data, bool reserved)

+{
+   if (!blk_mq_request_started(req))
+   return;
+
+   dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
+   "Abort I/O %d", req->tag);
+
+   /* The timeout path need identify this flag and return
+* BLK_EH_NOT_HANDLED, then the request will not be completed.
+* we will defer the completion after the controller is disabled or
+* shutdown.
+*/
+   set_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags);
+   blk_abort_request(req);
+}
+
+/*
+ * This function will ensure all the in-flight requests on the
+ * controller to be handled by timeout path or irq completion path.
+ * It has to pair with nvme_flush_aborted_requests which will be
+ * invoked after the controller has been disabled/shutdown and
+ * complete the requests 'aborted' by nvme_abort_req.
+ *
+ * Note, the driver layer will not be quiescent before disable
+ * controller, because the requests aborted by blk_abort_request
+ * are still active and the irq will fire at any time, but it can
+ * not enter into completion path, because the request has been
+ * timed out.
+ */
+void nvme_abort_requests_sync(struct nvme_ctrl *ctrl)
+{
+   struct nvme_ns *ns;
+
+   blk_mq_tagset_busy_iter(ctrl->tagset, nvme_abort_req, ctrl);
+   blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_abort_req, ctrl);
+   /*
+* ensure the timeout_work is queued, thus needn't to sync
+* the timer
+*/
+   kblockd_schedule_work(&ctrl->admin_q->timeout_work);
+
+   down_read(&ctrl->namespaces_rwsem);
+
+   list_for_each_entry(ns, &ctrl->namespaces, list)
+   kblockd_schedule_work(&ns->queue->timeout_work);
+
+   list_for_each_entry(ns, &ctrl->namespaces, list)
+   flush_work(&ns->queue->timeout_work);
+
+   up_read(&ctrl->namespaces_rwsem);
+   /* This will ensure all the nvme irq completion path have exited */
+   synchronize_sched();
+}
+EXPORT_SYMBOL_GPL(nvme_abort_requests_sync);
+
+static void nvme_comp_req(struct request *req, void *data, bool reserved)


Not a very good name...


+{
+   struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
+
+   if (!test_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags))
+   return;
+
+   WARN_ON(!blk_mq_request_started(req));
+
+   if (ctrl->tagset && ctrl->tagset->ops->complete) {


What happens when this is called on the admin tagset, when the controller
does not have an io tagset?


+   clear_bit(NVME_REQ_ABORTED, &nvme_req(req)->flags);
+   /*
+* We set the status to NVME_SC_ABORT_REQ, then ioq request
+* will be requeued and adminq one will be failed.
+*/
+   nvme_req(req)->status = NVME_SC_ABORT_REQ;
+   /*
+* For ioq request, blk_mq_requeue_request should be better
+* here. But the nvme code will still setup the cmd even if
+* the RQF_DONTPREP is set. We have to use .complete to free
+* the cmd and then requeue it.


IMO, it's better to fix nvme to not set up the command if RQF_DONTPREP is
set (other than the things it must set up).



+*
+* For adminq request, invoking .complete directly will miss
+* blk_mq_sched_completed_request, but this is ok because we
+* won't have io scheduler for adminq.
+*/
+   ctrl->tagset->ops->complete(req);


I don't think that directly calling .complete 

Re: [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Sagi Grimberg




Not sure I understand the 'blocking' problem in this case.

We can build a bvec table from this req, and send them all
in send(),


I would like to avoid growing bvec tables and keep everything
preallocated. Plus, a bvec_iter operates on a bvec which means
we'll need a table there as well... Not liking it so far...


can this way avoid your blocking issue? You may see this
example in branch 'rq->bio != rq->biotail' of lo_rw_aio().


This is exactly an example of not ignoring the bios...


If this way is what you need, I think you are right; we may even
introduce the following helpers:

rq_for_each_bvec()
rq_bvecs()


I'm not sure how this helps me either. Unless we can set a bvec_iter to
span bvecs or have an abstract bio crossing when we re-initialize the
bvec_iter I don't see how I can ignore bios completely...


So it looks like the nvme-tcp host driver might be the 2nd driver which
benefits from multi-page bvecs directly.

The multi-page bvec V11 has passed my tests and addressed almost
all the comments during review on V10. I removed bio_vecs() in V11,
but it won't be big deal, we can introduce them anytime when there
is the requirement.


multipage-bvecs and nvme-tcp are going to conflict, so it would be good
to coordinate on this. I think that the nvme-tcp host needs some adjustments,
such as setting a bvec_iter. I'm under the impression that the change is
rather small and self-contained, but I'm not sure I have the full
picture here.
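
For context, the helpers mentioned above would presumably look something like
this (hypothetical sketch on top of the multi-page bvec work, not merged code):
--
/* Hypothetical request-level multi-page bvec iterator */
#define rq_for_each_bvec(bvl, rq, iter)			\
	__rq_for_each_bio(iter.bio, rq)			\
		bio_for_each_bvec(bvl, iter.bio, iter.iter)
--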


Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-11-29 Thread Sagi Grimberg




This does not hold at least for NVMe RDMA host driver. An example scenario
is when the RDMA connection is gone while the controller is being deleted.
In this case, the nvmf_reg_write32() for sending shutdown admin command by
the delete_work could be hung forever if the command is not completed by
the timeout handler.


If the queue is gone, this means that the queue has already been flushed and
any commands that were inflight have completed with a flush error
completion...

Can you describe the scenario that caused this hang? When has the
queue became "gone" and when did the shutdown command execute?


Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-07 Thread Sagi Grimberg




Could you please take a look at this bug and review the code?

We are seeing more instances of this bug and found that reconnect_work
could hang as well, as can be seen from below stacktrace.

  Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
  Call Trace:
  __schedule+0x2ab/0x880
  schedule+0x36/0x80
  schedule_timeout+0x161/0x300
  ? __next_timer_interrupt+0xe0/0xe0
  io_schedule_timeout+0x1e/0x50
  wait_for_completion_io_timeout+0x130/0x1a0
  ? wake_up_q+0x80/0x80
  blk_execute_rq+0x6e/0xa0
  __nvme_submit_sync_cmd+0x6e/0xe0
  nvmf_connect_admin_queue+0x128/0x190 [nvme_fabrics]
  ? wait_for_completion_interruptible_timeout+0x157/0x1b0
  nvme_rdma_start_queue+0x5e/0x90 [nvme_rdma]
  nvme_rdma_setup_ctrl+0x1b4/0x730 [nvme_rdma]
  nvme_rdma_reconnect_ctrl_work+0x27/0x70 [nvme_rdma]
  process_one_work+0x179/0x390
  worker_thread+0x4f/0x3e0
  kthread+0x105/0x140
  ? max_active_store+0x80/0x80
  ? kthread_bind+0x20/0x20

This bug is produced by setting MTU of RoCE interface to '568' for
test while running I/O traffics.


I think that with the latest changes from Keith we can no longer rely
on blk-mq to barrier racing completions. We will probably need
to barrier ourselves in nvme-rdma...


[PATCH v5 10/13] nvme-tcp: Add protocol header

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

Signed-off-by: Sagi Grimberg 
---
 include/linux/nvme-tcp.h | 189 +++
 include/linux/nvme.h |   1 +
 2 files changed, 190 insertions(+)
 create mode 100644 include/linux/nvme-tcp.h

diff --git a/include/linux/nvme-tcp.h b/include/linux/nvme-tcp.h
new file mode 100644
index ..03d87c0550a9
--- /dev/null
+++ b/include/linux/nvme-tcp.h
@@ -0,0 +1,189 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NVMe over Fabrics TCP protocol header.
+ * Copyright (c) 2018 Lightbits Labs. All rights reserved.
+ */
+
+#ifndef _LINUX_NVME_TCP_H
+#define _LINUX_NVME_TCP_H
+
+#include 
+
+#define NVME_TCP_DISC_PORT 8009
+#define NVME_TCP_ADMIN_CCSZ	SZ_8K
+#define NVME_TCP_DIGEST_LENGTH 4
+
+enum nvme_tcp_pfv {
+   NVME_TCP_PFV_1_0 = 0x0,
+};
+
+enum nvme_tcp_fatal_error_status {
+   NVME_TCP_FES_INVALID_PDU_HDR= 0x01,
+   NVME_TCP_FES_PDU_SEQ_ERR= 0x02,
+   NVME_TCP_FES_HDR_DIGEST_ERR = 0x03,
+   NVME_TCP_FES_DATA_OUT_OF_RANGE  = 0x04,
+   NVME_TCP_FES_R2T_LIMIT_EXCEEDED = 0x05,
+   NVME_TCP_FES_DATA_LIMIT_EXCEEDED= 0x05,
+   NVME_TCP_FES_UNSUPPORTED_PARAM  = 0x06,
+};
+
+enum nvme_tcp_digest_option {
+   NVME_TCP_HDR_DIGEST_ENABLE  = (1 << 0),
+   NVME_TCP_DATA_DIGEST_ENABLE = (1 << 1),
+};
+
+enum nvme_tcp_pdu_type {
+   nvme_tcp_icreq  = 0x0,
+   nvme_tcp_icresp = 0x1,
+   nvme_tcp_h2c_term   = 0x2,
+   nvme_tcp_c2h_term   = 0x3,
+   nvme_tcp_cmd= 0x4,
+   nvme_tcp_rsp= 0x5,
+   nvme_tcp_h2c_data   = 0x6,
+   nvme_tcp_c2h_data   = 0x7,
+   nvme_tcp_r2t= 0x9,
+};
+
+enum nvme_tcp_pdu_flags {
+   NVME_TCP_F_HDGST= (1 << 0),
+   NVME_TCP_F_DDGST= (1 << 1),
+   NVME_TCP_F_DATA_LAST= (1 << 2),
+   NVME_TCP_F_DATA_SUCCESS = (1 << 3),
+};
+
+/**
+ * struct nvme_tcp_hdr - nvme tcp pdu common header
+ *
+ * @type:  pdu type
+ * @flags: pdu specific flags
+ * @hlen:  pdu header length
+ * @pdo:   pdu data offset
+ * @plen:  pdu wire byte length
+ */
+struct nvme_tcp_hdr {
+   __u8type;
+   __u8flags;
+   __u8hlen;
+   __u8pdo;
+   __le32  plen;
+};
+
+/**
+ * struct nvme_tcp_icreq_pdu - nvme tcp initialize connection request pdu
+ *
+ * @hdr:   pdu generic header
+ * @pfv:   pdu version format
+ * @hpda:  host pdu data alignment (dwords, 0's based)
+ * @digest:digest types enabled
+ * @maxr2t:maximum r2ts per request supported
+ */
+struct nvme_tcp_icreq_pdu {
+   struct nvme_tcp_hdr hdr;
+   __le16  pfv;
+   __u8hpda;
+   __u8digest;
+   __le32  maxr2t;
+   __u8rsvd2[112];
+};
+
+/**
+ * struct nvme_tcp_icresp_pdu - nvme tcp initialize connection response pdu
+ *
+ * @hdr:   pdu common header
+ * @pfv:   pdu version format
+ * @cpda:  controller pdu data alignment (dwords, 0's based)
+ * @digest:digest types enabled
+ * @maxdata:   maximum data capsules per r2t supported
+ */
+struct nvme_tcp_icresp_pdu {
+   struct nvme_tcp_hdr hdr;
+   __le16  pfv;
+   __u8cpda;
+   __u8digest;
+   __le32  maxdata;
+   __u8rsvd[112];
+};
+
+/**
+ * struct nvme_tcp_term_pdu - nvme tcp terminate connection pdu
+ *
+ * @hdr:   pdu common header
+ * @fes:   fatal error status
+ * @fei:   fatal error information
+ */
+struct nvme_tcp_term_pdu {
+   struct nvme_tcp_hdr hdr;
+   __le16  fes;
+   __le32  fei;
+   __u8rsvd[8];
+};
+
+/**
+ * struct nvme_tcp_cmd_pdu - nvme tcp command capsule pdu
+ *
+ * @hdr:   pdu common header
+ * @cmd:   nvme command
+ */
+struct nvme_tcp_cmd_pdu {
+   struct nvme_tcp_hdr hdr;
+   struct nvme_command cmd;
+};
+
+/**
+ * struct nvme_tcp_rsp_pdu - nvme tcp response capsule pdu
+ *
+ * @hdr:   pdu common header
+ * @hdr:   nvme-tcp generic header
+ * @cqe:   nvme completion queue entry
+ */
+struct nvme_tcp_rsp_pdu {
+   struct nvme_tcp_hdr hdr;
+   struct nvme_completion  cqe;
+};
+
+/**
+ * struct nvme_tcp_r2t_pdu - nvme tcp ready-to-transfer pdu
+ *
+ * @hdr:   pdu common header
+ * @command_id:nvme command identifier which this relates to
+ * @ttag:  transfer tag (controller generated)
+ * @r2t_offset:offset from the start of the command data
+ * @r2t_length:length the host is allowed to send
+ */
+struct nvme_tcp_r2t_pdu {
+   

[PATCH v5 03/13] iov_iter: pass void csum pointer to csum_and_copy_to_iter

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

The single caller to csum_and_copy_to_iter is skb_copy_and_csum_datagram
and we are trying to unite its logic with skb_copy_datagram_iter by passing
a callback to the copy function that we want to apply. Thus, we need
to make the checksum pointer private to the function.

Acked-by: David S. Miller 
Signed-off-by: Sagi Grimberg 
---
 include/linux/uio.h | 2 +-
 lib/iov_iter.c  | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 55ce99ddb912..41d1f8d3313d 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -266,7 +266,7 @@ static inline void iov_iter_reexpand(struct iov_iter *i, 
size_t count)
 {
i->count = count;
 }
-size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum, 
struct iov_iter *i);
+size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump, 
struct iov_iter *i);
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct 
iov_iter *i);
 bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum, 
struct iov_iter *i);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 7ebccb5c1637..db93531ca3e3 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1432,10 +1432,11 @@ bool csum_and_copy_from_iter_full(void *addr, size_t 
bytes, __wsum *csum,
 }
 EXPORT_SYMBOL(csum_and_copy_from_iter_full);
 
-size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum,
+size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump,
 struct iov_iter *i)
 {
const char *from = addr;
+   __wsum *csum = csump;
__wsum sum, next;
size_t off = 0;
sum = *csum;
-- 
2.17.1



[PATCH v5 02/13] datagram: open-code copy_page_to_iter

2018-12-03 Thread Sagi Grimberg
From: Sagi Grimberg 

This will be useful to consolidate skb_copy_and_hash_datagram_iter and
skb_copy_and_csum_datagram to a single code path.

Acked-by: David S. Miller 
Signed-off-by: Sagi Grimberg 
---
 net/core/datagram.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 57f3a6fcfc1e..abe642181b64 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -445,11 +445,14 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int 
offset,
 
end = start + skb_frag_size(frag);
if ((copy = end - offset) > 0) {
+   struct page *page = skb_frag_page(frag);
+   u8 *vaddr = kmap(page);
+
if (copy > len)
copy = len;
-   n = copy_page_to_iter(skb_frag_page(frag),
- frag->page_offset + offset -
- start, copy, to);
+   n = copy_to_iter(vaddr + frag->page_offset +
+offset - start, copy, to);
+   kunmap(page);
offset += n;
if (n != copy)
goto short_copy;
-- 
2.17.1



Re: [PATCH v4 1/1] nvme: trace: add disk name to tracepoints

2018-06-26 Thread Sagi Grimberg




Looks good (both #1 if we want to use the two patch version or #2).

I have no idea what I did wrong when I tried it.


You want to submit or should I?


Re: [PATCH v4 1/1] nvme: trace: add disk name to tracepoints

2018-06-26 Thread Sagi Grimberg




You want to submit or should I?


Whatever is easier for you


You're the owner, you should go ahead with it. Thanks!


Re: [PATCH] nvme: trace: add disk name to tracepoints

2018-06-27 Thread Sagi Grimberg




Not related to your patch, but I did notice that the req->q->id isn't
really useful here since that's not the hardware context identifier.
That's just some ida assigned software identifier. For the admin command
completion trace, it's actually a little confusing to see the qid in the
trace.


It was actually requested by Martin so we can easily see which request
got dispatched/completed on which request queue.


Would be good in the future to display the hwqid, but we'll need some work
before we can do that.

Maybe change the display to request_queue_id=%d?


Re: [PATCH v4 1/2] nvme: cache struct nvme_ctrl reference to struct nvme_request

2018-06-27 Thread Sagi Grimberg




@@ -652,6 +653,7 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct 
request *req,
}
  
  	cmd->common.command_id = req->tag;

+   nvme_req(req)->ctrl = ctrl;
if (ns)
trace_nvme_setup_nvm_cmd(req->q->id, cmd);
else


I don't think we need to do this per-io. The request comes from the
controller's tagset, so we can do this just once at .init_request time,
right?


Agreed, but as it's Sagi's patch I'd like to hear his opinion on this
before changing it.


That's a good idea!

Seems to work... Johannes, can you respin with this version?
Please CC James and Keith on it too.
--
nvme: cache struct nvme_ctrl reference to struct nvme_request

We will need to reference the controller in the setup and
completion time for tracing and future traffic based keep alive
    support.

Signed-off-by: Sagi Grimberg 

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 41d45a1b5c62..9cc33752539a 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -1737,6 +1737,7 @@ nvme_fc_init_request(struct blk_mq_tag_set *set, 
struct request *rq,

int queue_idx = (set == &ctrl->tag_set) ? hctx_idx + 1 : 0;
struct nvme_fc_queue *queue = &ctrl->queues[queue_idx];

+   nvme_req(rq)->ctrl = &ctrl->ctrl;
return __nvme_fc_init_request(ctrl, queue, op, rq, queue->rqcnt++);
 }

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0c4a33df3b2f..f2249387b60d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -102,6 +102,7 @@ struct nvme_request {
u8  retries;
u8  flags;
u16 status;
+   struct nvme_ctrl*ctrl;
 };

 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ba943f211687..d2bd60d8315a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -418,6 +418,7 @@ static int nvme_init_request(struct blk_mq_tag_set 
*set, struct request *req,


BUG_ON(!nvmeq);
iod->nvmeq = nvmeq;
+   nvme_req(req)->ctrl = &dev->ctrl;
return 0;
 }

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 20db67d53ffa..e4087caf8754 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -292,6 +292,7 @@ static int nvme_rdma_init_request(struct 
blk_mq_tag_set *set,

return ret;

req->queue = queue;
+   nvme_req(rq)->ctrl = &ctrl->ctrl;

return 0;
 }
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index d8d91f04bd7e..af7fbf4132b0 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -227,6 +227,7 @@ static int nvme_loop_init_request(struct 
blk_mq_tag_set *set,

 {
struct nvme_loop_ctrl *ctrl = set->driver_data;

+   nvme_req(req)->ctrl = &ctrl->ctrl;
return nvme_loop_init_iod(ctrl, blk_mq_rq_to_pdu(req),
(set == &ctrl->tag_set) ? hctx_idx + 1 : 0);
 }
--


Re: [PATCH] nvme: trace: add disk name to tracepoints

2018-06-28 Thread Sagi Grimberg




Thinking more on this, not using the hw qid really limits the utility
of these trace events: we may not be able to match a completion
to the submission without it since cmdid alone isn't enough to match up
the two events.

Here's an updated proposal and actually tested. I was also able to
combine admin and io submissions.


This looks good!

One thing that is missing is the controller id when we have multiple
controllers/subsystems in the host like:

TP_printk("nvme%d: qid=%d, nsid=%u, cmdid=%u, flags=0x%x, meta=0x%llx, 
cmd=(%s %s)",


But Johannes can add it I guess.



Re: [PATCH v4 1/1] nvme: trace: add disk name to tracepoints

2018-06-19 Thread Sagi Grimberg




On 06/11/2018 09:46 PM, Johannes Thumshirn wrote:

Add disk name to tracepoints so we can better distinguish between
individual disks in the trace output.

Signed-off-by: Johannes Thumshirn 

---

I'm not entirely sure if this adding the ctrl pointers to
nvme_complete_rq() and nvme_setup_cmd() is a good idea, it's the
fast-path after all.


We are going to need it for traffic based keep alive to update
that we saw a completion to extend the kato.

But I suggest you simply keep a ctrl reference in struct nvme_request
instead so you don't need to pass it to nvme_complete_req (that's what
I did for traffic based keep alive).


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-30 Thread Sagi Grimberg

Hi Folks,

I'm sorry to chime in super late on this, but a lot has been
going on for me lately which got me off the grid.

So I'll try to provide my input hopefully without starting any more
flames..


This patch series aims to provide a more fine grained control over
nvme's native multipathing, by allowing it to be switched on and off
on a per-subsystem basis instead of a big global switch.


No.  The only reason we even allowed to turn multipathing off is
because you complained about installer issues.  The path forward
clearly is native multipathing and there will be no additional support
for the use cases of not using it.


We all basically knew this would be your position.  But at this year's
LSF we pretty quickly reached consensus that we do in fact need this.
Except for yourself, Sagi and afaik Martin George: all on the cc were in
attendance and agreed.


Correction, I wasn't able to attend LSF this year (unfortunately).


And since then we've exchanged mails to refine and test Johannes'
implementation.

You've isolated yourself on this issue.  Please just accept that we all
have a pretty solid command of what is needed to properly provide
commercial support for NVMe multipath.

The ability to switch between "native" and "other" multipath absolutely
does _not_ imply anything about the winning disposition of native vs
other.  It is purely about providing commercial flexibility to use
whatever solution makes sense for a given environment.  The default _is_
native NVMe multipath.  It is on userspace solutions for "other"
multipath (e.g. multipathd) to allow user's to whitelist an NVMe
subsystem to be switched to "other".

Hopefully this clarifies things, thanks.


Mike, I understand what you're saying, but I also agree with hch on
the simple fact that this is a burden on linux nvme (although I'm less
passionate about it than hch).


Beyond that, this is going to get much worse when we support "dispersed
namespaces" which is a submitted TPAR in the NVMe TWG. "dispersed
namespaces" makes NVMe namespaces share-able over different subsystems
so changing the personality on a per-subsystem basis is just asking for
trouble.

Moreover, I also wanted to point out that fabrics array vendors are
building products that rely on standard nvme multipathing (and probably
multipathing over dispersed namespaces as well), and keeping a knob that
will keep nvme users with dm-multipath will probably not help them
educate their customers as well... So there is another angle to this.

Don't get me wrong, I do support your cause, and I think nvme should try
to help, I just think that subsystem granularity is not the correct
approach going forward.

As I said, I've been off the grid, can you remind me why global knob is
not sufficient?

This might sound stupid to you, but can't users that desperately need to
keep using dm-multipath (for its mature toolset or what-not) just stack
it on top of the multipath nvme device? (I might be completely off on this so
feel free to correct my ignorance).


Re: [PATCH] nvme: trace: add disk name to tracepoints

2018-05-30 Thread Sagi Grimberg

Add disk name to tracepoints so we can better distinguish between
individual disks in the trace output.


Looks useful,

Reviewed-by: Sagi Grimberg 


Re: [PATCH -next] nvmet: fix a typo in nvmet_file_ns_enable()

2018-05-30 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 1/3] nvme: provide a way to disable nvme mpath per subsystem

2018-05-31 Thread Sagi Grimberg




@@ -246,3 +246,27 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
blk_cleanup_queue(head->disk->queue);
put_disk(head->disk);
  }
+
+int nvme_mpath_change_personality(struct nvme_subsystem *subsys)
+{
+   struct nvme_ctrl *ctrl;
+   int ret = 0;
+
+restart:
+   mutex_lock(&subsys->lock);
+   list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+   if (!list_empty(&ctrl->namespaces)) {
+   mutex_unlock(&subsys->lock);
+   nvme_remove_namespaces(ctrl);


This looks completely broken. Any of these namespaces can have an
active handle on it.


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-31 Thread Sagi Grimberg




Wouldn't expect you guys to nurture this 'mpath_personality' knob.  So
when features like "dispersed namespaces" land, a negative check would
need to be added in the code to prevent switching from "native".

And once something like "dispersed namespaces" lands we'd then have to
see about a more sophisticated switch that operates at a different
granularity.  Could also be that switching one subsystem that is part of
"dispersed namespaces" would then cascade to all other associated
subsystems?  Not that dissimilar from the 3rd patch in this series that
allows a 'device' switch to be done in terms of the subsystem.


Which I think is broken by allowing to change this personality on the
fly.



Anyway, I don't know the end from the beginning on something you just
told me about ;)  But we're all in this together.  And can take it as it
comes.


I agree but this will be exposed to user-space and we will need to live
with it for a long long time...


I'm merely trying to bridge the gap from old dm-multipath while
native NVMe multipath gets its legs.

In time I really do have aspirations to contribute more to NVMe
multipathing.  I think Christoph's NVMe multipath implementation of
bio-based device ontop on NVMe core's blk-mq device(s) is very clever
and effective (blk_steal_bios() hack and all).


That's great.


Don't get me wrong, I do support your cause, and I think nvme should try
to help, I just think that subsystem granularity is not the correct
approach going forward.


I understand there will be limits to this 'mpath_personality' knob's
utility and it'll need to evolve over time.  But the burden of making
more advanced NVMe multipath features accessible outside of native NVMe
isn't intended to be on any of the NVMe maintainers (other than maybe
remembering to disallow the switch where it makes sense in the future).


I would expect that any "advanced multipath features" would be properly
brought up with the NVMe TWG as a ratified standard and find their way
into nvme. So I don't think this in particular is a valid argument.


As I said, I've been off the grid, can you remind me why global knob is
not sufficient?


Because once nvme_core.multipath=N is set: native NVMe multipath is then
not accessible from the same host.  The goal of this patchset is to give
users choice.  But not limit them to _only_ using dm-multipath if they
just have some legacy needs.

Tough to be convincing with hypotheticals but I could imagine a very
obvious usecase for native NVMe multipathing be PCI-based embedded NVMe
"fabrics" (especially if/when the numa-based path selector lands).  But
the same host with PCI NVMe could be connected to a FC network that has
historically always been managed via dm-multipath.. but say that
FC-based infrastructure gets updated to use NVMe (to leverage a wider
NVMe investment, whatever?) -- but maybe admins would still prefer to
use dm-multipath for the NVMe over FC.


You are referring to an array exposing media via nvmf and scsi
simultaneously? I'm not sure that there is a clean definition of
how that is supposed to work (ANA/ALUA, reservations, etc..)


This might sound stupid to you, but can't users that desperately need to
keep using dm-multipath (for its mature toolset or what-not) just
stack it on top of the multipath nvme device? (I might be completely off on
this so feel free to correct my ignorance).


We could certainly pursue adding multipath-tools support for native NVMe
multipathing.  Not opposed to it (even if just reporting topology and
state).  But given the extensive lengths NVMe multipath goes to hide
devices we'd need some way to piercing through the opaque nvme device
that native NVMe multipath exposes.  But that really is a tangent
relative to this patchset.  Since that kind of visibility would also
benefit the nvme cli... otherwise how are users to even be able to trust
but verify that native NVMe multipathing did what they expected it to?


Can you explain what is missing for multipath-tools to resolve topology?

nvme list-subsys is doing just that, isn't it? It lists subsys-ctrl
topology but that is sort of the important information as controllers
are the real paths.


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-31 Thread Sagi Grimberg




Moreover, I also wanted to point out that fabrics array vendors are
building products that rely on standard nvme multipathing (and probably
multipathing over dispersed namespaces as well), and keeping a knob that
will keep nvme users with dm-multipath will probably not help them
educate their customers as well... So there is another angle to this.


Noticed I didn't respond directly to this aspect.  As I explained in
various replies to this thread: The users/admins would be the ones who
would decide to use dm-multipath.  It wouldn't be something that'd be
imposed by default.  If anything, the all-or-nothing
nvme_core.multipath=N would pose a much more serious concern for these
array vendors that do have designs to specifically leverage native NVMe
multipath.  Because if users were to get into the habit of setting that
on the kernel commandline they'd literally _never_ be able to leverage
native NVMe multipathing.

We can also add multipath.conf docs (man page, etc) that caution admins
to consult their array vendors about whether using dm-multipath is to be
avoided, etc.

Again, this is opt-in, so on a upstream Linux kernel level the default
of enabling native NVMe multipath stands (provided CONFIG_NVME_MULTIPATH
is configured).  Not seeing why there is so much angst and concern about
offering this flexibility via opt-in but I'm also glad we're having this
discussion to have our eyes wide open.


I think that the concern is valid and should not be dismissed. And
at times flexibility is a real source of pain, both to users and
developers.

The choice is there, no one is forbidden to use multipath. I'm just
still not sure exactly why the subsystem granularity is an absolute
must other than a volume exposed as a nvmf namespace and scsi lun (how
would dm-multipath detect this is the same device btw?)


Re: [PATCHv2 1/5] nvme: fix lockdep warning in nvme_mpath_clear_current_path

2018-05-31 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 2/5] nvme: don't hold nvmf_transports_rwsem for more than transport lookups

2018-05-31 Thread Sagi Grimberg




diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 7ae732a77fe8..febf82639b40 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -957,16 +957,17 @@ nvmf_create_ctrl(struct device *dev, const char *buf, 
size_t count)
  
  	down_read(&nvmf_transports_rwsem);

ops = nvmf_lookup_transport(opts);
+   up_read(&nvmf_transports_rwsem);


And what protects us from the transport getting unregister right here
from anothet thread waiting to acquire nvmf_transports_rwsem?


I think having the module_get inside as well would protect against it.


Re: [PATCH 3/5] nvme: call nvmf_create_ctrl before checking for duplicate assignment

2018-05-31 Thread Sagi Grimberg




In general it seems like fc loop needs to offload any I/O to a workqueue
just like nvme-loop does, but even then I can't see how that is going to
cause an issue in this area.


Hmm I'll be looking into it.


FWIW, I agree it should do that as well.


Re: [PATCH 4/5] nvmet: use atomic allocations when allocating fc requests

2018-05-31 Thread Sagi Grimberg




diff --git a/drivers/nvme/target/fcloop.c b/drivers/nvme/target/fcloop.c
index 34712def81b1..d2209c60f95f 100644
--- a/drivers/nvme/target/fcloop.c
+++ b/drivers/nvme/target/fcloop.c
@@ -509,7 +509,7 @@ fcloop_fcp_req(struct nvme_fc_local_port *localport,
if (!rport->targetport)
return -ECONNREFUSED;
  
-	tfcp_req = kzalloc(sizeof(*tfcp_req), GFP_KERNEL);

+   tfcp_req = kzalloc(sizeof(*tfcp_req), GFP_ATOMIC);


Question: why isn't tfcp_req embedded in fcpreq? Don't they have
the same lifetime?


if (!tfcp_req)
return -ENOMEM;
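
For illustration, the embedding could look roughly like this (sketch
only, assuming the host request's LLDD-private area can carry the
target-side state; struct name hypothetical):

/* no separate allocation: the target-side state lives in the private
 * area of the host-side request, so both share the same lifetime */
struct fcloop_ini_fcpreq {
	struct nvmefc_fcp_req	*fcpreq;
	struct fcloop_fcpreq	tfcp_req;	/* embedded */
};

	/* in fcloop_fcp_req(): */
	struct fcloop_ini_fcpreq *inireq = fcpreq->private;
	struct fcloop_fcpreq *tfcp_req = &inireq->tfcp_req;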
  



Re: [PATCH 5/5] nvmet: fcloop: use irqsave spinlocks

2018-05-31 Thread Sagi Grimberg




Lockdep complains about inconsistent hardirq states when using
nvme-fcloop. So use the irqsave variant of spin_locks for the
nvmefc_fcp_req's reqlock.


Is this because the nvmet bdev completion can trigger the rx path in
interrupt context? I don't understand exactly what is violated.



Here's the lockdep report:
[   11.138417] 
[   11.139085] WARNING: inconsistent lock state
[   11.139207] nvmet: creating controller 2 for subsystem blktests-subsystem-1 
for NQN nqn.2014-08.org.nvmexpress:uuid:046394e7-bff7-4403-9502-1816800efaa0.
[   11.139727] 4.17.0-rc5+ #883 Not tainted
[   11.139732] 
[   11.11] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[   11.145341] swapper/8/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
[   11.146108] (ptrval) (&(&tfcp_req->reqlock)->rlock){?.+.}, at: 
fcloop_fcp_op+0x20/0x1d0 [nvme_fcloop]
[   11.148633] {HARDIRQ-ON-W} state was registered at:
[   11.149380]   _raw_spin_lock+0x34/0x70
[   11.149940]   fcloop_fcp_recv_work+0x17/0x90 [nvme_fcloop]
[   11.150746]   process_one_work+0x1d8/0x5c0
[   11.151346]   worker_thread+0x45/0x3e0
[   11.151886]   kthread+0x101/0x140
[   11.152370]   ret_from_fork+0x3a/0x50
[   11.152903] irq event stamp: 3
[   11.153402] hardirqs last  enabled at (36663): [] 
default_idle+0x13/0x180
[   11.154601] hardirqs last disabled at (36664): [] 
interrupt_entry+0xc4/0xe0
[   11.155817] softirqs last  enabled at (3): [] 
irq_enter+0x59/0x60
[   11.156952] softirqs last disabled at (36665): [] 
irq_enter+0x3e/0x60
[   11.158092]
[   11.158092] other info that might help us debug this:
[   11.159436]  Possible unsafe locking scenario:
[   11.159436]
[   11.160299]CPU0
[   11.160663]
[   11.161026]   lock(&(&tfcp_req->reqlock)->rlock);
[   11.161709]   
[   11.162091] lock(&(&tfcp_req->reqlock)->rlock);
[   11.163148]
[   11.163148]  *** DEADLOCK ***
[   11.163148]
[   11.164007] no locks held by swapper/8/0.
[   11.164596]
[   11.164596] stack backtrace:
[   11.165238] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.17.0-rc5+ #883
[   11.166180] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.0.0-prebuilt.qemu-project.org 04/01/2014
[   11.167673] Call Trace:
[   11.168037]  
[   11.168349]  dump_stack+0x78/0xb3
[   11.168864]  print_usage_bug+0x1ed/0x1fe
[   11.169440]  ? check_usage_backwards+0x110/0x110
[   11.170111]  mark_lock+0x263/0x2a0
[   11.170995]  __lock_acquire+0x675/0x1870
[   11.171865]  ? __lock_acquire+0x48d/0x1870
[   11.172460]  lock_acquire+0xd4/0x220
[   11.172981]  ? fcloop_fcp_op+0x20/0x1d0 [nvme_fcloop]
[   11.173709]  _raw_spin_lock+0x34/0x70
[   11.174243]  ? fcloop_fcp_op+0x20/0x1d0 [nvme_fcloop]
[   11.174978]  fcloop_fcp_op+0x20/0x1d0 [nvme_fcloop]
[   11.175673]  nvmet_fc_transfer_fcp_data+0x9b/0x130 [nvmet_fc]
[   11.176507]  nvmet_req_complete+0x10/0x110 [nvmet]
[   11.177210]  nvmet_bio_done+0x23/0x40 [nvmet]
[   11.177837]  blk_update_request+0xab/0x3b0
[   11.178434]  blk_mq_end_request+0x13/0x60
[   11.179033]  flush_smp_call_function_queue+0x58/0x150
[   11.179755]  smp_call_function_single_interrupt+0x49/0x260
[   11.180531]  call_function_single_interrupt+0xf/0x20
[   11.181236]  
[   11.181542] RIP: 0010:native_safe_halt+0x2/0x10
[   11.182186] RSP: 0018:a481006cbec0 EFLAGS: 0206 ORIG_RAX: 
ff04
[   11.183265] RAX: 9f54f8b86200 RBX: 9f54f8b86200 RCX: 
[   11.184264] RDX: 9f54f8b86200 RSI: 0001 RDI: 9f54f8b86200
[   11.185270] RBP: 0008 R08:  R09: 
[   11.186271] R10: 0001 R11:  R12: 
[   11.187280] R13:  R14: 9f54f8b86200 R15: 9f54f8b86200
[   11.188280]  default_idle+0x18/0x180
[   11.188806]  do_idle+0x176/0x260
[   11.189273]  cpu_startup_entry+0x5a/0x60
[   11.189832]  start_secondary+0x18f/0x1b0

Signed-off-by: Johannes Thumshirn 
Cc: James Smart 
---
  drivers/nvme/target/fcloop.c | 52 
  1 file changed, 29 insertions(+), 23 deletions(-)

diff --git a/drivers/nvme/target/fcloop.c b/drivers/nvme/target/fcloop.c
index d2209c60f95f..f4a3700201c1 100644
--- a/drivers/nvme/target/fcloop.c
+++ b/drivers/nvme/target/fcloop.c
@@ -406,9 +406,10 @@ fcloop_fcp_recv_work(struct work_struct *work)
container_of(work, struct fcloop_fcpreq, fcp_rcv_work);
struct nvmefc_fcp_req *fcpreq = tfcp_req->fcpreq;
int ret = 0;
+   unsigned long flags;
bool aborted = false;
  
-	spin_lock(&tfcp_req->reqlock);

+   spin_lock_irqsave(&tfcp_req->reqlock, flags);
switch (tfcp_req->inistate) {
case INI_IO_START:
tfcp_req->inistate = INI_IO_ACTIVE;
@@ -417,11 +418,11 @@ fcloop_fcp_recv_work(struct work_struct *work)
aborted = true;
break;
default:
-   spin_unlock(&tfcp_req->reqlock);
+ 

Re: [PATCH -next] nvmet: fix error return code in nvmet_file_ns_enable()

2018-05-31 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-03 Thread Sagi Grimberg




I'm aware that most everything in multipath.conf is SCSI/FC specific.
That isn't the point.  dm-multipath and multipathd are an existing
framework for managing multipath storage.

It could be made to work with NVMe.  But yes it would not be easy.
Especially not with the native NVMe multipath crew being so damn
hostile.


The resistance is not a hostile act. Please try and keep the
discussion technical.


But I don't think the burden of allowing multipathd/DM to inject
themselves into the path transition state machine has any benefit
whatsoever to the user. It's only complicating things and therefore we'd
be doing people a disservice rather than a favor.


This notion that only native NVMe multipath can be successful is utter
bullshit.  And the mere fact that I've gotten such a reaction from a
select few speaks to some serious control issues.

Imagine if XFS developers just one day imposed that it is the _only_
filesystem that can be used on persistent memory.

Just please dial it back.. seriously tiresome.


Mike, you make a fair point on multipath tools being more mature
compared to NVMe multipathing. But this is not the discussion at all (at
least not from my perspective). There was not a single use-case that
gave a clear-cut justification for a per-subsystem personality switch
(other than some far fetched imaginary scenarios). This is not unusual
for the kernel community not to accept things with little to no use,
especially when it involves exposing a userspace ABI.

As for now, all I see is a disclaimer saying that it'd need to be
nurtured over time as the NVMe spec evolves.

Can you (or others) please try and articulate why a "fine grained"
multipathing is an absolute must? At the moment, I just don't
understand.

Also, I get your point that exposing state/stats information to
userspace is needed. That's a fair comment.


Re: [PATCH v2] Bugfix for handling of shadow doorbell buffer.

2018-08-16 Thread Sagi Grimberg




diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 17a0190bd88f..4452f8553301 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -306,6 +306,14 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, 
u32 *dbbuf_db,
old_value = *dbbuf_db;
*dbbuf_db = value;
  
+		/*

+* Ensure that the doorbell is updated before reading
+* the EventIdx from memory. NVMe controller should have
+* similar ordering guarantees - update EventIdx before
+* reading doorbell.
+*/
+   mb();
+
if (!nvme_dbbuf_need_event(*dbbuf_ei, value, old_value))
return false;
}
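
For reference, the EventIdx check that this barrier orders against is
roughly the following (same idiom as virtio's vring_need_event()):

static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
{
	/* ring the real doorbell only if the device's advertised EventIdx
	 * lies within the window of tail values we just moved past */
	return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
}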


Reviewed-by: Sagi Grimberg 


Re: [PATCH] scatterlist: Update size type to support greater then 4GB size.

2018-12-11 Thread Sagi Grimberg




  struct nvme_sgl_desc {
__le64  addr;
-   __le32  length;
+   __le64  length;
__u8rsvd[3];
__u8type;
  };


In what world does changing a wire protocol for this make sense?

please get rid of this hunk, NVMe will never cross the 32 bit sg element 
size.


Re: [PATCH] scatterlist: Update size type to support greater then 4GB size.

2018-12-11 Thread Sagi Grimberg




  struct nvme_sgl_desc {
__le64  addr;
-   __le32  length;
+   __le64  length;
__u8rsvd[3];
__u8type;
  };


Isn't this a device or protocol defined datastructure?  You can't just
change it like this.


You're correct, we can't...
[Replied before seeing this issue was already highlighted]

The positive side is that it can safely be removed without affecting the
rest of the patch...


Re: [PATCH v4 2/2] trace nvme submit queue status

2018-12-17 Thread Sagi Grimberg




@@ -899,6 +900,10 @@ static inline void nvme_handle_cqe(struct nvme_queue 
*nvmeq, u16 idx)
}
  
  	req = blk_mq_tag_to_rq(*nvmeq->tags, cqe->command_id);

+   trace_nvme_sq(req->rq_disk,
+   nvmeq->qid,
+   le16_to_cpu(cqe->sq_head),
+   nvmeq->sq_tail);


Why the newline escapes? why not escape at the 80 char border?
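
i.e. something along the lines of:

	trace_nvme_sq(req->rq_disk, nvmeq->qid, le16_to_cpu(cqe->sq_head),
		      nvmeq->sq_tail);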

Other than that, looks fine,

Reviewed-by: Sagi Grimberg 


Re: [PATCH v4 1/2] export trace.c helper functions to other modules

2018-12-17 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH][next] nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"

2018-12-14 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-11 Thread Sagi Grimberg

I cannot reproduce the bug with the patch; in my failure scenarios, it
seems that completing the request on errors in nvme_rdma_send_done
makes __nvme_submit_sync_cmd unblock. Also, I think this is
safe from double completions.

However, it seems that nvme_rdma_timeout code is still not free from
the double completion problem. So, it looks promising to me if you
could separate out the nvme_rdma_wr_error handling code as a new
patch.


Guys, can you please send proper patches so we can review properly?


Re: [PATCH] nvme-core: don't initlialize ctrl->cntlid twice

2019-01-08 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 06/16] blk-mq: add 'type' attribute to the sysfs hctx directory

2018-10-30 Thread Sagi Grimberg




It can be useful for a user to verify what type a given hardware
queue is, expose this information in sysfs.


I assume the user is expected to understand the meaning of the
enumeration they see when accessing this field, correct?

Would be nice to output some meaningful string, but I'm not
yet sure how to do that...

Other than that,
Reviewed-by: Sagi Grimberg 


Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"

2018-10-23 Thread Sagi Grimberg





I also commented over there regarding this specific problem.

I think my wording was inaccurate perhaps. It's not so much direct
recursion, but a dependency chain connecting the handler_mutex class
back to itself. This is how lockdep finds problems, it doesn't consider
actual *instances* but only *classes*.

As Sagi mentioned on the other thread, in this specific instance you
happen to know out-of-band that this is safe, (somehow) the code ensures
that you can't actually recursively connect back to yourself. Perhaps
there's some kind of layering involved, I don't know. (*)

As I said over there, this type of thing can be solved with
mutex_lock_nested(), though it's possible that in this case it should
really be flush_workqueue_nested() or flush_work_nested() or so.


I guess that could make this specific complaint go away...


johannes

(*) note that just checking "obj != self" might not be enough, depending
on how you pick candidates for "obj" you might recurse down another
level and end up back at yourself, just two levels down, and actually
have a dependency chain leading back to yourself.


It's not a simple check: any removal invocation is serialized under a
dedicated workqueue, and in order to be queued there the connection first
needs to be established, which only happens later (after the flush takes place).
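
If we still wanted to silence lockdep with an annotation along the lines
suggested above, a minimal sketch (for the handler_mutex case, untested)
would be:

	/* hypothetical: mark the inner acquisition as a different nesting
	 * level, since it provably belongs to a different priv instance */
	mutex_lock_nested(&other_priv->handler_mutex, SINGLE_DEPTH_NESTING);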


Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"

2018-10-22 Thread Sagi Grimberg




I must also say that I'm disappointed you'd try to do things this way.
I'd be (have been?) willing to actually help you understand the problem
and add the annotations, but rather than answer my question ("where do I
find the right git tree"!) you just send a revert patch.


Sorry that I had not yet provided that information. You should have
received this information through another e-mail thread. See also
http://lists.infradead.org/pipermail/linux-nvme/2018-October/020493.html.


To do that, you have to understand what recursion is valid (I'm guessing
there's some sort of layering involved), and I'm far from understanding
anything about the code that triggered this report.


I don't think there is any kind of recursion involved in the NVMe code
that triggered the lockdep complaint. Sagi, please correct me if I got this
wrong.


I commented on the original thread. I'm not sure it qualifies as a
recursion, but in that use-case, when priv->handler_mutex is taken
it is possible that other priv->handler_mutex instances are taken but
are guaranteed not to belong to that priv...


Re: [PATCH net-next v12 04/14] mm: page_frag: add '_va' suffix to page_frag API

2024-08-03 Thread Sagi Grimberg

Regardless of the API discussion,

The nvme-tcp bits look straight-forward:
Acked-by: Sagi Grimberg 



Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-15 Thread Sagi Grimberg




blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request which
can be fatal.


How did you get to this one? did the controller send a completion for
a completed/bogus request?


If that is the case, then that must mean it's possible the driver could
have started the command id just before the bogus completion check. Data
corruption, right?


Yes, which is why I don't think this check is very useful..


I actually view that as a valid protection against spoofed frames.
Without it it's easy to crash the machine by injecting fake completions 
with random command ids.


And this doesn't help because the command can have been easily reused
and started... What is this protecting against? Note that none of the
other transports check that, so why should tcp?
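
For reference, the check being proposed is roughly this (my paraphrase,
untested):

	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
	if (!rq || !blk_mq_request_started(rq)) {
		dev_err(queue->ctrl->ctrl.device,
			"completion for tag %#x which is not started\n",
			cqe->command_id);
		return -EINVAL;
	}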


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-15 Thread Sagi Grimberg




blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request which
can be fatal.


How did you get to this one? did the controller send a completion for
a completed/bogus request?


If that is the case, then that must mean it's possible the driver could
have started the command id just before the bogus completion check. Data
corruption, right?


'during TCP LIF toggles and aggr relocates' testing the host
crashes. TBH, I do not really know what is happening or what the test
does. Still trying to figure out what's going on.


Well, I think we should probably figure out why that is happening first.


I was just very
surprised how much the code trusts the other side to behave correctly.


What does pci/rdma/fc does differently? What does scsi do here?


Yes, which is why I don't think this check is very useful..


I actually view that as a valid protection against spoofed frames.
Without it it's easy to crash the machine by injecting fake completions with
random command ids.


In this test scenario it's not even a spoofed frame; maybe just a
confused controller.


Maybe... I am still not sure how this patch helps here though...


Re: [target:nvme_of 3/3] drivers/target/nvme_of/nvme_of_configfs.c:25:31: sparse: symbol 'nvme_of_fabric_configfs' was not declared. Should it be static?

2015-01-07 Thread Sagi Grimberg

On 1/7/2015 12:44 AM, Nicholas A. Bellinger wrote:

On Wed, 2015-01-07 at 03:35 +0800, kbuild test robot wrote:

tree:   git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git 
nvme_of
head:   40d8c11927282d59855d645b35798edd97828da5
commit: 40d8c11927282d59855d645b35798edd97828da5 [3/3] nvme_of: Initial 
skeleton commit
reproduce:
   # apt-get install sparse
   git checkout 40d8c11927282d59855d645b35798edd97828da5
   make ARCH=x86_64 allmodconfig
   make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)


drivers/target/nvme_of/nvme_of_configfs.c:25:31: sparse: symbol 
'nvme_of_fabric_configfs' was not declared. Should it be static?


Please review and possibly fold the followup patch.



Fixed.  Thanks Fengguang!



Hey Nic,

Hope all is well.

So this skeleton is interesting to me. As this is a fabric module I
assume this would be the NVMEoFabrics target mode driver for upstream 
which would be able to talk to NVMEoFabrics initiator (which by
the way makes LIO something more than a SCSI target - will this be 
accepted?).


I'm currently participating in nvmexpress working group defining the
standard of protocol, specifically in the context of RDMA (naturally).

I'm interested in taking an active role in this project, It is
important to build this layered from day one - separating the fabric 
logic (RDMA or FCoE) from the core layer.


Moreover, I know that Mellanox has some plans on accelerating this
area in future devices and we are currently looking into ways to
do that. I want to see that the SW will be able to support that.

So, Thoughts?

Sagi.


Re: [target:nvme_of 3/3] drivers/target/nvme_of/nvme_of_configfs.c:25:31: sparse: symbol 'nvme_of_fabric_configfs' was not declared. Should it be static?

2015-01-07 Thread Sagi Grimberg

>> On Wed, 2015-01-07 at 10:57 +0200, Sagi Grimberg wrote:
>>> On 1/7/2015 12:44 AM, Nicholas A. Bellinger wrote:
>>>> On Wed, 2015-01-07 at 03:35 +0800, kbuild test robot wrote:
>>>> tree:   
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git 
>>>> nvme_of
>>>> head:   40d8c11927282d59855d645b35798edd97828da5
>>>> commit: 40d8c11927282d59855d645b35798edd97828da5 [3/3] nvme_of: Initial 
>>>> skeleton commit
>>>> reproduce:
>>>>   # apt-get install sparse
>>>>   git checkout 40d8c11927282d59855d645b35798edd97828da5
>>>>   make ARCH=x86_64 allmodconfig
>>>>   make C=1 CF=-D__CHECK_ENDIAN__
>>>> 
>>>> 
>>>> sparse warnings: (new ones prefixed by >>)
>>>> 
>>>>>> drivers/target/nvme_of/nvme_of_configfs.c:25:31: sparse: symbol 
>>>>>> 'nvme_of_fabric_configfs' was not declared. Should it be static?
>>>> 
>>>> Please review and possibly fold the followup patch.
>>> 
>>> Fixed.  Thanks Fengguang!
>> 
>> Hey Nic,
>> 
>> Hope all is well.
>> 
>> So this skeleton is interesting to me. As this is a fabric module I
>> assume this would be the NVMEoFabrics target mode driver for upstream 
>> which would be able to talk to NVMEoFabrics initiator (which by
>> the way makes LIO something more than a SCSI target - will this be 
>> accepted?).
> 
> That is currently the plan.
> 
> It's still unclear how this will work wrt to existing target code, but
> given the amount of logic that NVMe is borrowing from SCSI on the
> command set side (PR, ALUA, VAAI, DIF), I'm sure there is an opportunity
> for code sharing between NVMe and SCSI based fabrics.

Right, but I assume there will be a different method to provision this stuff 
given that the target exposes NVME SQs.

> 
>> I'm currently participating in nvmexpress working group defining the
>> standard of protocol, specifically in the context of RDMA (naturally).
> 
> Glad to hear that your involved.  ;)

Should say passively involved, at the moment I'm just following this stuff.

> 
>> I'm interested in taking an active role in this project, It is
>> important to build this layered from day one - separating the fabric 
>> logic (RDMA or FCoE) from the core layer.
>> 
>> Moreover, I know that Mellanox has some plans on accelerating this
>> area in future devices and we are currently looking into ways to
>> do that. I want to see that the SW will be able to support that.
>> 
>> So, Thoughts?
> 
> FYI, Kiran and Dave (CC'ed) from Intel will be driving upstream NVME-OF
> target development.
> 
> I'll be primarily focused on figuring out how NVMe and SCSI can play
> nicely in target-core, and helping out on the NVMe-OF side where
> necessary.

Is there something open that I can start playing around with? What is the 
status here? I imagine that the initiator is developed as well...

Cheers,
Sagi.


Re: [RFC 0/2] target: Add TFO->complete_irq queue_work bypass

2015-06-04 Thread Sagi Grimberg

On 6/4/2015 10:06 AM, Nicholas A. Bellinger wrote:

On Wed, 2015-06-03 at 14:57 +0200, Christoph Hellwig wrote:

This makes lockdep very unhappy, rightly so.  If you execute
one end_io function inside another you basіcally nest every possible
lock taken in the I/O completion path.  Also adding more work
to the hardirq path generally isn't a smart idea.  Can you explain
what issues you were seeing and how much this helps?  Note that
the workqueue usage in the target core so far is fairly basic, so
there should some low hanging fruit.


So I've been using tcm_loop + RAMDISK backends for prototyping, but this
patch is intended for vhost-scsi so it can avoid the unnecessary
queue_work() context switch within target_complete_cmd() for all backend
driver types.

This is because vhost_work_queue() is just updating vhost_dev->work_list
and immediately wake_up_process() into a different vhost_worker()
process context.  For heavy small block workloads into fast IBLOCK
backends, avoiding this extra context switch should be a nice efficiency
win.


I can see that, did you get a chance to measure the expected latency
improvement?



Also, AFAIK RDMA fabrics are allowed to do ib_post_send() response
callbacks directly from IRQ context as well.


This is correct in general, ib_post_send is not allowed to schedule.
isert/srpt might benefit here in latency, but it would require the
drivers to pre-allocate the sgls (ib_sge's) and use a worst-case
approach (or use GFP_ATOMIC allocations - I'm not sure which is
better...)


Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

2018-03-01 Thread Sagi Grimberg



Note that we originally allocates irqs this way, and Keith changed
it a while ago for good reasons.  So I'd really like to see good
reasons for moving away from this, and some heuristics to figure
out which way to use.  E.g. if the device supports more irqs than
I/O queues your scheme might always be fine.


I still don't understand what this buys us in practice. Seems redundant
to allocate another vector without any (even marginal) difference.


Re: [PATCH v2 06/10] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-03-01 Thread Sagi Grimberg



On 03/01/2018 01:40 AM, Logan Gunthorpe wrote:

In order to use PCI P2P memory pci_p2pmem_[un]map_sg() functions must be
called to map the correct DMA address. To do this, we add a flags
variable and the RDMA_RW_CTX_FLAG_PCI_P2P flag. When the flag is
specified use the appropriate map function.

Signed-off-by: Logan Gunthorpe 
---
  drivers/infiniband/core/rw.c| 21 +
  drivers/infiniband/ulp/isert/ib_isert.c |  5 +++--
  drivers/infiniband/ulp/srpt/ib_srpt.c   |  7 ---
  drivers/nvme/target/rdma.c  |  6 +++---
  include/rdma/rw.h   |  7 +--
  net/sunrpc/xprtrdma/svc_rdma_rw.c   |  6 +++---
  6 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index c8963e91f92a..775a9f8b15a6 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -12,6 +12,7 @@
   */
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -269,18 +270,24 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,

   * @remote_addr:remote address to read/write (relative to @rkey)
   * @rkey: remote key to operate on
   * @dir:  %DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ * @flags:  any of the RDMA_RW_CTX_FLAG_* flags
   *
   * Returns the number of WQEs that will be needed on the workqueue if
   * successful, or a negative error code.
   */
  int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
-   u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+   u64 remote_addr, u32 rkey, enum dma_data_direction dir,
+   unsigned int flags)
  {
struct ib_device *dev = qp->pd->device;
int ret;
  
-	ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);

+   if (flags & RDMA_RW_CTX_FLAG_PCI_P2PDMA)
+   ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
+   else
+   ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+


Why not use is_pci_p2pdma_page(sg) instead of a flag? It would be so
much cleaner...
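
i.e. keying off the page itself, roughly (same callees as in the hunk
above, untested):

	if (is_pci_p2pdma_page(sg_page(sg)))
		ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
	else
		ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);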


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Sagi Grimberg



Hi Everyone,


Hi Logan,


Here's v2 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.16-rc3 which already
includes Christoph's devpagemap work the previous version was based
off as well as a couple of the cleanup patches that were in v1.

Additionally, we've made the following changes based on feedback:

* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
   as a bunch of cleanup and spelling fixes he pointed out in the last
   series.

* To address Alex's ACS concerns, we change to a simpler method of
   just disabling ACS behind switches for any kernel that has
   CONFIG_PCI_P2PDMA.

* We also reject using devices that employ 'dma_virt_ops' which should
   fairly simply handle Jason's concerns that this work might break with
   the HFI, QIB and rxe drivers that use the virtual ops to implement
   their own special DMA operations.


That's good, but what would happen for these devices? Simply fail the
mapping, causing the ULP to fail its RDMA operation? I would think
that we need a capability flag for devices that support it.


Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Sagi Grimberg



We create a configfs attribute in each nvme-fabrics target port to
enable p2p memory use. When enabled, the port will only then use the
p2p memory if a p2p memory device can be found which is behind the
same switch as the RDMA port and all the block devices in use. If
the user enabled it an no devices are found, then the system will
silently fall back on using regular memory.

If appropriate, that port will allocate memory for the RDMA buffers
for queues from the p2pmem device falling back to system memory should
anything fail.


Nice.


Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of it's own memory. However, at this time, cards with CMB buffers
don't seem to be available.


Can you describe what would be the plan to have it when these devices
do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
and not from nvmet_ctrl. Then, when cmb capable devices come along, the
ns can prefer to use its own cmb instead of locating a p2p_dev device?


+static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
+  struct nvmet_ns *ns)
+{
+   int ret;
+
+   if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
+   pr_err("peer-to-peer DMA is not supported by %s\n",
+  ns->device_path);
+   return -EINVAL;


I'd say that just skip it instead of failing it, in theory one can
connect nvme devices via p2p mem and expose other devices in the
same subsystem. The message would be pr_debug to reduce chattiness.


+   }
+
+   ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
+   if (ret)
+   pr_err("failed to add peer-to-peer DMA client %s: %d\n",
+  ns->device_path, ret);
+
+   return ret;
+}
+
  int nvmet_ns_enable(struct nvmet_ns *ns)
  {
struct nvmet_subsys *subsys = ns->subsys;
@@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
if (ret)
goto out_blkdev_put;
  
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {

+   if (ctrl->p2p_dev) {
+   ret = nvmet_p2pdma_add_client(ctrl, ns);
+   if (ret)
+   goto out_remove_clients;


Is this really a fatal failure given that we fall-back to main
memory? Why not continue with main memory (and warn at best)?


+/*
+ * If allow_p2pmem is set, we will try to use P2P memory for the SGL lists for
+ * Ι/O commands. This requires the PCI p2p device to be compatible with the
+ * backing device for every namespace on this controller.
+ */
+static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
+{
+   struct nvmet_ns *ns;
+   int ret;
+
+   if (!req->port->allow_p2pmem || !req->p2p_client)
+   return;
+
+   mutex_lock(&ctrl->subsys->lock);
+
+   ret = pci_p2pdma_add_client(&ctrl->p2p_clients, req->p2p_client);
+   if (ret) {
+   pr_err("failed adding peer-to-peer DMA client %s: %d\n",
+  dev_name(req->p2p_client), ret);
+   goto free_devices;
+   }
+
+   list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link) {
+   ret = nvmet_p2pdma_add_client(ctrl, ns);
+   if (ret)
+   goto free_devices;
+   }
+
+   ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);


This is the first p2p_dev found right? What happens if I have more than
a single p2p device? In theory I'd have more p2p memory I can use. Have
you considered making pci_p2pmem_find return the least used suitable
device?


+   if (!ctrl->p2p_dev) {
+   pr_info("no supported peer-to-peer memory devices found\n");
+   goto free_devices;
+   }
+   mutex_unlock(&ctrl->subsys->lock);
+
+   pr_info("using peer-to-peer memory on %s\n", pci_name(ctrl->p2p_dev));
+   return;
+
+free_devices:
+   pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+   mutex_unlock(&ctrl->subsys->lock);
+}
+
+static void nvmet_release_p2pmem(struct nvmet_ctrl *ctrl)
+{
+   if (!ctrl->p2p_dev)
+   return;
+
+   mutex_lock(&ctrl->subsys->lock);
+
+   pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+   pci_dev_put(ctrl->p2p_dev);
+   ctrl->p2p_dev = NULL;
+
+   mutex_unlock(&ctrl->subsys->lock);
+}
+
  u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp)
  {
@@ -800,6 +890,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char 
*hostnqn,
  
  	INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);

INIT_LIST_HEAD(&ctrl->async_events);
+   INIT_LIST_HEAD(&ctrl->p2p_clients);
  
  	memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE);

memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE);
@@ -855,6 +946,7 @@ u16 nvmet_alloc_c

Re: [PATCH v2 09/10] nvme-pci: Add a quirk for a pseudo CMB

2018-03-01 Thread Sagi Grimberg

Looks fine,

Reviewed-by: Sagi Grimberg 


Re: [PATCH v2 08/10] nvme-pci: Add support for P2P memory in requests

2018-03-01 Thread Sagi Grimberg



For P2P requests we must use the pci_p2pmem_[un]map_sg() functions
instead of the dma_map_sg functions.

With that, we can then indicate PCI_P2P support in the request queue.
For this, we create an NVME_F_PCI_P2P flag which tells the core to
set QUEUE_FLAG_PCI_P2P in the request queue.


This looks fine to me,

Reviewed-by: Sagi Grimberg 

Any plans to add the capability to nvme-rdma? Should be
straight-forward... In theory, the use-case would be an RDMA fabric
backend behind it. Shouldn't be hard to test either...


Re: [PATCH v2 05/10] block: Introduce PCI P2P flags for request and request queue

2018-03-01 Thread Sagi Grimberg

Looks fine,

Reviewed-by: Sagi Grimberg 


Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Sagi Grimberg



On 01/03/18 04:03 AM, Sagi Grimberg wrote:

Can you describe what would be the plan to have it when these devices
do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
and not from nvmet_ctrl. Then, when cmb capable devices come along, the
ns can prefer to use its own cmb instead of locating a p2p_dev device?


The patchset already supports CMB drives. That's essentially what patch 
7 is for. We change the nvme-pci driver to use the p2pmem code to 
register and manage the CMB memory. After that, it is simply available 
to the nvmet code. We have already been using this with a couple 
prototype CMB drives.


The comment was to your statement:
"Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of it's own memory. However, at this time, cards with CMB buffers
don't seem to be available."

Maybe it's a left-over which confused me...

Anyways, my question still holds. If I rack several of these
nvme drives, ideally we would use _their_ cmbs for I/O that is
directed to these namespaces. This is why I was suggesting that
p2p_dev should live in nvmet_ns and not in nvmet_ctrl as a single
p2p_dev used by all namespaces.

nvmet_ns seems like a more natural place to host p2p_dev with
cmb in mind.
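
A hypothetical sketch of the shape I have in mind (names invented, not
a proposal of actual code):

/* per-namespace p2p resolution instead of per-controller */
struct nvmet_ns_p2p {
	struct pci_dev		*p2p_dev;	/* the ns's own CMB, or a p2pmem
						 * device behind the same switch */
	struct list_head	p2p_clients;	/* ports/devices using this ns */
};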


+static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
+   struct nvmet_ns *ns)
+{
+    int ret;
+
+    if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
+    pr_err("peer-to-peer DMA is not supported by %s\n",
+   ns->device_path);
+    return -EINVAL;


I'd say that just skip it instead of failing it, in theory one can
connect nvme devices via p2p mem and expose other devices in the
same subsystem. The message would be pr_debug to reduce chattiness.


No, it's actually a bit more complicated than you make it. There's a 
couple cases:


1) The user adds a namespace but there hasn't been a connection and no 
p2pmem has been selected. In this case the add_client function is never 
called and the user can add whatever namespace they like.


2) When the first connect happens, nvmet_setup_p2pmem() will call 
add_client() for each namespace and rdma device. If any of them fail 
then it does not use a P2P device and falls back, as you'd like, to 
regular memory.


3) When the user adds a namespace after a port is in use and a 
compatible P2P device has been found. In this case, if the user tries to 
add a namespace that is not compatible with the P2P device in use then 
it fails adding the new namespace. The only alternative here is to tear 
everything down, ensure there are no P2P enabled buffers open and start 
using regular memory again... That is very difficult.




Wouldn't it all be simpler if the p2p_dev resolution were private
to the namespace?

So it seems all the namespaces in a subsystem must comply with using
p2p? That seems a little harsh if it's not absolutely needed. It would
be nice to export a subsystem between two ports (on two HCAs, across
NUMA nodes) where the home node (primary path) would use p2p and
failover would use host memory...

Can you help me understand why this is absolutely not feasible?

I also disagree that these messages should be pr_debug. If a user is 
trying to use P2P memory and they enable it but the system doesn't 
choose to use their memory they must know why that is so they can make 
the necessary adjustments. If the system doesn't at least print a dmesg 
they are in the dark.


I was arguing that the user might have intentionally wanted to use p2p
where possible but still expose other namespaces over host memory. For
this user, the messages are spam. I guess info/warn could also suffice
(assuming we allow it; if we fail it then there's no point in the debug level
discussion).




+    }
+
+    ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
+    if (ret)
+    pr_err("failed to add peer-to-peer DMA client %s: %d\n",
+   ns->device_path, ret);
+
+    return ret;
+}
+
  int nvmet_ns_enable(struct nvmet_ns *ns)
  {
  struct nvmet_subsys *subsys = ns->subsys;
@@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
  if (ret)
  goto out_blkdev_put;
+    list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+    if (ctrl->p2p_dev) {
+    ret = nvmet_p2pdma_add_client(ctrl, ns);
+    if (ret)
+    goto out_remove_clients;


Is this really a fatal failure given that we fall-back to main
memory? Why not continue with main memory (and warn at best)?


See above. It's fatal because we are already using p2p memory and we 
can't easily tear that all down when a user adds a new namespace.


In my mind, I/O to this namespace would use host memory and be done
with it. I guess I still don't understand why this is not possible.

Re: [PATCH v3] nvmet: fix nvmet_execute_write_zeroes function

2018-04-04 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH] nvme-multipath: implement active-active round-robin path selector

2018-04-04 Thread Sagi Grimberg



For PCIe devices the right policy is not a round robin but to use
the pcie device closer to the node.  I did a prototype for that
long ago and the concept can work.  Can you look into that and
also make that policy used automatically for PCIe devices?


I think that active/active makes sense for fabrics (link throughput
aggregation) but also for dual-ported pci devices (given that
this is a real use-case).

I agree that the default can be a home-node path selection.


Re: [PATCH] nvme-multipath: implement active-active round-robin path selector

2018-04-04 Thread Sagi Grimberg



@@ -85,7 +121,14 @@ static blk_qc_t nvme_ns_head_make_request(struct 
request_queue *q,
int srcu_idx;
  
  	srcu_idx = srcu_read_lock(&head->srcu);

-   ns = nvme_find_path(head);
+   switch (head->mpath_policy) {
+   case NVME_MPATH_ROUND_ROBIN:
+   ns = nvme_find_path_rr(head);
+   break;
+   case NVME_MPATH_ACTIVE_STANDBY:
+   default:
+   ns = nvme_find_path(head);
+   }


If we grow multiple path selectors, would be more elegant to
use a callout mechanism.
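
Something along these lines (hypothetical sketch, reusing the names from
this patch):

typedef struct nvme_ns *(*nvme_path_selector_fn)(struct nvme_ns_head *head);

static const nvme_path_selector_fn nvme_path_selectors[] = {
	[NVME_MPATH_ACTIVE_STANDBY]	= nvme_find_path,
	[NVME_MPATH_ROUND_ROBIN]	= nvme_find_path_rr,
};

	/* in nvme_ns_head_make_request(): */
	ns = nvme_path_selectors[head->mpath_policy](head);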


Re: [PATCH 2/4] nvmet-rdma: Use new SGL alloc/free helper for requests

2018-04-04 Thread Sagi Grimberg



@@ -565,24 +565,24 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp 
*rsp,
  {
struct rdma_cm_id *cm_id = rsp->queue->cm_id;
u64 addr = le64_to_cpu(sgl->addr);
-   u32 len = get_unaligned_le24(sgl->length);
u32 key = get_unaligned_le32(sgl->key);
int ret;
  
+	rsp->req.transfer_len = get_unaligned_le24(sgl->length);

+


IIRC, this might result in nvmet-rdma executing data-transfer even
for failed requests in some error cases. I'm not sure this is the
only case, but have you tested what happens in error cases?

See nvmet_rdma_need_data_in()/nvmet_rdma_need_data_out() which look
at transfer_len.
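
Roughly, from nvmet-rdma:

static inline bool nvmet_rdma_need_data_in(struct nvmet_rdma_rsp *rsp)
{
	/* a non-zero transfer_len is what triggers the RDMA data phase,
	 * so it must not be left set on a request we already failed */
	return nvme_is_write(rsp->req.cmd) && rsp->req.transfer_len &&
		!(rsp->flags & NVMET_RDMA_REQ_INLINE_DATA);
}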


/* no data command? */
-   if (!len)
+   if (!rsp->req.transfer_len)
return 0;
  
-	rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt);

-   if (!rsp->req.sg)
-   return NVME_SC_INTERNAL;
+   ret = nvmet_req_alloc_sgl(&rsp->req, &rsp->queue->nvme_sq);
+   if (ret < 0)
+   goto error_out;
  
  	ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,

rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
nvmet_data_dir(&rsp->req));
if (ret < 0)
-   return NVME_SC_INTERNAL;
-   rsp->req.transfer_len += len;
+   goto error_out;
rsp->n_rdma += ret;
  
  	if (invalidate) {

@@ -591,6 +591,10 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp 
*rsp,
}
  
  	return 0;

+
+error_out:
+   rsp->req.transfer_len = 0;
+   return NVME_SC_INTERNAL;
  }
  
  static u16 nvmet_rdma_map_sgl(struct nvmet_rdma_rsp *rsp)




Re: [PATCH 3/4] nvmet-fc: Don't use the count returned by the dma_map_sg call

2018-04-04 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-04 Thread Sagi Grimberg

[so much for putting out flames... :/]


This projecting onto me that I've not been keeping the conversation
technical is in itself hostile.  Sure I get frustrated and lash out (as
I'm _sure_ you'll feel in this reply)


You're right, I do feel this is lashing out. And I don't appreciate it.
Please stop it. We're not going to make progress otherwise.


Can you (or others) please try and articulate why a "fine grained"
multipathing is an absolute must? At the moment, I just don't
understand.


Already made the point multiple times in this thread [3][4][5][1].
Hint: it is about the users who have long-standing expertise and
automation built around dm-multipath and multipath-tools.  BUT those
same users may need/want to simultaneously use native NVMe multipath on
the same host.  Dismissing this point or acting like I haven't
articulated it just illustrates to me continuing this conversation is
not going to be fruitful.


The vast majority of the points are about the fact that people still
need to be able to use multipath-tools, which they still can today.
Personally, I question the existence of this user base you are referring
to which would want to maintain both dm-multipath and nvme personalities 
at the same time on the same host. But I do want us to make progress, so
I will have to take this need as a given.

I agree with Christoph that changing personality on the fly is going to
be painful. This opt-in will need to be one-host at connect time. For
that, we will probably need to also expose an argument in nvme-cli too.
Changing the mpath personality will need to involve disconnecting the
controller and connecting again with the argument toggled. I think this
is the only sane way to do this.

Another path we can make progress in is user visibility. We have
topology in place and you mentioned primary path (which we could
probably add). What else do you need for multipath-tools to support
nvme?


Re: [PATCH v2] nvme: trace: add disk name to tracepoints

2018-06-04 Thread Sagi Grimberg




Add disk name to tracepoints so we can better destinguish between
individual disks in the trace output.

Signed-off-by: Johannes Thumshirn 
Reviewed-by: Sagi Grimberg 


Nit: s/destinguish/distinguish/g

Christoph, can you fix it up when applying or you want me
to do it?


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-06 Thread Sagi Grimberg




We plan to implement all the fancy NVMe standards like ANA, but it
seems that there is still a requirement to let the host side choose
policies about how to use paths (round-robin vs least queue depth for
example).  Even in the modern SCSI world with VPD pages and ALUA,
there are still knobs that are needed.  Maybe NVMe will be different
and we can find defaults that work in all cases but I have to admit
I'm skeptical...


The sensible thing to do in nvme is to use different paths for
different queues.


Huh? different paths == different controllers so this sentence can't
be right... You mean that a path selector will select a controller
based on the home node of the local RDMA device connecting to it and
the running CPU, right?
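
Something like the following very rough sketch is what I'd picture (field
names approximated from nvme core, completely untested):

static struct nvme_ns *nvme_home_node_path(struct nvme_ns_head *head)
{
	int node = numa_node_id();
	struct nvme_ns *ns, *fallback = NULL;

	list_for_each_entry_rcu(ns, &head->list, siblings) {
		if (ns->ctrl->state != NVME_CTRL_LIVE)
			continue;
		/* prefer a controller sitting on the submitting CPU's node */
		if (dev_to_node(ns->ctrl->dev) == node)
			return ns;
		if (!fallback)
			fallback = ns;
	}
	return fallback;
}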


Re: [PATCH v2 0/2] Adding per-controller timeout support to nvme

2019-04-24 Thread Sagi Grimberg




As different nvme controllers are connect via different fabrics, some require
different timeout settings than others. This series implements per-controller
timeouts in the nvme subsystem which can be set via sysfs.


How much of a real issue is this?

block io_timeout defaults to 30 seconds, which is considered a universal
eternity for pretty much any nvme fabric. Moreover, io_timeout is
mutable already on a per-namespace level.

This leaves the admin_timeout which goes beyond this to 60 seconds...

Can you describe what exactly are you trying to solve?


Re: [PATCH 1/3] drivers: nvme: target: core: fix build break

2019-04-24 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH v2 0/2] Adding per-controller timeout support to nvme

2019-04-24 Thread Sagi Grimberg




As different nvme controllers are connect via different fabrics, some require
different timeout settings than others. This series implements per-controller
timeouts in the nvme subsystem which can be set via sysfs.


How much of a real issue is this?

block io_timeout defaults to 30 seconds, which is considered a universal
eternity for pretty much any nvme fabric. Moreover, io_timeout is
mutable already on a per-namespace level.

This leaves the admin_timeout which goes beyond this to 60 seconds...

Can you describe what exactly are you trying to solve?


I think they must have an nvme target that is backed by slow media
(i.e. non-SSD). If that's the case, I think it may be a better option
if the target advertises relatively shallow queue depths and/or lower
MDTS that better aligns with the backing storage capabilities.


It isn't that the media is slow; the max timeout is based on the SLA
for certain classes of "fabric" outages. Linux copes *really* badly
with I/O errors, and if we can make the timeout last long enough to
cover the switch restart worst case, then users are a lot happier.


Well, what is usually done to handle fabric outages is having multiple
paths to the storage device, not sure if that is applicable for you or
not...

What do you mean by "Linux copes *really* badly with I/O errors"? What
can be done better?


Re: [PATCH] nvmet: disable direct I/O when unavailable

2019-02-25 Thread Sagi Grimberg




Frankly, we have a ton of testing related special cases in the kernel.

This one is a) simple and small, only 10 LoC, b) far away from the fast
path or any other place where it could have any impact on legitimate
users and c) it prints an informal message showing you what happened.

Sorry but this is a https://xkcd.com/386/ moment.


How about we just fail it with a proper error message and let the
user set buffered_io instead of trying to get smart here...


Re: [PATCH] RDMA/mlx4: Spread completion vectors for proxy CQs

2019-02-25 Thread Sagi Grimberg




I was thinking of the stuff in core/cq.c - but it also doesn't have
automatic comp_vector balancing. It is the logical place to put
something like that though..

An API to manage a bundle of CPU affine CQ's is probably what most
ULPs really need.. (it makes little sense to create a unique CQ for
every QP)


ULPs behave way differently. E.g. RDS creates one tx and one rx CQ per QP.

As I wrote earlier, we do not have any modify_cq() that changes the comp_vector 
(EQ association). We can balance #CQ associated with the EQs, but we do not 
know their behaviour.

So, assume 2 completion EQs, and four CQs. CQa and CQb are associated with the first EQ, 
the two others with the second EQ. That's the "best" we can do. But, if CQa and 
CQb are the only ones generating events, we will have all interrupt processing on a 
single CPU. But if we now could modify CQa.comp_vector to be that of the second EQ, we 
could achieve balance. But not sure if the drivers are able to do this at all.


alloc_bundle()


You mean alloc a bunch of CQs? How do you know their #cqes and cq_context?


Håkon



get_cqn_for_flow(bundle)
alloc_qp()
destroy_qp()
put_cqn_for_flow(bundle)
destroy_bundle();

Let the core code balance the cqn's and allocate (shared) CQ
resources.

Jason




I sent a simple patchset back in the day for it [1], IIRC there was
some resistance to having multiple ULPs implicitly share the same
completion queues:

[1]:
--
RDMA/core: Add implicit per-device completion queue
 pools

Allow a ULP to ask the core to implicitly assign a completion
queue to a queue-pair based on a least-used search on a per-device
cq pools. The device CQ pools grow in a lazy fashion with every
QP creation.

In addition, expose an affinity hint for a queue pair creation.
If passed, the core will attempt to attach a CQ with a completion
vector that is directed to the cpu core as the affinity hint
provided.

Signed-off-by: Sagi Grimberg 
--

That one added implicit QP create flags:
--
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index bdb1279a415b..56d42e753eb4 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1098,11 +1098,22 @@ enum ib_qp_create_flags {
IB_QP_CREATE_SCATTER_FCS= 1 << 8,
IB_QP_CREATE_CVLAN_STRIPPING= 1 << 9,
IB_QP_CREATE_SOURCE_QPN = 1 << 10,
+
+   /* only used by the core, not passed to low-level drivers */
+   IB_QP_CREATE_ASSIGN_CQS = 1 << 24,
+   IB_QP_CREATE_AFFINITY_HINT  = 1 << 25,
+
--

Then I modified it to add a ib_cq_pool that a ULP can allocate
privately and then get/put CQs from/to.

[2]:
--
IB/core: Add a simple CQ pool API

Using CQ pools is useful especially for target/server modes.
The server/target implementation will usually serve multiple clients
and will usually have an array of completion queues allocated for that.

In addition, usually the server/target implementation will use a least-used
scheme to select a completion vector to each completion queue in order
to achieve better parallelism.

Having the server/target rdma queue-pairs share completion queues as
much as possible is desirable as it allows for better completion 
aggregation.

One downside of this approach is that some entries of the completion queues
might never be used in case the queue-pairs sizes are not fixed.

This simple CQ pool API allows for both optimizations and exposes a simple
API to alloc/free a completion queue pool and get/put from the pool.

The pool starts by allocating a caller-defined batch of CQs, and grows
in batches in a lazy fashion.

Signed-off-by: Sagi Grimberg 
--

That one had the CQ pool API:
--
+struct ib_cq_pool *ib_alloc_cq_pool(struct ib_device *device, int nr_cqe,
+   int nr_cqs, enum ib_poll_context poll_ctx);
+void ib_free_cq_pool(struct ib_cq_pool *pool);
+void ib_cq_pool_put(struct ib_cq *cq, unsigned int nents);
+struct ib_cq *ib_cq_pool_get(struct ib_cq_pool *pool, unsigned int nents);
--
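
A usage sketch of that (never merged) API, just to make the intent
concrete - names as in the prototypes above, everything else hypothetical:

	/* allocate one pool per device, starting with one CQ per online CPU */
	pool = ib_alloc_cq_pool(device, nr_cqe, num_online_cpus(),
				IB_POLL_SOFTIRQ);

	/* per queue-pair: borrow the least-used CQ with enough free entries */
	cq = ib_cq_pool_get(pool, queue_size);
	/* ... create the QP against cq ... */

	/* teardown */
	ib_cq_pool_put(cq, queue_size);
	ib_free_cq_pool(pool);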

I can try to revive this if this becomes interesting again to anyone..

Thoughts?


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-12 Thread Sagi Grimberg

blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request which
can be fatal.


How did you get to this one? did the controller send a completion for
a completed/bogus request?


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-12 Thread Sagi Grimberg




blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request which
can be fatal.


How did you get to this one? did the controller send a completion for
a completed/bogus request?


If that is the case, then that must mean it's possible the driver could
have started the command id just before the bogus completion check. Data
corruption, right?


Yes, which is why I don't think this check is very useful..


Re: [PATCH] nvme: convert sysfs sprintf/snprintf family to sysfs_emit

2021-02-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH v2] nvme-multipath: Early exit if no path is available

2021-01-28 Thread Sagi Grimberg




You can't see exactly where it dies but I followed the assembly to
nvme_round_robin_path(). Maybe it's not the initial nvme_next_ns(head,
old) which returns NULL but nvme_next_ns() is returning NULL eventually
(list_next_or_null_rcu()).

So there is another bug causing nvme_next_ns to behave abnormally.
I review the code about head->list and head->current_path, I find 2 bugs
may cause the bug:
First, I already send the patch. see:
https://lore.kernel.org/linux-nvme/20210128033351.22116-1-lengc...@huawei.com/ 


Second, in nvme_ns_remove, list_del_rcu is before
nvme_mpath_clear_current_path. This may cause "old" to be deleted from
"head" while "old" is still in use. I'm not sure whether there's any other
consideration here; I will check it and try to fix it.


The reason why we first remove from head->list and only then clear
current_path is because the other way around there is no way
to guarantee that the ns won't be assigned as current_path
again (because it is in head->list).

nvme_ns_remove fences the continuation of the ns deletion by synchronizing
the srcu, so that the current_path clearance is guaranteed to be visible.
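
Roughly, the ordering being described is (simplified, not the literal
code):

	list_del_rcu(&ns->siblings);		/* ns can no longer be picked
						 * as a new current_path */
	nvme_mpath_clear_current_path(ns);
	synchronize_srcu(&ns->head->srcu);	/* all readers observe the
						 * clearance before we go on */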


Re: [PATCH v2] nvme-multipath: Early exit if no path is available

2021-01-28 Thread Sagi Grimberg




You can't see exactly where it dies but I followed the assembly to
nvme_round_robin_path(). Maybe it's not the initial nvme_next_ns(head,
old) which returns NULL but nvme_next_ns() is returning NULL eventually
(list_next_or_null_rcu()).

So there is another bug causing nvme_next_ns to behave abnormally.
I reviewed the code around head->list and head->current_path and found 2
bugs that may cause this:
First, I already sent a patch, see:
https://lore.kernel.org/linux-nvme/20210128033351.22116-1-lengc...@huawei.com/ 


Second, in nvme_ns_remove, list_del_rcu happens before
nvme_mpath_clear_current_path. This may mean "old" has already been
deleted from "head" but is still being used. I'm not sure whether there
is any other consideration here; I will check it and try to fix it.


The reason why we first remove from head->list and only then clear
current_path is that the other way around there is no way
to guarantee that the ns won't be assigned as current_path
again (because it is still in head->list).

ok, I see.


nvme_ns_remove fences the continuation of the ns deletion by
synchronizing the srcu, so that the current_path clearance is
guaranteed to be visible.

The list will be like this:
head->next = ns1;
ns1->next = head;
old->next = ns1;
This may cause an infinite loop in nvme_round_robin_path:
for (ns = nvme_next_ns(head, old);
 ns != old;
 ns = nvme_next_ns(head, ns))
ns will always be ns1, and then the loop spins forever.


Who is being removed? I'm not following
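
To make the reported scenario concrete, here is a small stand-alone
user-space model of it (hypothetical and simplified; an iteration cap is
added only so the demo itself terminates):

--
#include <stdio.h>

struct node { const char *name; struct node *next; };

/* Mimics nvme_next_ns(): advance, skipping the list head sentinel */
static struct node *next_ns(struct node *head, struct node *cur)
{
        struct node *n = cur->next;

        return n == head ? head->next : n;
}

int main(void)
{
        struct node head = { "head", NULL };
        struct node ns1  = { "ns1",  NULL };
        struct node old  = { "old",  NULL };
        struct node *ns;
        int steps = 0;

        head.next = &ns1;       /* head -> ns1 -> head ... */
        ns1.next  = &head;
        old.next  = &ns1;       /* "old" unlinked, but still points at ns1 */

        for (ns = next_ns(&head, &old);
             ns != &old && steps < 8;
             ns = next_ns(&head, ns), steps++)
                printf("visiting %s\n", ns->name);

        printf("stopped after %d steps (would spin forever without the cap)\n",
               steps);
        return 0;
}
--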


Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143

2020-08-03 Thread Sagi Grimberg




Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: c804af2c1d3152c0cf877eeb50d60c2d49ac0cf0 ("IB/srpt: use new 
shared CQ mechanism")

https://git.kernel.org/cgit/linux/kernel/git/rdma/rdma.git for-next


in testcase: blktests
with following parameters:

test: srp-group1
ucode: 0x21



on test machine: 4 threads Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz 
with 4G memory


caused below changes (please refer to attached dmesg/kmsg for entire 
log/backtrace):





If you fix the issue, kindly add following tag
Reported-by: kernel test robot 


user  :notice: [   44.688140] 2020-08-01 16:10:22 ./check srp/001 
srp/002 srp/003 srp/004 srp/005 srp/006 srp/007 srp/008 srp/009 
srp/010 srp/011 srp/012 srp/013 srp/015

user  :notice: [   44.706657] srp/001 (Create and remove LUNs)
user  :notice: [   44.718405] srp/001 (Create and remove 
LUNs) [passed]

user  :notice: [   44.729902] runtime  ...  1.972s
user  :notice: [   99.038748] IPMI BMC is not supported on this 
machine, skip bmc-watchdog setup!
user  :notice: [ 3699.039790] Sat Aug  1 17:11:22 UTC 2020 detected 
soft_timeout
user  :notice: [ 3699.060341] kill 960 /usr/bin/time -v -o 
/tmp/lkp/blktests.time /lkp/lkp/src/tests/blktests

Yamin and Max, can you take a look at this? The SRP tests from the
blktests repository pass reliably with kernel version v5.7 and before.
With label next-20200731 from linux-next, however, that test triggers the
following hang:


I will look into it.


FWIW, I ran into this as well with nvme-rdma, but it also reproduces
when I revert the shared CQ patch from nvme-rdma. Another data point
is that my tests pass with siw.


Re: [PATCH] nvmet-passthru: Reject commands with non-sgl flags set

2020-08-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH] nvmet: fix opps in nvmet_execute_passthru_cmd()

2020-08-05 Thread Sagi Grimberg




This patch adds a check in nvmet_execute_passthru_cmd() to prevent the
following oops :-


Why LKML and not linux-nvme?


Re: [PATCH] nvmet: fix opps in nvmet_execute_passthru_cmd()

2020-08-05 Thread Sagi Grimberg




This patch adds a check in nvmet_execute_passthru_cmd() to prevent the
following oops :-


Why LKML and not linux-nvme?



My bad (+linux-nvme).


It doesn't have the patch. Can you resend?

