Re: [PATCH v2] nvme-rdma: remove race conditions from IB signalling

2017-06-21 Thread Marta Rybczynska
> On Tue, Jun 06, 2017 at 02:59:43PM +0300, Sagi Grimberg wrote:
>> Christoph,
>>
>> Can you place a stable tag on this (4.11+)?
> 
> Added.

I do not see this one in nvme-4.13. Can we get it in, please?
We're seeing the races in our setup and this patch fixes them.

Thanks,
Marta


Backport Mellanox mlx5 patches to stable 4.9.y

2018-01-30 Thread Marta Rybczynska
Hello Mellanox maintainers,
I'd like to ask you to OK backporting two patches to the mlx5 driver in
the 4.9 stable tree (they have been in master for some time already).

We have multiple deployments on 4.9 that are running into the bug fixed by
those patches. We're deploying patched kernels and the issue disappears.

The patches are:
1410a90ae449061b7e1ae19d275148f36948801b net/mlx5: Define interface bits for fencing UMR wqe
6e8484c5cf07c7ee632587e98c1a12d319dacb7c RDMA/mlx5: set UMR wqe fence according to HCA cap
 
Regards,
Marta


Re: Backport Mellanox mlx5 patches to stable 4.9.y

2018-02-01 Thread Marta Rybczynska
----- Original Message -----
> On Tue, Jan 30, 2018 at 10:12:51AM +0100, Marta Rybczynska wrote:
>> Hello Mellanox maintainers,
>> I'd like to ask you to OK backporting two patches to the mlx5 driver in
>> the 4.9 stable tree (they have been in master for some time already).
>> 
>> We have multiple deployments on 4.9 that are running into the bug fixed
>> by those patches. We're deploying patched kernels and the issue
>> disappears.
>> 
>> The patches are:
>> 1410a90ae449061b7e1ae19d275148f36948801b net/mlx5: Define interface bits
>> for fencing UMR wqe
>> 6e8484c5cf07c7ee632587e98c1a12d319dacb7c RDMA/mlx5: set UMR wqe fence
>> according to HCA cap
> 
> Looks good, now queued up, thanks.
> 
> greg k-h

Thanks Greg!

Marta


Re: [PATCH 09/12] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-01-05 Thread Marta Rybczynska
> @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
> {
>   u16 tail = nvmeq->sq_tail;
> 
> - if (nvmeq->sq_cmds_io)
> - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
> - else
> - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> 

Why is it always memcpy() and not memcpy_toio()? I'd expect something like:

	if (nvmeq->sq_cmds_is_io)
		memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
	else
		memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
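
For context, the general rule behind the question (illustrative only; the
destination variables here are hypothetical):

	/* __iomem mappings (ioremap()/pci_iomap()) need the *_toio accessors */
	void __iomem *dst_io = pci_iomap(pdev, 2, 0);	/* e.g. a CMB BAR */
	memcpy_toio(dst_io, cmd, sizeof(*cmd));

	/* a pointer from a regular kernel allocator is ordinary memory */
	memcpy(dst_mem, cmd, sizeof(*cmd));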

Marta


[RFC PATCH] nvme: avoid race-conditions when enabling devices

2018-03-21 Thread Marta Rybczynska
The NVMe driver uses threads for the reset work, including enabling
the PCIe device. When multiple NVMe devices are initialized, their reset
work items may be scheduled in parallel. Then pci_enable_device_mem() can
be called in parallel on multiple cores.

This causes a loop enabling all upstream bridges in
pci_enable_bridge(). pci_enable_bridge() performs multiple operations,
including __pci_set_master() and architecture-specific functions that
call ones like pci_enable_resources(). Both __pci_set_master()
and pci_enable_resources() read the PCI_COMMAND field in PCI config
space and change it. This is done as read/modify/write.
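
To make the race concrete, here is a minimal sketch (illustration only, not
part of the patch) of the unsynchronized read/modify/write, using the
standard config-space accessors; 'bridge' stands for any shared upstream
struct pci_dev:

	u16 cmd;

	pci_read_config_word(bridge, PCI_COMMAND, &cmd); /* core A reads old value */
	/* ... core B performs its own read/modify/write here ... */
	cmd |= PCI_COMMAND_MASTER;                       /* core A modifies a stale copy */
	pci_write_config_word(bridge, PCI_COMMAND, cmd); /* core A overwrites core B's update */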

Imagine that the PCIe tree looks like:
A - B - switch -  C - D
   \- E - F

D and F are two NVMe disks and all devices from B down are not enabled,
with bus mastering not set. If their reset work items are scheduled in
parallel, the two modifications of PCI_COMMAND may happen in parallel
without locking, and the system may end up with part of the PCIe tree
not enabled.

The problem may also happen if another device is initialized in parallel
with an NVMe disk.

This fix moves pci_enable_device_mem() to the probe part of the driver,
which runs sequentially, to avoid the issue.

Signed-off-by: Marta Rybczynska 
Signed-off-by: Pierre-Yves Kerbrat 
---
 drivers/nvme/host/pci.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b6f43b7..af53854 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2515,6 +2515,14 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
+   /*
+* Enable the device now to make sure that all accesses to bridges above
+* are done without races
+*/
+   result = pci_enable_device_mem(pdev);
+   if (result)
+   goto release_pools;
+
nvme_reset_ctrl(&dev->ctrl);
 
return 0;
-- 
1.8.3.1


Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices

2018-03-21 Thread Marta Rybczynska
> On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
>> The NVMe driver uses threads for the reset work, including enabling
>> the PCIe device. When multiple NVMe devices are initialized, their
>> reset work items may be scheduled in parallel. Then
>> pci_enable_device_mem() can be called in parallel on multiple cores.
>> 
>> This causes a loop enabling all upstream bridges in
>> pci_enable_bridge(). pci_enable_bridge() performs multiple operations,
>> including __pci_set_master() and architecture-specific functions that
>> call ones like pci_enable_resources(). Both __pci_set_master()
>> and pci_enable_resources() read the PCI_COMMAND field in PCI config
>> space and change it. This is done as read/modify/write.
>> 
>> Imagine that the PCIe tree looks like:
>> A - B - switch -  C - D
>>    \- E - F
>> 
>> D and F are two NVMe disks and all devices from B down are not enabled,
>> with bus mastering not set. If their reset work items are scheduled in
>> parallel, the two modifications of PCI_COMMAND may happen in parallel
>> without locking, and the system may end up with part of the PCIe tree
>> not enabled.
> 
> Then it looks like serialized reset should be used, and I did see that
> commit 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes the issue of
> 'failed to mark controller state' in a reset stress test.
> 
> But that commit only covers the case of PCI reset from the sysfs
> attribute, and maybe other cases need to be dealt with in a similar way
> too.
> 

It seems to me that the serialized reset works for multiple resets of the
same device, doesn't it? Our problem is linked to resets of different devices
that share the same PCIe tree.

You're right that the problem we face might also come with manual resets
under certain conditions (I think that all devices in a subtree would need
to be disabled).

Thanks,
Marta


Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices

2018-03-21 Thread Marta Rybczynska
> On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
>> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
>> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
>> > >> The NVMe driver uses threads for the reset work, including enabling
>> > >> the PCIe device. When multiple NVMe devices are initialized, their
>> > >> reset work items may be scheduled in parallel. Then
>> > >> pci_enable_device_mem() can be called in parallel on multiple cores.
>> > >> 
>> > >> This causes a loop enabling all upstream bridges in
>> > >> pci_enable_bridge(). pci_enable_bridge() performs multiple operations,
>> > >> including __pci_set_master() and architecture-specific functions that
>> > >> call ones like pci_enable_resources(). Both __pci_set_master()
>> > >> and pci_enable_resources() read the PCI_COMMAND field in PCI config
>> > >> space and change it. This is done as read/modify/write.
>> > >> 
>> > >> Imagine that the PCIe tree looks like:
>> > >> A - B - switch -  C - D
>> > >>    \- E - F
>> > >> 
>> > >> D and F are two NVMe disks and all devices from B down are not
>> > >> enabled, with bus mastering not set. If their reset work items are
>> > >> scheduled in parallel, the two modifications of PCI_COMMAND may
>> > >> happen in parallel without locking, and the system may end up with
>> > >> part of the PCIe tree not enabled.
>> > > 
>> > > Then it looks like serialized reset should be used, and I did see that
>> > > commit 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes the issue
>> > > of 'failed to mark controller state' in a reset stress test.
>> > > 
>> > > But that commit only covers the case of PCI reset from the sysfs
>> > > attribute, and maybe other cases need to be dealt with in a similar
>> > > way too.
>> > > 
>> > 
>> > It seems to me that the serialized reset works for multiple resets of
>> > the same device, doesn't it? Our problem is linked to resets of
>> > different devices that share the same PCIe tree.
>> 
>> Given reset shouldn't be a frequent action, it might be fine to serialize
>> all resets from different devices.
> 
> The driver was much simpler when we had serialized resets in line with
> probe, but that had bigger problems with certain init systems when
> you put enough nvme devices in your server, making them unbootable.
> 
> Would it be okay to serialize just the pci_enable_device across all
> other tasks messing with the PCI topology?
> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index cef5ce851a92..e0a2f6c0f1cf 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>   int result = -ENOMEM;
>   struct pci_dev *pdev = to_pci_dev(dev->dev);
> 
> - if (pci_enable_device_mem(pdev))
> - return result;
> + pci_lock_rescan_remove();
> + result = pci_enable_device_mem(pdev);
> + pci_unlock_rescan_remove();
> + if (result)
> + return -ENODEV;
> 
>   pci_set_master(pdev);
> 

The problem may also happen with another device doing its probe while nvme
runs its workqueue (and we have probably seen it in practice too). We were
thinking about a lock in the generic PCI code too; that's why I've put the
linux-pci@ list in copy.

Marta


Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices

2018-03-23 Thread Marta Rybczynska

> On Wed, Mar 21, 2018 at 05:10:56PM +0100, Marta Rybczynska wrote:
>> 
>> The problem may happen also with other device doing its probe and
>> nvme running its workqueue (and we probably have seen it in practice
>> too). We were thinking about a lock in the pci generic code too,
>> that's why I've put the linux-pci@ list in copy.
> 
> Yes, this is a generic problem in the PCI core.  We've tried to fix it
> in the past but haven't figured it out yet.
> 
> See 40f11adc7cd9 ("PCI: Avoid race while enabling upstream bridges")
> and 0f50a49e3008 ("Revert "PCI: Avoid race while enabling upstream
> bridges"").
> 
> It's not trivial, but if you figure out a good way to fix this, I'd be
> thrilled.
> 

Bjorn, Srinath, are you aware of anyone working on an updated fix
for this one?

Marta


Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices

2018-03-23 Thread Marta Rybczynska


----- Original Message -----
> From: "Marta Rybczynska" 
> To: "Keith Busch" 
> Cc: "Ming Lei" , ax...@fb.com, h...@lst.de, 
> s...@grimberg.me, linux-n...@lists.infradead.org,
> linux-kernel@vger.kernel.org, bhelg...@google.com, linux-...@vger.kernel.org, 
> "Pierre-Yves Kerbrat"
> 
> Sent: Wednesday, March 21, 2018, 17:10:56
> Subject: Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices

>> On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
>>> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
>>> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
>>> > >> The NVMe driver uses threads for the reset work, including enabling
>>> > >> the PCIe device. When multiple NVMe devices are initialized, their
>>> > >> reset work items may be scheduled in parallel. Then
>>> > >> pci_enable_device_mem() can be called in parallel on multiple cores.
>>> > >> 
>>> > >> This causes a loop enabling all upstream bridges in
>>> > >> pci_enable_bridge(). pci_enable_bridge() performs multiple
>>> > >> operations, including __pci_set_master() and architecture-specific
>>> > >> functions that call ones like pci_enable_resources(). Both
>>> > >> __pci_set_master() and pci_enable_resources() read the PCI_COMMAND
>>> > >> field in PCI config space and change it. This is done as
>>> > >> read/modify/write.
>>> > >> 
>>> > >> Imagine that the PCIe tree looks like:
>>> > >> A - B - switch -  C - D
>>> > >>    \- E - F
>>> > >> 
>>> > >> D and F are two NVMe disks and all devices from B down are not
>>> > >> enabled, with bus mastering not set. If their reset work items are
>>> > >> scheduled in parallel, the two modifications of PCI_COMMAND may
>>> > >> happen in parallel without locking, and the system may end up with
>>> > >> part of the PCIe tree not enabled.
>>> > > 
>>> > > Then it looks like serialized reset should be used, and I did see
>>> > > that commit 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes the
>>> > > issue of 'failed to mark controller state' in a reset stress test.
>>> > > 
>>> > > But that commit only covers the case of PCI reset from the sysfs
>>> > > attribute, and maybe other cases need to be dealt with in a similar
>>> > > way too.
>>> > > 
>>> > 
>>> > It seems to me that the serialized reset works for multiple resets of
>>> > the same device, doesn't it? Our problem is linked to resets of
>>> > different devices that share the same PCIe tree.
>>> 
>>> Given reset shouldn't be a frequent action, it might be fine to
>>> serialize all resets from different devices.
>> 
>> The driver was much simpler when we had serialized resets in line with
>> probe, but that had bigger problems with certain init systems when
>> you put enough nvme devices in your server, making them unbootable.
>> 
>> Would it be okay to serialize just the pci_enable_device across all
>> other tasks messing with the PCI topology?
>> 
>> ---
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index cef5ce851a92..e0a2f6c0f1cf 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>>  int result = -ENOMEM;
>>  struct pci_dev *pdev = to_pci_dev(dev->dev);
>> 
>> -if (pci_enable_device_mem(pdev))
>> -return result;
>> +pci_lock_rescan_remove();
>> +result = pci_enable_device_mem(pdev);
>> +pci_unlock_rescan_remove();
>> +if (result)
>> +return -ENODEV;
>> 
>>  pci_set_master(pdev);
>> 
> 
> The problem may also happen with another device doing its probe while nvme
> runs its workqueue (and we have probably seen it in practice too). We were
> thinking about a lock in the generic PCI code too; that's why I've put the
> linux-pci@ list in copy.
> 

Keith, it looks to me that this is going to fix the issue between two nvme
driver instances at hotplug time. This is the one we didn't cover in the
first patch.

We can see the issue at driver load (so at boot) and the lock isn't taken by
the generic non-rescan code. Other calls to pci_enable_device_mem() aren't
protected either (see Bjorn's message).

What do you think about applying both for now until we have a generic fix
in pci?

Marta


Re: [RFC PATCH 5/6] pci: Protect the enable/disable state of pci_dev using the state mutex

2018-08-17 Thread Marta Rybczynska



- On 17 Aug, 2018, at 06:49, Benjamin Herrenschmidt b...@kernel.crashing.org wrote:

> This protects enable/disable operations using the state mutex to
> avoid races with, for example, concurrent enables on a bridge.
> 
> The bus hierarchy is walked first before taking the lock to
> avoid lock nesting (though it would be ok if we ensured that
> we always nest them bottom-up, it is better to just avoid the
> issue altogether, especially as we might find cases where
> we want to take it top-down later).
> 
> Signed-off-by: Benjamin Herrenschmidt 


> 
> static void pci_enable_bridge(struct pci_dev *dev)
> {
>   struct pci_dev *bridge;
> - int retval;
> + int retval, enabled;
> 
>   bridge = pci_upstream_bridge(dev);
>   if (bridge)
>   pci_enable_bridge(bridge);
> 
> - if (pci_is_enabled(dev)) {
> - if (!dev->is_busmaster)
> - pci_set_master(dev);
> + /* Already enabled ? */
> + pci_dev_state_lock(dev);
> + enabled = pci_is_enabled(dev);
> + if (enabled && !dev->is_busmaster)
> + pci_set_master(dev);
> + pci_dev_state_unlock(dev);
> + if (enabled)
>   return;
> - }
> 

This looks complicated to me, especially with the double locking. What do
you think about a function doing that check that uses an unlocked version
of pci_set_master()?

Like:

	pci_dev_state_lock(dev);
	enabled = pci_is_enabled(dev);
	if (enabled && !dev->is_busmaster)
		pci_set_master_unlocked(dev);
	pci_dev_state_unlock(dev);

BTW, if I remember correctly, the code today can set bus mastering multiple
times without checking whether it is already done.
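
I.e. a guard like this (sketch only):

	if (!dev->is_busmaster)
		pci_set_master(dev);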

Marta


[PATCH v2] nvme: allow 64-bit results in passthru commands

2019-08-16 Thread Marta Rybczynska
It is not possible to get 64-bit results from the passthru commands,
which prevents reading the Capabilities (CAP) property value.

As a result, it is not possible to implement IOL's NVMe Conformance
test 4.3 Case 1 for Fabrics targets [1] (page 123).

This issue has already been discussed [2], but without a solution.

This patch solves the problem by adding new ioctls with a new
passthru structure, including 64-bit results. The older ioctls stay
unchanged.

[1] 
https://www.iol.unh.edu/sites/default/files/testsuites/nvme/UNH-IOL_NVMe_Conformance_Test_Suite_v11.0.pdf
[2] http://lists.infradead.org/pipermail/linux-nvme/2018-June/018791.html
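
For reference, a minimal userspace sketch of how the new ioctl is meant to
be used (it assumes the NVME_IOCTL_ADMIN64_CMD ioctl and struct
nvme_passthru_cmd64 added by this patch; the opcode is a placeholder for
whatever admin/fabrics command a test needs to issue):

	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/nvme_ioctl.h>

	int main(void)
	{
		struct nvme_passthru_cmd64 cmd;
		int fd = open("/dev/nvme0", O_RDWR);

		if (fd < 0)
			return 1;

		memset(&cmd, 0, sizeof(cmd));
		cmd.opcode = 0x7f;	/* placeholder opcode */

		if (ioctl(fd, NVME_IOCTL_ADMIN64_CMD, &cmd) < 0) {
			close(fd);
			return 1;
		}
		/* the full 64-bit completion result, e.g. a property such as CAP */
		printf("result: 0x%llx\n", (unsigned long long)cmd.result);
		close(fd);
		return 0;
	}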

Signed-off-by: Marta Rybczynska 
---
 drivers/nvme/host/core.c| 66 ++---
 include/uapi/linux/nvme_ioctl.h | 23 ++
 2 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8f3fbe5..2ddb095 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -849,7 +849,7 @@ static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf,
 static int nvme_submit_user_cmd(struct request_queue *q,
struct nvme_command *cmd, void __user *ubuffer,
unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
-   u32 meta_seed, u32 *result, unsigned timeout)
+   u32 meta_seed, u64 *result, unsigned timeout)
 {
bool write = nvme_is_write(cmd);
struct nvme_ns *ns = q->queuedata;
@@ -890,7 +890,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
else
ret = nvme_req(req)->status;
if (result)
-   *result = le32_to_cpu(nvme_req(req)->result.u32);
+   *result = le64_to_cpu(nvme_req(req)->result.u64);
if (meta && !ret && !write) {
if (copy_to_user(meta_buffer, meta, meta_len))
ret = -EFAULT;
@@ -1331,6 +1331,54 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
struct nvme_command c;
unsigned timeout = 0;
u32 effects;
+   u64 result;
+   int status;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EACCES;
+   if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
+   return -EFAULT;
+   if (cmd.flags)
+   return -EINVAL;
+
+   memset(&c, 0, sizeof(c));
+   c.common.opcode = cmd.opcode;
+   c.common.flags = cmd.flags;
+   c.common.nsid = cpu_to_le32(cmd.nsid);
+   c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
+   c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
+   c.common.cdw10 = cpu_to_le32(cmd.cdw10);
+   c.common.cdw11 = cpu_to_le32(cmd.cdw11);
+   c.common.cdw12 = cpu_to_le32(cmd.cdw12);
+   c.common.cdw13 = cpu_to_le32(cmd.cdw13);
+   c.common.cdw14 = cpu_to_le32(cmd.cdw14);
+   c.common.cdw15 = cpu_to_le32(cmd.cdw15);
+
+   if (cmd.timeout_ms)
+   timeout = msecs_to_jiffies(cmd.timeout_ms);
+
+   effects = nvme_passthru_start(ctrl, ns, cmd.opcode);
+   status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
+   (void __user *)(uintptr_t)cmd.addr, cmd.data_len,
+   (void __user *)(uintptr_t)cmd.metadata, cmd.metadata_len,
+   0, &result, timeout);
+   nvme_passthru_end(ctrl, effects);
+
+   if (status >= 0) {
+   if (put_user(result, &ucmd->result))
+   return -EFAULT;
+   }
+
+   return status;
+}
+
+static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+   struct nvme_passthru_cmd64 __user *ucmd)
+{
+   struct nvme_passthru_cmd64 cmd;
+   struct nvme_command c;
+   unsigned timeout = 0;
+   u32 effects;
int status;
 
if (!capable(CAP_SYS_ADMIN))
@@ -1401,6 +1449,13 @@ static void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx)
srcu_read_unlock(&head->srcu, idx);
 }
 
+static bool is_admin_cmd(unsigned int cmd)
+{
+   if ((cmd == NVME_IOCTL_ADMIN_CMD) || (cmd == NVME_IOCTL_ADMIN64_CMD))
+   return true;
+   return false;
+}
+
 static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
unsigned int cmd, unsigned long arg)
 {
@@ -1418,13 +1473,13 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 * seperately and drop the ns SRCU reference early.  This avoids a
 * deadlock when deleting namespaces using the passthrough interface.
 */
-   if (cmd == NVME_IOCTL_ADMIN_CMD || is_sed_ioctl(cmd)) {
+   if (is_admin_cmd(cmd) || is_sed_ioctl(cmd)) {
struct nvme_ctrl *ctrl = ns->ctrl;
 
nvme_get_ctrl(ns->ctrl);
nvme_put_ns_from_disk(head, srcu_idx);
 
-   if (cmd == NVME_IOCTL_ADMIN_CMD)

Re: [PATCH v2] nvme: allow 64-bit results in passthru commands

2019-08-26 Thread Marta Rybczynska



- On 22 Aug, 2019, at 02:06, Christoph Hellwig h...@lst.de wrote:

> On Fri, Aug 16, 2019 at 11:47:21AM +0200, Marta Rybczynska wrote:
>> It is not possible to get 64-bit results from the passthru commands,
>> which prevents reading the Capabilities (CAP) property value.
>> 
>> As a result, it is not possible to implement IOL's NVMe Conformance
>> test 4.3 Case 1 for Fabrics targets [1] (page 123).
>> 
>> This issue has already been discussed [2], but without a solution.
>> 
>> This patch solves the problem by adding new ioctls with a new
>> passthru structure, including 64-bit results. The older ioctls stay
>> unchanged.
> 
> Ok, with my idea not being suitable I think I'm fine with this approach, a
> little nitpick below:
> 
>> +static bool is_admin_cmd(unsigned int cmd)
>> +{
>> +if ((cmd == NVME_IOCTL_ADMIN_CMD) || (cmd == NVME_IOCTL_ADMIN64_CMD))
>> +return true;
>> +return false;
>> +}
> 
> No need for the inner braces.  But I'm actually not sure the current
> code structure is very suitable for extending it.
> 
>> +
>>  static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
>>  unsigned int cmd, unsigned long arg)
>>  {
>> @@ -1418,13 +1473,13 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
>>   * seperately and drop the ns SRCU reference early.  This avoids a
>>   * deadlock when deleting namespaces using the passthrough interface.
>>   */
>> -if (cmd == NVME_IOCTL_ADMIN_CMD || is_sed_ioctl(cmd)) {
>> +if (is_admin_cmd(cmd) || is_sed_ioctl(cmd)) {
> 
> So maybe for this check we should have an is_ctrl_ioctl() helper instead
> that includes the is_sed_ioctl check.
> 
>>  struct nvme_ctrl *ctrl = ns->ctrl;
>>  
>>  nvme_get_ctrl(ns->ctrl);
>>  nvme_put_ns_from_disk(head, srcu_idx);
>>  
>> -if (cmd == NVME_IOCTL_ADMIN_CMD)
>> +if (is_admin_cmd(cmd))
>>  ret = nvme_user_cmd(ctrl, NULL, argp);
>>  else
>>  ret = sed_ioctl(ctrl->opal_dev, cmd, argp);
> 
> And then we can move this whole branch into a helper function,
> which then switches on the ioctl cmd, with sed_ioctl as the fallback.

Do you mean something like this?

+static bool is_ctrl_ioctl(unsigned int cmd)
+{
+   if (cmd == NVME_IOCTL_ADMIN_CMD || cmd == NVME_IOCTL_ADMIN64_CMD)
+   return true;
+   if (is_sed_ioctl(cmd))
+   return true;
+   return false;
+}
+
+static int nvme_handle_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
+ void __user *argp,
+ struct nvme_ns_head *head,
+ int srcu_idx)
+{
+   struct nvme_ctrl *ctrl = ns->ctrl;
+   int ret;
+
+   nvme_get_ctrl(ns->ctrl);
+   nvme_put_ns_from_disk(head, srcu_idx);
+
+   switch (cmd) {
+   case NVME_IOCTL_ADMIN_CMD:
+   ret = nvme_user_cmd(ctrl, NULL, argp);
+   break;
+   case NVME_IOCTL_ADMIN64_CMD:
+   ret = nvme_user_cmd64(ctrl, NULL, argp);
+   break;
+   default:
+   ret = sed_ioctl(ctrl->opal_dev, cmd, argp);
+   break;
+   }
+   nvme_put_ctrl(ctrl);
+   return ret;
+}
+
 static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
unsigned int cmd, unsigned long arg)
 {
@@ -1418,20 +1501,8 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
 * seperately and drop the ns SRCU reference early.  This avoids a
 * deadlock when deleting namespaces using the passthrough interface.
 */
-   if (cmd == NVME_IOCTL_ADMIN_CMD || is_sed_ioctl(cmd)) {
-   struct nvme_ctrl *ctrl = ns->ctrl;
-
-   nvme_get_ctrl(ns->ctrl);
-   nvme_put_ns_from_disk(head, srcu_idx);
-
-   if (cmd == NVME_IOCTL_ADMIN_CMD)
-   ret = nvme_user_cmd(ctrl, NULL, argp);
-   else
-   ret = sed_ioctl(ctrl->opal_dev, cmd, argp);
-
-   nvme_put_ctrl(ctrl);
-   return ret;
-   }
+   if (is_ctrl_ioctl(cmd))
+   return nvme_handle_ctrl_ioctl(ns, cmd, argp, head, srcu_idx);
 
switch (cmd) {
case NVME_IOCTL_ID:

Regards,
Marta


[PATCH v3] nvme: allow 64-bit results in passthru commands

2019-09-24 Thread Marta Rybczynska
It is not possible to get 64-bit results from the passthru commands,
which prevents reading the Capabilities (CAP) property value.

As a result, it is not possible to implement IOL's NVMe Conformance
test 4.3 Case 1 for Fabrics targets [1] (page 123).

This issue has already been discussed [2], but without a solution.

This patch solves the problem by adding new ioctls with a new
passthru structure, including 64-bit results. The older ioctls stay
unchanged.

[1] 
https://www.iol.unh.edu/sites/default/files/testsuites/nvme/UNH-IOL_NVMe_Conformance_Test_Suite_v11.0.pdf
[2] http://lists.infradead.org/pipermail/linux-nvme/2018-June/018791.html

Signed-off-by: Marta Rybczynska 
---
 drivers/nvme/host/core.c| 108 ++--
 include/uapi/linux/nvme_ioctl.h |  23 +
 2 files changed, 115 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1ede176..ac6a787 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -856,7 +856,7 @@ static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf,
 static int nvme_submit_user_cmd(struct request_queue *q,
struct nvme_command *cmd, void __user *ubuffer,
unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
-   u32 meta_seed, u32 *result, unsigned timeout)
+   u32 meta_seed, u64 *result, unsigned timeout)
 {
bool write = nvme_is_write(cmd);
struct nvme_ns *ns = q->queuedata;
@@ -897,7 +897,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
else
ret = nvme_req(req)->status;
if (result)
-   *result = le32_to_cpu(nvme_req(req)->result.u32);
+   *result = le64_to_cpu(nvme_req(req)->result.u64);
if (meta && !ret && !write) {
if (copy_to_user(meta_buffer, meta, meta_len))
ret = -EFAULT;
@@ -1344,6 +1344,54 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
struct nvme_command c;
unsigned timeout = 0;
u32 effects;
+   u64 result;
+   int status;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EACCES;
+   if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
+   return -EFAULT;
+   if (cmd.flags)
+   return -EINVAL;
+
+   memset(&c, 0, sizeof(c));
+   c.common.opcode = cmd.opcode;
+   c.common.flags = cmd.flags;
+   c.common.nsid = cpu_to_le32(cmd.nsid);
+   c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
+   c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
+   c.common.cdw10 = cpu_to_le32(cmd.cdw10);
+   c.common.cdw11 = cpu_to_le32(cmd.cdw11);
+   c.common.cdw12 = cpu_to_le32(cmd.cdw12);
+   c.common.cdw13 = cpu_to_le32(cmd.cdw13);
+   c.common.cdw14 = cpu_to_le32(cmd.cdw14);
+   c.common.cdw15 = cpu_to_le32(cmd.cdw15);
+
+   if (cmd.timeout_ms)
+   timeout = msecs_to_jiffies(cmd.timeout_ms);
+
+   effects = nvme_passthru_start(ctrl, ns, cmd.opcode);
+   status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
+   (void __user *)(uintptr_t)cmd.addr, cmd.data_len,
+   (void __user *)(uintptr_t)cmd.metadata,
+   cmd.metadata_len, 0, &result, timeout);
+   nvme_passthru_end(ctrl, effects);
+
+   if (status >= 0) {
+   if (put_user(result, &ucmd->result))
+   return -EFAULT;
+   }
+
+   return status;
+}
+
+static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+   struct nvme_passthru_cmd64 __user *ucmd)
+{
+   struct nvme_passthru_cmd64 cmd;
+   struct nvme_command c;
+   unsigned timeout = 0;
+   u32 effects;
int status;
 
if (!capable(CAP_SYS_ADMIN))
@@ -1414,6 +1462,41 @@ static void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx)
srcu_read_unlock(&head->srcu, idx);
 }
 
+static bool is_ctrl_ioctl(unsigned int cmd)
+{
+   if (cmd == NVME_IOCTL_ADMIN_CMD || cmd == NVME_IOCTL_ADMIN64_CMD)
+   return true;
+   if (is_sed_ioctl(cmd))
+   return true;
+   return false;
+}
+
+static int nvme_handle_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
+ void __user *argp,
+ struct nvme_ns_head *head,
+ int srcu_idx)
+{
+   struct nvme_ctrl *ctrl = ns->ctrl;
+   int ret;
+
+   nvme_get_ctrl(ns->ctrl);
+   nvme_put_ns_from_disk(head, srcu_idx);
+
+   switch (cmd) {
+   case NVME_IOCTL_ADMIN_CMD:
+   ret = nvme_user_cmd(ctrl, NULL, argp);
+   break;
+   case NVME_IOCTL_ADMIN64_CMD:
+   ret = nvme_user_cmd64(ctrl, NULL, argp);
+   break;

Re: [PATCH v2] nvme: allow 64-bit results in passthru commands

2019-08-19 Thread Marta Rybczynska



- On 16 Aug, 2019, at 15:16, Christoph Hellwig h...@lst.de wrote:

> Sorry for not replying to the earlier version, and thanks for doing
> this work.
> 
> I wonder if instead of using our own structure we'd just use
> a full nvme SQE for the input and CQE for that output.  Even if we
> reserve a few fields that means we are ready for any newly used
> field (at least until the SQE/CQE sizes are expanded..).

We could do that; nvme_command and nvme_completion are already UAPI.
On the other hand, that would mean not filling out certain fields like
command_id. We can take that approach.
> 
> On Fri, Aug 16, 2019 at 11:47:21AM +0200, Marta Rybczynska wrote:
>> It is not possible to get 64-bit results from the passthru commands,
>> which prevents reading the Capabilities (CAP) property value.
>> 
>> As a result, it is not possible to implement IOL's NVMe Conformance
>> test 4.3 Case 1 for Fabrics targets [1] (page 123).
> 
> Note that I'm not sure passing through fabrics commands is all that
> good an idea.  But we have pending NVMe TPs that use 64-bit result
> values as well, so this seems like a good idea in general.

I'm not sure how those tests could be implemented otherwise today.
The goal is to send an arbitrary command to the target and the passthru
command is AFAIK the only solution. If we have something better
I'm sure they will be ready to use it.

Regards,
Marta


Re: [PATCH v2] nvme: fix multipath crash when ANA deactivated

2019-07-12 Thread Marta Rybczynska



- On 10 Jul, 2019, at 18:38, Christoph Hellwig h...@lst.de wrote:

> On Wed, Jul 10, 2019 at 07:26:46AM +0200, Marta Rybczynska wrote:
>> Christoph, why would you like to put the use_ana function in the header?
>> It isn't used anywhere else outside of that file.
> 
> nvme_ctrl_use_ana has a single caller in core.c as well.

Thanks Christoph, I'm testing a version similar to yours.

Marta


[PATCH] nvme: allow 64-bit results in passthru commands

2019-07-26 Thread Marta Rybczynska
It is not possible to get 64-bit results from the passthru commands,
which prevents reading the Capabilities (CAP) property value.

This issue has already been discussed [1], but without a solution.

This patch solves the problem by adding new ioctls with a new
passthru structure, including 64-bit results. The older ioctls stay
unchanged.

[1] http://lists.infradead.org/pipermail/linux-nvme/2018-June/018791.html

Signed-off-by: Marta Rybczynska 
---
 drivers/nvme/host/core.c| 98 -
 include/uapi/linux/nvme_ioctl.h | 23 ++
 2 files changed, 109 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index cc09b81..c77d7ea 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -849,7 +849,7 @@ static void *nvme_add_user_metadata(struct bio *bio, void __user *ubuf,
 static int nvme_submit_user_cmd(struct request_queue *q,
struct nvme_command *cmd, void __user *ubuffer,
unsigned bufflen, void __user *meta_buffer, unsigned meta_len,
-   u32 meta_seed, u32 *result, unsigned timeout)
+   u32 meta_seed, u64 *result, unsigned timeout)
 {
bool write = nvme_is_write(cmd);
struct nvme_ns *ns = q->queuedata;
@@ -890,7 +890,7 @@ static int nvme_submit_user_cmd(struct request_queue *q,
else
ret = nvme_req(req)->status;
if (result)
-   *result = le32_to_cpu(nvme_req(req)->result.u32);
+   *result = le64_to_cpu(nvme_req(req)->result.u64);
if (meta && !ret && !write) {
if (copy_to_user(meta_buffer, meta, meta_len))
ret = -EFAULT;
@@ -1324,13 +1324,61 @@ static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
nvme_queue_scan(ctrl);
 }
 
-static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
-   struct nvme_passthru_cmd __user *ucmd)
+static int nvme_user_cmd32(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+  struct nvme_passthru_cmd __user *ucmd)
 {
struct nvme_passthru_cmd cmd;
struct nvme_command c;
unsigned timeout = 0;
u32 effects;
+   u64 result;
+   int status;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EACCES;
+   if (copy_from_user(&cmd, ucmd, sizeof(cmd)))
+   return -EFAULT;
+   if (cmd.flags)
+   return -EINVAL;
+
+   memset(&c, 0, sizeof(c));
+   c.common.opcode = cmd.opcode;
+   c.common.flags = cmd.flags;
+   c.common.nsid = cpu_to_le32(cmd.nsid);
+   c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
+   c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
+   c.common.cdw10 = cpu_to_le32(cmd.cdw10);
+   c.common.cdw11 = cpu_to_le32(cmd.cdw11);
+   c.common.cdw12 = cpu_to_le32(cmd.cdw12);
+   c.common.cdw13 = cpu_to_le32(cmd.cdw13);
+   c.common.cdw14 = cpu_to_le32(cmd.cdw14);
+   c.common.cdw15 = cpu_to_le32(cmd.cdw15);
+
+   if (cmd.timeout_ms)
+   timeout = msecs_to_jiffies(cmd.timeout_ms);
+
+   effects = nvme_passthru_start(ctrl, ns, cmd.opcode);
+   status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
+   (void __user *)(uintptr_t)cmd.addr, cmd.data_len,
+   (void __user *)(uintptr_t)cmd.metadata,
+   cmd.metadata_len, 0, &result, timeout);
+   nvme_passthru_end(ctrl, effects);
+
+   if (status >= 0) {
+   if (put_user(result, &ucmd->result))
+   return -EFAULT;
+   }
+
+   return status;
+}
+
+static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+  struct nvme_passthru_cmd64 __user *ucmd)
+{
+   struct nvme_passthru_cmd64 cmd;
+   struct nvme_command c;
+   unsigned timeout = 0;
+   u32 effects;
int status;
 
if (!capable(CAP_SYS_ADMIN))
@@ -1371,6 +1419,16 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
return status;
 }
 
+static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
+   void __user *argp, unsigned int cmd)
+{
+   if (cmd == NVME_IOCTL_ADMIN_CMD)
+   return nvme_user_cmd32(ctrl, ns, argp);
+   else if (cmd == NVME_IOCTL_ADMIN64_CMD)
+   return nvme_user_cmd64(ctrl, ns, argp);
+   return -EINVAL;
+}
+
 /*
  * Issue ioctl requests on the first available path.  Note that unlike normal
  * block layer requests we will not retry failed request on another controller.
@@ -1401,6 +1459,13 @@ static void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx)
srcu_read_unlock(&head->srcu, idx);
 }
 
+static bool is_admin_cmd(unsigned int cmd)
+{
+   if ((cmd == NVME_IOCTL_ADMIN_CMD) || (cmd == NVME_IOCTL_ADMIN64_CMD))

[PATCH v3] nvme: fix multipath crash when ANA deactivated

2019-07-22 Thread Marta Rybczynska
Fix a crash with multipath activated. It happens when the ANA log
page is larger than MDTS and ANA is therefore disabled. The driver
then tries to access an unallocated buffer when connecting to an
NVMe target. The signature is as follows:

[  300.433586] nvme nvme0: ANA log page size (8208) larger than MDTS (8192).
[  300.435387] nvme nvme0: disabling ANA support.
[  300.437835] nvme nvme0: creating 4 I/O queues.
[  300.459132] nvme nvme0: new ctrl: NQN "nqn.0.0.0", addr 10.91.0.1:8009
[  300.464609] BUG: unable to handle kernel NULL pointer dereference at 
0008
[  300.466342] #PF error: [normal kernel read fault]
[  300.467385] PGD 0 P4D 0
[  300.467987] Oops:  [#1] SMP PTI
[  300.468787] CPU: 3 PID: 50 Comm: kworker/u8:1 Not tainted 5.0.20kalray+ #4
[  300.470264] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  300.471532] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[  300.472724] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
[  300.474038] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 
55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 
08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
[  300.477374] RSP: 0018:a50e80fd7cb8 EFLAGS: 00010296
[  300.478334] RAX: 0001 RBX: 9130f1872258 RCX: 
[  300.479784] RDX: c06c4c30 RSI: 9130edad4280 RDI: 9130f1872258
[  300.481488] RBP:  R08: 0001 R09: 0044
[  300.483203] R10: 0220 R11: 0040 R12: 9130f18722c0
[  300.484928] R13: 9130f18722d0 R14: 9130edad4280 R15: 9130f18722c0
[  300.486626] FS:  () GS:9130f7b8() 
knlGS:
[  300.488538] CS:  0010 DS:  ES:  CR0: 80050033
[  300.489907] CR2: 0008 CR3: 0002365e6000 CR4: 06e0
[  300.491612] DR0:  DR1:  DR2: 
[  300.493303] DR3:  DR6: fffe0ff0 DR7: 0400
[  300.494991] Call Trace:
[  300.495645]  nvme_mpath_add_disk+0x5c/0xb0 [nvme_core]
[  300.496880]  nvme_validate_ns+0x2ef/0x550 [nvme_core]
[  300.498105]  ? nvme_identify_ctrl.isra.45+0x6a/0xb0 [nvme_core]
[  300.499539]  nvme_scan_work+0x2b4/0x370 [nvme_core]
[  300.500717]  ? __switch_to_asm+0x35/0x70
[  300.501663]  process_one_work+0x171/0x380
[  300.502340]  worker_thread+0x49/0x3f0
[  300.503079]  kthread+0xf8/0x130
[  300.503795]  ? max_active_store+0x80/0x80
[  300.504690]  ? kthread_bind+0x10/0x10
[  300.505502]  ret_from_fork+0x35/0x40
[  300.506280] Modules linked in: nvme_tcp nvme_rdma rdma_cm iw_cm ib_cm 
ib_core nvme_fabrics nvme_core xt_physdev ip6table_raw ip6table_mangle 
ip6table_filter ip6_tables xt_comment iptable_nat nf_nat_ipv4 nf_nat 
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle 
iptable_filter veth ebtable_filter ebtable_nat ebtables iptable_raw vxlan 
ip6_udp_tunnel udp_tunnel sunrpc joydev pcspkr virtio_balloon br_netfilter 
bridge stp llc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net 
virtio_console net_failover virtio_blk failover ata_piix serio_raw libata 
virtio_pci virtio_ring virtio
[  300.514984] CR2: 0008
[  300.515569] ---[ end trace faa2eefad7e7f218 ]---
[  300.516354] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
[  300.517330] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 
55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 
08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
[  300.520353] RSP: 0018:a50e80fd7cb8 EFLAGS: 00010296
[  300.521229] RAX: 0001 RBX: 9130f1872258 RCX: 
[  300.522399] RDX: c06c4c30 RSI: 9130edad4280 RDI: 9130f1872258
[  300.523560] RBP:  R08: 0001 R09: 0044
[  300.524734] R10: 0220 R11: 0040 R12: 9130f18722c0
[  300.525915] R13: 9130f18722d0 R14: 9130edad4280 R15: 9130f18722c0
[  300.527084] FS:  () GS:9130f7b8() 
knlGS:
[  300.528396] CS:  0010 DS:  ES:  CR0: 80050033
[  300.529440] CR2: 0008 CR3: 0002365e6000 CR4: 06e0
[  300.530739] DR0:  DR1:  DR2: 
[  300.531989] DR3:  DR6: fffe0ff0 DR7: 0400
[  300.533264] Kernel panic - not syncing: Fatal exception
[  300.534338] Kernel Offset: 0x17c0 from 0x8100 (relocation 
range: 0x8000-0xbfff)
[  300.536227] ---[ end Kernel panic - not syncing: Fatal exception ]---

Condition check refactoring from Christoph Hellwig.

Signed-off-by: Marta Rybczynska 
Tested-by: Jean-Baptiste Riaux 
---
 drivers/nvme/host/multipath.c | 8 ++--
 drivers/nvme/host/nvme.h  | 6 +-
 2 files changed, 7 insertions(+), 7 deletions(-)
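
For reference, the direction of the fix is to treat the allocated log buffer
as the indicator that ANA is usable, so the multipath code bails out before
parsing. A minimal sketch (assuming the driver's field names, not the exact
hunks):

	/* ANA is usable only once the log buffer has been allocated, so
	 * callers such as nvme_mpath_add_disk() no longer dereference a
	 * NULL ana_log_buf when ANA was disabled at init time. */
	static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
	{
		return ctrl->ana_log_buf != NULL;
	}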

Re: [PATCH] nvme: fix multipath crash when ANA deactivated

2019-07-02 Thread Marta Rybczynska
- On 2 Jul, 2019, at 11:31, Hannes Reinecke h...@suse.de wrote:

> On 7/1/19 12:10 PM, Marta Rybczynska wrote:
>> Fix a crash with multipath activated. It happens when the ANA log
>> page is larger than MDTS and ANA is therefore disabled. When
>> connecting to the target, the driver then tries in nvme_parse_ana_log()
>> to access ctrl->ana_log_buf (set up in nvme_mpath_init()), which is
>> unallocated. The signature is as follows:
>> 
>> [  300.433586] nvme nvme0: ANA log page size (8208) larger than MDTS (8192).
>> [  300.435387] nvme nvme0: disabling ANA support.
>> [  300.437835] nvme nvme0: creating 4 I/O queues.
>> [  300.459132] nvme nvme0: new ctrl: NQN "nqn.0.0.0", addr 10.91.0.1:8009
>> [  300.464609] BUG: unable to handle kernel NULL pointer dereference at
>> 0008
>> [  300.466342] #PF error: [normal kernel read fault]
>> [  300.467385] PGD 0 P4D 0
>> [  300.467987] Oops:  [#1] SMP PTI
>> [  300.468787] CPU: 3 PID: 50 Comm: kworker/u8:1 Not tainted 5.0.20kalray+ #4
>> [  300.470264] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>> [  300.471532] Workqueue: nvme-wq nvme_scan_work [nvme_core]
>> [  300.472724] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
>> [  300.474038] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 
>> 41 55
>> 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 
>> 08
>> 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
>> [  300.477374] RSP: 0018:a50e80fd7cb8 EFLAGS: 00010296
>> [  300.478334] RAX: 0001 RBX: 9130f1872258 RCX: 
>> 
>> [  300.479784] RDX: c06c4c30 RSI: 9130edad4280 RDI: 
>> 9130f1872258
>> [  300.481488] RBP:  R08: 0001 R09: 
>> 0044
>> [  300.483203] R10: 0220 R11: 0040 R12: 
>> 9130f18722c0
>> [  300.484928] R13: 9130f18722d0 R14: 9130edad4280 R15: 
>> 9130f18722c0
>> [  300.486626] FS:  () GS:9130f7b8()
>> knlGS:
>> [  300.488538] CS:  0010 DS:  ES:  CR0: 80050033
>> [  300.489907] CR2: 0008 CR3: 0002365e6000 CR4: 
>> 06e0
>> [  300.491612] DR0:  DR1:  DR2: 
>> 
>> [  300.493303] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [  300.494991] Call Trace:
>> [  300.495645]  nvme_mpath_add_disk+0x5c/0xb0 [nvme_core]
>> [  300.496880]  nvme_validate_ns+0x2ef/0x550 [nvme_core]
>> [  300.498105]  ? nvme_identify_ctrl.isra.45+0x6a/0xb0 [nvme_core]
>> [  300.499539]  nvme_scan_work+0x2b4/0x370 [nvme_core]
>> [  300.500717]  ? __switch_to_asm+0x35/0x70
>> [  300.501663]  process_one_work+0x171/0x380
>> [  300.502340]  worker_thread+0x49/0x3f0
>> [  300.503079]  kthread+0xf8/0x130
>> [  300.503795]  ? max_active_store+0x80/0x80
>> [  300.504690]  ? kthread_bind+0x10/0x10
>> [  300.505502]  ret_from_fork+0x35/0x40
>> [  300.506280] Modules linked in: nvme_tcp nvme_rdma rdma_cm iw_cm ib_cm 
>> ib_core
>> nvme_fabrics nvme_core xt_physdev ip6table_raw ip6table_mangle 
>> ip6table_filter
>> ip6_tables xt_comment iptable_nat nf_nat_ipv4 nf_nat nf_conntrack
>> nf_defrag_ipv6 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle iptable_filter veth
>> ebtable_filter ebtable_nat ebtables iptable_raw vxlan ip6_udp_tunnel 
>> udp_tunnel
>> sunrpc joydev pcspkr virtio_balloon br_netfilter bridge stp llc ip_tables xfs
>> libcrc32c ata_generic pata_acpi virtio_net virtio_console net_failover
>> virtio_blk failover ata_piix serio_raw libata virtio_pci virtio_ring virtio
>> [  300.514984] CR2: 0008
>> [  300.515569] ---[ end trace faa2eefad7e7f218 ]---
>> [  300.516354] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
>> [  300.517330] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 
>> 41 55
>> 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 
>> 08
>> 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
>> [  300.520353] RSP: 0018:a50e80fd7cb8 EFLAGS: 00010296
>> [  300.521229] RAX: 0001 RBX: 9130f1872258 RCX: 
>> 
>> [  300.522399] RDX: c06c4c30 RSI: 9130edad4280 RDI: 
>> 9130f1872258
>> [  300.523560] RBP:  R08: 0001 R09: 
>> 0044
>> [  300.524734] R10: 0220 R11: 0040 R12: 
>> 9130f18722c0
[  300.525915] R13: 9130f18722d0 R14: 9130edad4280 R15: 9130f18722c0

[PATCH] nvme: fix multipath crash when ANA deactivated

2019-07-01 Thread Marta Rybczynska
Fix a crash with multipath activated. It happens when the ANA log
page is larger than MDTS and ANA is therefore disabled. When
connecting to the target, the driver then tries in nvme_parse_ana_log()
to access ctrl->ana_log_buf (set up in nvme_mpath_init()), which is
unallocated. The signature is as follows:

[  300.433586] nvme nvme0: ANA log page size (8208) larger than MDTS (8192).
[  300.435387] nvme nvme0: disabling ANA support.
[  300.437835] nvme nvme0: creating 4 I/O queues.
[  300.459132] nvme nvme0: new ctrl: NQN "nqn.0.0.0", addr 10.91.0.1:8009
[  300.464609] BUG: unable to handle kernel NULL pointer dereference at 
0008
[  300.466342] #PF error: [normal kernel read fault]
[  300.467385] PGD 0 P4D 0
[  300.467987] Oops:  [#1] SMP PTI
[  300.468787] CPU: 3 PID: 50 Comm: kworker/u8:1 Not tainted 5.0.20kalray+ #4
[  300.470264] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  300.471532] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[  300.472724] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
[  300.474038] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 
55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 
08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
[  300.477374] RSP: 0018:a50e80fd7cb8 EFLAGS: 00010296
[  300.478334] RAX: 0001 RBX: 9130f1872258 RCX: 
[  300.479784] RDX: c06c4c30 RSI: 9130edad4280 RDI: 9130f1872258
[  300.481488] RBP:  R08: 0001 R09: 0044
[  300.483203] R10: 0220 R11: 0040 R12: 9130f18722c0
[  300.484928] R13: 9130f18722d0 R14: 9130edad4280 R15: 9130f18722c0
[  300.486626] FS:  () GS:9130f7b8() 
knlGS:
[  300.488538] CS:  0010 DS:  ES:  CR0: 80050033
[  300.489907] CR2: 0008 CR3: 0002365e6000 CR4: 06e0
[  300.491612] DR0:  DR1:  DR2: 
[  300.493303] DR3:  DR6: fffe0ff0 DR7: 0400
[  300.494991] Call Trace:
[  300.495645]  nvme_mpath_add_disk+0x5c/0xb0 [nvme_core]
[  300.496880]  nvme_validate_ns+0x2ef/0x550 [nvme_core]
[  300.498105]  ? nvme_identify_ctrl.isra.45+0x6a/0xb0 [nvme_core]
[  300.499539]  nvme_scan_work+0x2b4/0x370 [nvme_core]
[  300.500717]  ? __switch_to_asm+0x35/0x70
[  300.501663]  process_one_work+0x171/0x380
[  300.502340]  worker_thread+0x49/0x3f0
[  300.503079]  kthread+0xf8/0x130
[  300.503795]  ? max_active_store+0x80/0x80
[  300.504690]  ? kthread_bind+0x10/0x10
[  300.505502]  ret_from_fork+0x35/0x40
[  300.506280] Modules linked in: nvme_tcp nvme_rdma rdma_cm iw_cm ib_cm 
ib_core nvme_fabrics nvme_core xt_physdev ip6table_raw ip6table_mangle 
ip6table_filter ip6_tables xt_comment iptable_nat nf_nat_ipv4 nf_nat 
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle 
iptable_filter veth ebtable_filter ebtable_nat ebtables iptable_raw vxlan 
ip6_udp_tunnel udp_tunnel sunrpc joydev pcspkr virtio_balloon br_netfilter 
bridge stp llc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net 
virtio_console net_failover virtio_blk failover ata_piix serio_raw libata 
virtio_pci virtio_ring virtio
[  300.514984] CR2: 0008
[  300.515569] ---[ end trace faa2eefad7e7f218 ]---
[  300.516354] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
[  300.517330] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 
55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 
08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
[  300.520353] RSP: 0018:a50e80fd7cb8 EFLAGS: 00010296
[  300.521229] RAX: 0001 RBX: 9130f1872258 RCX: 
[  300.522399] RDX: c06c4c30 RSI: 9130edad4280 RDI: 9130f1872258
[  300.523560] RBP:  R08: 0001 R09: 0044
[  300.524734] R10: 0220 R11: 0040 R12: 9130f18722c0
[  300.525915] R13: 9130f18722d0 R14: 9130edad4280 R15: 9130f18722c0
[  300.527084] FS:  () GS:9130f7b8() 
knlGS:
[  300.528396] CS:  0010 DS:  ES:  CR0: 80050033
[  300.529440] CR2: 0008 CR3: 0002365e6000 CR4: 06e0
[  300.530739] DR0:  DR1:  DR2: 
[  300.531989] DR3:  DR6: fffe0ff0 DR7: 0400
[  300.533264] Kernel panic - not syncing: Fatal exception
[  300.534338] Kernel Offset: 0x17c0 from 0x8100 (relocation 
range: 0x8000-0xbfff)
[  300.536227] ---[ end Kernel panic - not syncing: Fatal exception ]---

Signed-off-by: Marta Rybczynska 
Tested-by: Jean-Baptiste Riaux 
---
 drivers/nvme/host/multipath.c | 12 ++--
 drivers/nvme/host/nvme.h  |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

[PATCH v2] nvme: fix multipath crash when ANA deactivated

2019-07-05 Thread Marta Rybczynska
Fix a crash with multipath activated. It happens when the ANA log
page is larger than MDTS and ANA is therefore disabled. The driver
then tries to access an unallocated buffer when connecting to an
NVMe target. The signature is as follows:

[  300.433586] nvme nvme0: ANA log page size (8208) larger than MDTS (8192).
[  300.435387] nvme nvme0: disabling ANA support.
[  300.437835] nvme nvme0: creating 4 I/O queues.
[  300.459132] nvme nvme0: new ctrl: NQN "nqn.0.0.0", addr 10.91.0.1:8009
[  300.464609] BUG: unable to handle kernel NULL pointer dereference at 
0008
[  300.466342] #PF error: [normal kernel read fault]
[  300.467385] PGD 0 P4D 0
[  300.467987] Oops:  [#1] SMP PTI
[  300.468787] CPU: 3 PID: 50 Comm: kworker/u8:1 Not tainted 5.0.20kalray+ #4
[  300.470264] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  300.471532] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[  300.472724] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
[  300.474038] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 
55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 
08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
[  300.477374] RSP: 0018:a50e80fd7cb8 EFLAGS: 00010296
[  300.478334] RAX: 0001 RBX: 9130f1872258 RCX: 
[  300.479784] RDX: c06c4c30 RSI: 9130edad4280 RDI: 9130f1872258
[  300.481488] RBP:  R08: 0001 R09: 0044
[  300.483203] R10: 0220 R11: 0040 R12: 9130f18722c0
[  300.484928] R13: 9130f18722d0 R14: 9130edad4280 R15: 9130f18722c0
[  300.486626] FS:  () GS:9130f7b8() 
knlGS:
[  300.488538] CS:  0010 DS:  ES:  CR0: 80050033
[  300.489907] CR2: 0008 CR3: 0002365e6000 CR4: 06e0
[  300.491612] DR0:  DR1:  DR2: 
[  300.493303] DR3:  DR6: fffe0ff0 DR7: 0400
[  300.494991] Call Trace:
[  300.495645]  nvme_mpath_add_disk+0x5c/0xb0 [nvme_core]
[  300.496880]  nvme_validate_ns+0x2ef/0x550 [nvme_core]
[  300.498105]  ? nvme_identify_ctrl.isra.45+0x6a/0xb0 [nvme_core]
[  300.499539]  nvme_scan_work+0x2b4/0x370 [nvme_core]
[  300.500717]  ? __switch_to_asm+0x35/0x70
[  300.501663]  process_one_work+0x171/0x380
[  300.502340]  worker_thread+0x49/0x3f0
[  300.503079]  kthread+0xf8/0x130
[  300.503795]  ? max_active_store+0x80/0x80
[  300.504690]  ? kthread_bind+0x10/0x10
[  300.505502]  ret_from_fork+0x35/0x40
[  300.506280] Modules linked in: nvme_tcp nvme_rdma rdma_cm iw_cm ib_cm 
ib_core nvme_fabrics nvme_core xt_physdev ip6table_raw ip6table_mangle 
ip6table_filter ip6_tables xt_comment iptable_nat nf_nat_ipv4 nf_nat 
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle 
iptable_filter veth ebtable_filter ebtable_nat ebtables iptable_raw vxlan 
ip6_udp_tunnel udp_tunnel sunrpc joydev pcspkr virtio_balloon br_netfilter 
bridge stp llc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net 
virtio_console net_failover virtio_blk failover ata_piix serio_raw libata 
virtio_pci virtio_ring virtio
[  300.514984] CR2: 0008
[  300.515569] ---[ end trace faa2eefad7e7f218 ]---
[  300.516354] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
[  300.517330] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 
55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 
08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
[  300.520353] RSP: 0018:a50e80fd7cb8 EFLAGS: 00010296
[  300.521229] RAX: 0001 RBX: 9130f1872258 RCX: 
[  300.522399] RDX: c06c4c30 RSI: 9130edad4280 RDI: 9130f1872258
[  300.523560] RBP:  R08: 0001 R09: 0044
[  300.524734] R10: 0220 R11: 0040 R12: 9130f18722c0
[  300.525915] R13: 9130f18722d0 R14: 9130edad4280 R15: 9130f18722c0
[  300.527084] FS:  () GS:9130f7b8() 
knlGS:
[  300.528396] CS:  0010 DS:  ES:  CR0: 80050033
[  300.529440] CR2: 0008 CR3: 0002365e6000 CR4: 06e0
[  300.530739] DR0:  DR1:  DR2: 
[  300.531989] DR3:  DR6: fffe0ff0 DR7: 0400
[  300.533264] Kernel panic - not syncing: Fatal exception
[  300.534338] Kernel Offset: 0x17c0 from 0x8100 (relocation 
range: 0x8000-0xbfff)
[  300.536227] ---[ end Kernel panic - not syncing: Fatal exception ]---

Signed-off-by: Marta Rybczynska 
Tested-by: Jean-Baptiste Riaux 
---
 drivers/nvme/host/multipath.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 499acf0

Re: [PATCH v2] nvme: fix multipath crash when ANA deactivated

2019-07-09 Thread Marta Rybczynska



- On 9 Jul, 2019, at 23:29, Christoph Hellwig h...@lst.de wrote:

> On Sat, Jul 06, 2019 at 01:06:44PM +0300, Max Gurtovoy wrote:
>>> +   /* check if multipath is enabled and we have the capability */
>>> +   if (!multipath)
>>> +   return 0;
>>> +   if (!ctrl->subsys || ((ctrl->subsys->cmic & (1 << 3)) != 0))
>>
>> shouldn't it be:
>>
>> if (!ctrl->subsys || ((ctrl->subsys->cmic & (1 << 3)) == 0))
>>
>> or
>>
>> if (!ctrl->subsys || !(ctrl->subsys->cmic & (1 << 3)))
>>
>>
>> Otherwise, you don't really do any initialization and return 0 in case you 
>> have
>> the capability, right ?
> 
> Yes.  FYI, my idea how to fix this would be something like:

Thanks both, this was an error on my side when changing the condition.
I'll submit the next version very soon.

Christoph, why would you like to put the use_ana function in the header?
It isn't used anywhere else outside of that file.

Regards,
Marta


[PATCH] Documentation/locking/locktypes: fix local_locks documentation

2020-07-26 Thread Marta Rybczynska
Fix issues with the local_locks documentation:
- fix function names: local_lock.h has local_unlock_irqrestore(),
  not local_lock_irqrestore()
- fix the mapping table: local_unlock_irqrestore() maps to
  local_irq_restore(), not _save()
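
For reference, a minimal sketch of the corrected pair in use (hypothetical
per-CPU data, using the local_lock API as documented):

	#include <linux/local_lock.h>
	#include <linux/percpu.h>

	static DEFINE_PER_CPU(local_lock_t, stats_lock) = INIT_LOCAL_LOCK(stats_lock);
	static DEFINE_PER_CPU(unsigned long, stats_count);

	static void stats_inc(void)
	{
		unsigned long flags;

		/* protects the per-CPU counter; maps to local_irq_save()/
		 * local_irq_restore() on non-PREEMPT_RT kernels */
		local_lock_irqsave(&stats_lock, flags);
		this_cpu_inc(stats_count);
		local_unlock_irqrestore(&stats_lock, flags);
	}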

Signed-off-by: Marta Rybczynska 
---
 Documentation/locking/locktypes.rst | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/Documentation/locking/locktypes.rst b/Documentation/locking/locktypes.rst
index 1b577a8bf982..2a9c38bfdf16 100644
--- a/Documentation/locking/locktypes.rst
+++ b/Documentation/locking/locktypes.rst
@@ -164,14 +164,14 @@ by disabling preemption or interrupts.
 On non-PREEMPT_RT kernels local_lock operations map to the preemption and
 interrupt disabling and enabling primitives:

- === ==
- local_lock(&llock)  preempt_disable()
- local_unlock(&llock)preempt_enable()
- local_lock_irq(&llock)  local_irq_disable()
- local_unlock_irq(&llock)local_irq_enable()
- local_lock_save(&llock) local_irq_save()
- local_lock_restore(&llock)  local_irq_save()
- === ==
+ ===  ==
+ local_lock(&llock)   preempt_disable()
+ local_unlock(&llock) preempt_enable()
+ local_lock_irq(&llock)   local_irq_disable()
+ local_unlock_irq(&llock) local_irq_enable()
+ local_lock_irqsave(&llock)   local_irq_save()
+ local_unlock_irqrestore(&llock)  local_irq_restore()
+ ===  ==

 The named scope of local_lock has two advantages over the regular
 primitives:
@@ -353,14 +353,14 @@ protection scope. So the following substitution is wrong::
   {
 local_irq_save(flags);-> local_lock_irqsave(&local_lock_1, flags);
 func3();
-local_irq_restore(flags); -> local_lock_irqrestore(&local_lock_1, flags);
+local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_1, flags);
   }

   func2()
   {
 local_irq_save(flags);-> local_lock_irqsave(&local_lock_2, flags);
 func3();
-local_irq_restore(flags); -> local_lock_irqrestore(&local_lock_2, flags);
+local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_2, flags);
   }

   func3()
@@ -379,14 +379,14 @@ PREEMPT_RT-specific semantics of spinlock_t. The correct substitution is::
   {
 local_irq_save(flags);-> local_lock_irqsave(&local_lock, flags);
 func3();
-local_irq_restore(flags); -> local_lock_irqrestore(&local_lock, flags);
+local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
   }

   func2()
   {
 local_irq_save(flags);-> local_lock_irqsave(&local_lock, flags);
 func3();
-local_irq_restore(flags); -> local_lock_irqrestore(&local_lock, flags);
+local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
   }

   func3()
-- 
2.27.0


Re: [RFC 1/8] Introduce Peer-to-Peer memory (p2pmem) device

2017-04-25 Thread Marta Rybczynska
> On 02/04/17 03:03 PM, Sinan Kaya wrote:
>> Push the decision all the way to the user. Let them decide whether they
>> want this feature to work on a root port connected port or under the
>> switch.
> 
> Yes, I prefer this too. If other folks agree with that I'd be very happy
> to go back to user chooses. I think Sagi was the most vocal proponent
> for kernel chooses at LSF so hopefully he will read this thread and
> offer some opinion.
> 
>> I thought the issue was the feature didn't work at all with some root
>> ports or there was some kind of memory corruption issue that you were
>> trying to avoid with the existing systems.
> 
> I *think* there are some much older root ports where P2P TLPs don't even
> get through. But it doesn't really change the situation: in the nvmet
> case, the user would enable p2pmem and then be unable to connect and
> thus choose to disable it going forward. Not a big difference from the
> user seeing bad performance and not choosing to enable it.
> 
>> I think you should get rid of all pci searching business in your code.
> 
> Yes, my original proposal was when you configure the nvme target you
> chose the specific p2pmem device to use. That code had no tie ins to PCI
> code and could, in theory, work generically with any device and bus.
> 

I would add one issue that doesn't seem to be addressed: in my experience
P2P doesn't work when the IOMMU is activated. It works best when the IOMMU
is deactivated at the BIOS level; even the kernel options are not enough in
some cases.
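
For reference, the kernel options in question are boot parameters like the
ones below (x86 examples; which one applies depends on the platform, and as
said above they are not always sufficient):

    intel_iommu=off    # Intel VT-d systems
    amd_iommu=off      # AMD-Vi systems
    iommu=off          # generic x86 IOMMU switch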

This is another argument to leave the choice to the user/integrator.

Marta


Re: [PATCH v2] nvme-rdma: remove race conditions from IB signalling

2017-07-06 Thread Marta Rybczynska
> On Wed, Jun 21, 2017 at 05:14:41PM +0200, Marta Rybczynska wrote:
>> I do not see this one in nvme-4.13. Can we get it in, please?
>> We're seeing the races in our setup and this patch fixes it.
> 
> I've added it.  Note that your mail was whitespace damaged, so I had
> to apply it manually.  I used that opportunity to also move the comments
> around a bit.

Thank you! And sorry for the issue.

Best regards,
Marta


[PATCH v2] nvme-rdma: remove race conditions from IB signalling

2017-06-06 Thread Marta Rybczynska
This patch improves the way the RDMA IB signalling is done
by using atomic operations for the signalling variable. This
avoids race conditions on sig_count.

The signalling interval changes slightly and is now the
largest power of two not larger than queue depth / 2.

ilog2() usage idea by Bart Van Assche.

Signed-off-by: Marta Rybczynska 
Reviewed-by: Sagi Grimberg 
---

Changes from v1:
* remove nvme_rdma_init_sig_count, put all into
  nvme_rdma_queue_sig_limit

---
 drivers/nvme/host/rdma.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 28bd255..4eb4846 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -88,7 +88,7 @@ enum nvme_rdma_queue_flags {
 
 struct nvme_rdma_queue {
struct nvme_rdma_qe *rsp_ring;
-   u8  sig_count;
+   atomic_tsig_count;
int queue_size;
size_t  cmnd_capsule_len;
struct nvme_rdma_ctrl   *ctrl;
@@ -554,6 +554,8 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
 
queue->queue_size = queue_size;
 
+   atomic_set(&queue->sig_count, 0);
+
queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
RDMA_PS_TCP, IB_QPT_RC);
if (IS_ERR(queue->cm_id)) {
@@ -1038,17 +1040,18 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
nvme_rdma_wr_error(cq, wc, "SEND");
 }
 
-static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
+static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
 {
-   int sig_limit;
+   int limit;
 
-   /*
-* We signal completion every queue depth/2 and also handle the
-* degenerated case of a  device with queue_depth=1, where we
-* would need to signal every message.
+   /* We want to signal completion at least every queue depth/2.
+* This returns the largest power of two that is not above half
+* of (queue size + 1) to optimize (avoid divisions).
 */
-   sig_limit = max(queue->queue_size / 2, 1);
-   return (++queue->sig_count % sig_limit) == 0;
+   limit = 1 << ilog2((queue->queue_size + 1) / 2);
+
+   /* Signal if sig_count is a multiple of limit */
+   return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0;
 }
 
 static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
-- 
1.8.3.1
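
For reference, a standalone userspace sketch of the two ideas the patch
combines: atomic_inc_return() makes the counter race-free, and for a
power-of-two limit the mask test is equivalent to a modulo (a C11 demo
with illustrative names, not kernel code):

#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Largest power of two not above half of (queue_size + 1),
 * mirroring 1 << ilog2((queue_size + 1) / 2) from the patch.
 * For queue_size == 1 this yields 1: signal every message. */
static unsigned int sig_limit(unsigned int queue_size)
{
        unsigned int half = (queue_size + 1) / 2;
        unsigned int limit = 1;

        while (limit * 2 <= half)
                limit *= 2;
        return limit;
}

int main(void)
{
        atomic_uint sig_count = 0;
        unsigned int limit = sig_limit(128);    /* queue depth 128 -> 64 */
        unsigned int signalled = 0;

        for (int i = 0; i < 1024; i++) {
                /* atomic_fetch_add() returns the old value; +1 gives the
                 * new one, like the kernel's atomic_inc_return(). */
                unsigned int count = atomic_fetch_add(&sig_count, 1) + 1;

                /* For a power-of-two limit, & (limit - 1) == % limit. */
                assert(((count & (limit - 1)) == 0) == (count % limit == 0));
                if ((count & (limit - 1)) == 0)
                        signalled++;
        }
        printf("limit=%u, signalled %u of 1024 sends\n", limit, signalled);
        return 0;
}
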


[PATCH] nvme-rdma: remove race conditions from IB signalling

2017-06-05 Thread Marta Rybczynska
This patch improves the way the RDMA IB signalling is done
by using atomic operations for the signalling variable. This
avoids race conditions on sig_count.

The signalling interval changes slightly and is now the
largest power of two not larger than queue depth / 2.

ilog2() usage idea by Bart Van Assche.

Signed-off-by: Marta Rybczynska 
---
 drivers/nvme/host/rdma.c | 31 +--
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 28bd255..80682f7 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -88,7 +88,7 @@ enum nvme_rdma_queue_flags {
 
 struct nvme_rdma_queue {
struct nvme_rdma_qe *rsp_ring;
-   u8  sig_count;
+   atomic_tsig_count;
int queue_size;
size_t  cmnd_capsule_len;
struct nvme_rdma_ctrl   *ctrl;
@@ -554,6 +554,8 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
 
queue->queue_size = queue_size;
 
+   atomic_set(&queue->sig_count, 0);
+
queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
RDMA_PS_TCP, IB_QPT_RC);
if (IS_ERR(queue->cm_id)) {
@@ -1038,17 +1040,26 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
nvme_rdma_wr_error(cq, wc, "SEND");
 }
 
-static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
+static inline int nvme_rdma_init_sig_count(int queue_size)
 {
-   int sig_limit;
-
-   /*
-* We signal completion every queue depth/2 and also handle the
-* degenerated case of a  device with queue_depth=1, where we
-* would need to signal every message.
+   /* We want to signal completion at least every queue depth/2.
+* This returns the largest power of two that is not above half
+* of (queue size + 1) to optimize (avoid divisions).
 */
-   sig_limit = max(queue->queue_size / 2, 1);
-   return (++queue->sig_count % sig_limit) == 0;
+   return 1 << ilog2((queue_size + 1) / 2);
+}
+
+static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
+{
+   int count, limit;
+
+   limit = nvme_rdma_init_sig_count(queue->queue_size);
+   count = atomic_inc_return(&queue->sig_count);
+
+   /* Signal if count is a multiple of limit */
+   if ((count & (limit - 1)) == 0)
+   return true;
+   return false;
 }
 
 static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
-- 
1.8.3.1


Re: [PATCH] nvme-rdma: remove race conditions from IB signalling

2017-06-05 Thread Marta Rybczynska

>> -static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>> +static inline int nvme_rdma_init_sig_count(int queue_size)
>>   {
>> -   int sig_limit;
>> -
>> -   /*
>> -* We signal completion every queue depth/2 and also handle the
>> -* degenerated case of a  device with queue_depth=1, where we
>> -* would need to signal every message.
>> +   /* We want to signal completion at least every queue depth/2.
>> +* This returns the largest power of two that is not above half
>> +* of (queue size + 1) to optimize (avoid divisions).
>>   */
>> -   sig_limit = max(queue->queue_size / 2, 1);
>> -   return (++queue->sig_count % sig_limit) == 0;
>> +   return 1 << ilog2((queue_size + 1) / 2);
>> +}
>> +
>> +static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>> +{
>> +   int count, limit;
>> +
>> +   limit = nvme_rdma_init_sig_count(queue->queue_size);
> 
> Why calling it init? you're not initializing anything...
> 
> I'd just call it nvme_rdma_sig_limit()
> 
>> +   count = atomic_inc_return(&queue->sig_count);
>> +
>> +   /* Signal if count is a multiple of limit */
>> +   if ((count & (limit - 1)) == 0)
>> +   return true;
>> +   return false;
>>   }
> 
> You can replace it with:
>   return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0;
> 
> And lose the local count variable.

The init part is a leftover from one of the previous versions and we do not need
the value anywhere else. As the sig_limit() function is quite short now, I can
also put the ilog2() part and its comments in the same place. Wouldn't it be
easier to read?

Marta


Re: [PATCH] nvme-rdma: remove race conditions from IB signalling

2017-06-05 Thread Marta Rybczynska


- Original Message -
> From: "Marta Rybczynska" 
> To: "Sagi Grimberg" 
> Cc: ax...@fb.com, "Leon Romanovsky" , 
> linux-kernel@vger.kernel.org, linux-n...@lists.infradead.org,
> "keith busch" , "Doug Ledford" , 
> "Bart Van Assche"
> , h...@lst.de, "Jason Gunthorpe" 
> 
> Sent: Monday, 5 June 2017 13:39:40
> Subject: Re: [PATCH] nvme-rdma: remove race conditions from IB signalling

>>> -static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>>> +static inline int nvme_rdma_init_sig_count(int queue_size)
>>>   {
>>> -   int sig_limit;
>>> -
>>> -   /*
>>> -* We signal completion every queue depth/2 and also handle the
>>> -* degenerated case of a  device with queue_depth=1, where we
>>> -* would need to signal every message.
>>> +   /* We want to signal completion at least every queue depth/2.
>>> +* This returns the largest power of two that is not above half
>>> +* of (queue size + 1) to optimize (avoid divisions).
>>>   */
>>> -   sig_limit = max(queue->queue_size / 2, 1);
>>> -   return (++queue->sig_count % sig_limit) == 0;
>>> +   return 1 << ilog2((queue_size + 1) / 2);
>>> +}
>>> +
>>> +static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>>> +{
>>> +   int count, limit;
>>> +
>>> +   limit = nvme_rdma_init_sig_count(queue->queue_size);
>> 
>> Why calling it init? you're not initializing anything...
>> 
>> I'd just call it nvme_rdma_sig_limit()
>> 
>>> +   count = atomic_inc_return(&queue->sig_count);
>>> +
>>> +   /* Signal if count is a multiple of limit */
>>> +   if ((count & (limit - 1)) == 0)
>>> +   return true;
>>> +   return false;
>>>   }
>> 
>> You can replace it with:
>>  return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0;
>> 
>> And lose the local count variable.
> 
> The init part is a leftover from one of the previous versions and we do not
> need the value anywhere else. As the sig_limit() function is quite short
> now, I can also put the ilog2() part and its comments in the same place.
> Wouldn't it be easier to read?
> 

It looks like this. What do you think, Sagi?

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 28bd255..4eb4846 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -88,7 +88,7 @@ enum nvme_rdma_queue_flags {
 
 struct nvme_rdma_queue {
struct nvme_rdma_qe *rsp_ring;
-   u8  sig_count;
+   atomic_tsig_count;
int queue_size;
size_t  cmnd_capsule_len;
struct nvme_rdma_ctrl   *ctrl;
@@ -554,6 +554,8 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
 
queue->queue_size = queue_size;
 
+   atomic_set(&queue->sig_count, 0);
+
queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
RDMA_PS_TCP, IB_QPT_RC);
if (IS_ERR(queue->cm_id)) {
@@ -1038,17 +1040,18 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
nvme_rdma_wr_error(cq, wc, "SEND");
 }
 
-static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
+static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
 {
-   int sig_limit;
+   int limit;
 
-   /*
-* We signal completion every queue depth/2 and also handle the
-* degenerated case of a  device with queue_depth=1, where we
-* would need to signal every message.
+   /* We want to signal completion at least every queue depth/2.
+* This returns the largest power of two that is not above half
+* of (queue size + 1) to optimize (avoid divisions).
 */
-   sig_limit = max(queue->queue_size / 2, 1);
-   return (++queue->sig_count % sig_limit) == 0;
+   limit = 1 << ilog2((queue->queue_size + 1) / 2);
+
+   /* Signal if sig_count is a multiple of limit */
+   return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0;
 }
 
 static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
-- 





Re: [PATCH] nvme-rdma: remove race conditions from IB signalling

2017-06-05 Thread Marta Rybczynska
> On 6/5/2017 12:45 PM, Marta Rybczynska wrote:
>> This patch improves the way the RDMA IB signalling is done
>> by using atomic operations for the signalling variable. This
>> avoids race conditions on sig_count.
>>
>> The signalling interval changes slightly and is now the
>> largest power of two not larger than queue depth / 2.
>>
>> ilog2() usage idea by Bart Van Assche.
>>
>> Signed-off-by: Marta Rybczynska 
>> ---
>>  drivers/nvme/host/rdma.c | 31 +--
>>  1 file changed, 21 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index 28bd255..80682f7 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -88,7 +88,7 @@ enum nvme_rdma_queue_flags {
>>
>>  struct nvme_rdma_queue {
>> struct nvme_rdma_qe *rsp_ring;
>> -   u8  sig_count;
>> +   atomic_tsig_count;
>> int queue_size;
>> size_t  cmnd_capsule_len;
>> struct nvme_rdma_ctrl   *ctrl;
>> @@ -554,6 +554,8 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
>>
>> queue->queue_size = queue_size;
>>
>> +   atomic_set(&queue->sig_count, 0);
>> +
>> queue->cm_id = rdma_create_id(&init_net, nvme_rdma_cm_handler, queue,
>> RDMA_PS_TCP, IB_QPT_RC);
>> if (IS_ERR(queue->cm_id)) {
>> @@ -1038,17 +1040,26 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
>> nvme_rdma_wr_error(cq, wc, "SEND");
>>  }
>>
>> -static inline int nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>> +static inline int nvme_rdma_init_sig_count(int queue_size)
>>  {
>> -   int sig_limit;
>> -
>> -   /*
>> -* We signal completion every queue depth/2 and also handle the
>> -* degenerated case of a  device with queue_depth=1, where we
>> -* would need to signal every message.
>> +   /* We want to signal completion at least every queue depth/2.
>> +* This returns the largest power of two that is not above half
>> +* of (queue size + 1) to optimize (avoid divisions).
>>  */
>> -   sig_limit = max(queue->queue_size / 2, 1);
>> -   return (++queue->sig_count % sig_limit) == 0;
>> +   return 1 << ilog2((queue_size + 1) / 2);
>> +}
>> +
>> +static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
>> +{
>> +   int count, limit;
>> +
>> +   limit = nvme_rdma_init_sig_count(queue->queue_size);
> 
> Maybe I'm missing something, but why do we need to calculate the limit each
> time in the data path (not a big deal, but we can avoid that)?
> Can you add queue->sig_limit that will be calculated once at queue
> allocation ?

We said last time that ilog2() is cheap enough that we do not need to add
this variable. It maps to fls() or a platform-specific instruction.
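
A quick userspace sketch of that mapping, assuming GCC/Clang builtins
(fls_demo() and ilog2_demo() are stand-ins for the kernel's generic
fls() and ilog2()):

#include <assert.h>
#include <limits.h>

/* fls(x): position of the most significant set bit, 1-based;
 * 0 when x == 0. Compiles to a single instruction on most CPUs. */
static int fls_demo(unsigned int x)
{
        return x ? (int)(sizeof(x) * CHAR_BIT) - __builtin_clz(x) : 0;
}

/* ilog2(x) == fls(x) - 1 == floor(log2(x)) for x > 0. */
static int ilog2_demo(unsigned int x)
{
        return fls_demo(x) - 1;
}

int main(void)
{
        assert(ilog2_demo(1) == 0);
        assert(ilog2_demo(64) == 6);    /* exact power of two */
        assert(ilog2_demo(100) == 6);   /* 1 << 6 == 64 <= 100 < 128 */
        return 0;
}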

Marta


[tip: locking/core] Documentation/locking/locktypes: Fix local_locks documentation

2020-08-27 Thread tip-bot2 for Marta Rybczynska
The following commit has been merged into the locking/core branch of tip:

Commit-ID:     92b4e9f11a636d1723cc0866bf8b9111b1e24339
Gitweb:        https://git.kernel.org/tip/92b4e9f11a636d1723cc0866bf8b9111b1e24339
Author:        Marta Rybczynska 
AuthorDate:    Sun, 26 Jul 2020 20:54:40 +02:00
Committer:     Peter Zijlstra 
CommitterDate: Wed, 26 Aug 2020 12:42:02 +02:00

Documentation/locking/locktypes: Fix local_locks documentation

Fix issues with local_locks documentation:

 - fix function names, local_lock.h has local_unlock_irqrestore(), not
   local_lock_irqrestore()

 - fix mapping table, local_unlock_irqrestore() maps to
   local_irq_restore(), not _save()

Signed-off-by: Marta Rybczynska 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Will Deacon 
Link: https://lkml.kernel.org/r/CAApg2=skxq3sqwj6tznv-0x0cklxfkdapvxt4n15mpdmkq7...@mail.gmail.com
---
 Documentation/locking/locktypes.rst | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/Documentation/locking/locktypes.rst b/Documentation/locking/locktypes.rst
index 4cefed8..ddada4a 100644
--- a/Documentation/locking/locktypes.rst
+++ b/Documentation/locking/locktypes.rst
@@ -164,14 +164,14 @@ by disabling preemption or interrupts.
 On non-PREEMPT_RT kernels local_lock operations map to the preemption and
 interrupt disabling and enabling primitives:
 
- === ==
- local_lock(&llock)  preempt_disable()
- local_unlock(&llock)preempt_enable()
- local_lock_irq(&llock)  local_irq_disable()
- local_unlock_irq(&llock)local_irq_enable()
- local_lock_save(&llock) local_irq_save()
- local_lock_restore(&llock)  local_irq_save()
- === ==
+ ===  ==
+ local_lock(&llock)   preempt_disable()
+ local_unlock(&llock) preempt_enable()
+ local_lock_irq(&llock)   local_irq_disable()
+ local_unlock_irq(&llock) local_irq_enable()
+ local_lock_irqsave(&llock)   local_irq_save()
+ local_unlock_irqrestore(&llock)  local_irq_restore()
+ ===  ==
 
 The named scope of local_lock has two advantages over the regular
 primitives:
@@ -353,14 +353,14 @@ protection scope. So the following substitution is wrong::
   {
 local_irq_save(flags);-> local_lock_irqsave(&local_lock_1, flags);
 func3();
-local_irq_restore(flags); -> local_lock_irqrestore(&local_lock_1, flags);
+local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_1, flags);
   }
 
   func2()
   {
 local_irq_save(flags);-> local_lock_irqsave(&local_lock_2, flags);
 func3();
-local_irq_restore(flags); -> local_lock_irqrestore(&local_lock_2, flags);
+local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_2, flags);
   }
 
   func3()
@@ -379,14 +379,14 @@ PREEMPT_RT-specific semantics of spinlock_t. The correct substitution is::
   {
 local_irq_save(flags);-> local_lock_irqsave(&local_lock, flags);
 func3();
-local_irq_restore(flags); -> local_lock_irqrestore(&local_lock, flags);
+local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
   }
 
   func2()
   {
 local_irq_save(flags);-> local_lock_irqsave(&local_lock, flags);
 func3();
-local_irq_restore(flags); -> local_lock_irqrestore(&local_lock, flags);
+local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
   }
 
   func3()