Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-11 Thread Jaesoo Lee
Please drop this patch. However, I would be happy if this bug could be fixed as soon as possible. Nitzan, would you mind sending your patch for review?
On Tue, Dec 11, 2018 at 3:39 PM Sagi Grimberg wrote:
>
> > I cannot reproduce the bug with the patch; in my failure scenarios, it
> > seems that

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-11 Thread Sagi Grimberg
I cannot reproduce the bug with the patch; in my failure scenarios, it seems that completing the request on errors in nvme_rdma_send_done unblocks __nvme_submit_sync_cmd. Also, I think this is safe from double completions. However, it seems that nvme_rdma_timeout code is still no

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-11 Thread Jaesoo Lee
I cannot reproduce the bug with the patch; in my failure scenarios, it seems that completing the request on errors in nvme_rdma_send_done unblocks __nvme_submit_sync_cmd. Also, I think this is safe from double completions. However, it seems that nvme_rdma_timeout code is still not

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-11 Thread Nitzan Carmi
I was just in the middle of sending this upstream when I saw your mail, and I also thought that it addresses the same bug, although I see a slightly different call trace from yours. I would be happy if you could verify that this patch works for you too, so that we can push it upstream. On 11/12/2018

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-10 Thread Jaesoo Lee
It seems that your patch is addressing the same bug. I will see if that works for our failure scenarios. Why don't you send it upstream?
On Sun, Dec 9, 2018 at 6:22 AM Nitzan Carmi wrote:
>
> Hi,
> We encountered a similar issue.
> I think that the problem is that error_recovery might not even be

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-09 Thread Nitzan Carmi
Hi, We encountered a similar issue. I think that the problem is that error_recovery might not even be queued if we're in the DELETING state (or the CONNECTING state, for that matter), because we cannot move from those states to RESETTING. We prepared some patches which handle completions in case suc
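The state constraint described in this message can be illustrated with a small userspace model. This is only a sketch: the enum, the function names, and the assumption that LIVE is the only state allowed to enter RESETTING are hypothetical simplifications for illustration, not the kernel's actual state machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified controller states (not the kernel's enum). */
enum ctrl_state { CTRL_LIVE, CTRL_RESETTING, CTRL_CONNECTING, CTRL_DELETING };

/*
 * Assumption for this model: only LIVE may move to RESETTING.
 * DELETING and CONNECTING are rejected, as described in the message.
 */
static bool can_enter_resetting(enum ctrl_state old)
{
    switch (old) {
    case CTRL_LIVE:
        return true;
    case CTRL_CONNECTING:
    case CTRL_DELETING:
    default:
        return false;
    }
}

/*
 * Models the start of error recovery: the recovery work is only
 * "queued" (return value true) if the state change succeeds.
 */
static bool try_error_recovery(enum ctrl_state *state)
{
    assert(state != NULL);
    if (!can_enter_resetting(*state))
        return false;   /* recovery never runs, nothing completes the request */
    *state = CTRL_RESETTING;
    return true;
}
```

In this model, a timeout handler that relies on error recovery to complete the request makes no progress whenever try_error_recovery() returns false, which is the situation the message describes for DELETING and CONNECTING.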

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-07 Thread Jaesoo Lee
Now, I see that my patch is not safe and can cause double completions. However, I am having a hard time finding a good way to put a barrier between the racing completions. Could you suggest where the fix should go and what it should look like? We can provide more details on reproducing this issue if th
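One generic way to keep two racing paths from completing the same request twice is an atomic "first caller wins" flag. The sketch below is a userspace illustration using C11 atomics; struct fake_request and both function names are hypothetical, and this is not the mechanism used by the nvme-rdma driver or the patches discussed in this thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request; not a kernel structure. */
struct fake_request {
    atomic_bool completed;  /* set by whichever path completes first */
    int completions;        /* counts how many times completion ran */
};

static void fake_request_init(struct fake_request *rq)
{
    assert(rq != NULL);
    atomic_init(&rq->completed, false);
    rq->completions = 0;
}

/*
 * Called from both the (modelled) timeout path and the (modelled)
 * send-done path. atomic_exchange returns the previous value, so only
 * the first caller sees false and performs the completion.
 */
static bool fake_request_complete_once(struct fake_request *rq)
{
    if (atomic_exchange(&rq->completed, true))
        return false;       /* someone else already completed it */
    rq->completions++;
    return true;
}
```

Whichever path calls fake_request_complete_once() second gets false and must not touch the request further, so the completion body runs exactly once per request.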

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-07 Thread Keith Busch
On Fri, Dec 07, 2018 at 12:05:37PM -0800, Sagi Grimberg wrote:
>
> > Could you please take a look at this bug and review the code?
> >
> > We are seeing more instances of this bug and found that reconnect_work
> > could hang as well, as can be seen from the stack trace below.
> >
> > Workqueue: nvme-wq

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-07 Thread Sagi Grimberg
Could you please take a look at this bug and review the code? We are seeing more instances of this bug and found that reconnect_work could hang as well, as can be seen from the stack trace below.
Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
Call Trace:
__schedule+0x2ab/0x880
sc

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-12-06 Thread Jaesoo Lee
Could you please take a look at this bug and review the code? We are seeing more instances of this bug and found that reconnect_work could hang as well, as can be seen from the stack trace below.
Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
Call Trace:
__schedule+0x2ab/0x880
schedule+0

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-11-29 Thread Jaesoo Lee
Not the queue, but the RDMA connections. Let me describe the scenario.
1. A connected nvme-rdma target with 500 namespaces: this makes nvme_remove_namespaces() take a long time to complete and opens the window that is vulnerable to this bug.
2. The host will take the code path below for nvme_delete_ctrl_wor

Re: [PATCH] nvme-rdma: complete requests from ->timeout

2018-11-29 Thread Sagi Grimberg
This does not hold, at least for the NVMe RDMA host driver. An example scenario is when the RDMA connection is gone while the controller is being deleted. In this case, the nvmf_reg_write32() call that delete_work uses to send the shutdown admin command could hang forever if the command is not completed