Regardless of the API discussion,
The nvme-tcp bits look straightforward:
Acked-by: Sagi Grimberg
request tag can't be zero? I forget...
Of course it can. But the reserved tags are before the normal tags,
so 0 would be a reserved tag for nvme.
Right.
It is, but in this situation, the controller is sending a second
completion that results in a use-after-free, which makes the
transport irrelevant. Unless there is some other flow (which is
unclear to me) that causes this, in which case that is a bug that
needs to be fixed rather than hidden with a safeguard.
What we can do, though, is check the 'state' field in the tcp
request, and only allow completions for commands which are in a state
that allows completions.
Let's see if I can whip up a patch.
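A minimal sketch of that idea (the state names below are hypothetical placeholders, not the actual nvme-tcp enum): accept a completion only while the command is in a state where completing it is legal, and drop anything else instead of completing the request twice.

	/* hypothetical per-request states standing in for the driver's own */
	if (req->state != NVME_TCP_REQ_IN_FLIGHT &&
	    req->state != NVME_TCP_REQ_DATA_RECVD) {
		dev_warn(ctrl->device,
			 "unexpected completion in state %d, ignoring\n",
			 req->state);
		return -EPROTO;	/* drop instead of double completing */
	}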
That would be great. BTW in the crash dump I am looking at now, it
looks like pdu->command_id
Reviewed-by: Sagi Grimberg
Reviewed-by: Sagi Grimberg
Acked-by: Sagi Grimberg
In nvmet_rdma_write_data_done(), rsp is recovered from wc->wr_cqe
and freed by nvmet_rdma_release_rsp(). But after that, pr_info()
used a member of the freed object and could leak the freed
chunk's address via wc->wr_cqe by computing the offset.
Signed-off-by: Lv Yunlong
---
drivers/nvme/targe
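A sketch of the idea in the changelog above (not necessarily the patch as merged): once nvmet_rdma_release_rsp() has run, nothing reachable through rsp or wc->wr_cqe may be touched, so the failure message no longer prints the stale CQE pointer.

	nvmet_rdma_release_rsp(rsp);

	if (unlikely(wc->status != IB_WC_SUCCESS &&
		     wc->status != IB_WC_WR_FLUSH_ERR)) {
		/* log only data copied out of the wc, never the freed rsp */
		pr_info("RDMA WRITE for CQE failed with status %s (%d).\n",
			ib_wc_status_msg(wc->status), wc->status);
		nvmet_rdma_error_comp(queue);
	}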
Hi Sagi,
On Fri, Mar 05, 2021 at 11:57:30AM -0800, Sagi Grimberg wrote:
Daniel, again, there is nothing specific about this to nvme-tcp,
this is a safeguard against a funky controller (or a different
bug that is hidden by this).
As far as I can tell, the main difference between nvme-tcp and
blk_mq_tag_to_rq() always returns a request if the tag id is in the
valid range [0...max_tags). If the target replies with a tag for which
we don't have a started request, the host will likely corrupt data or
simply crash.
Add an additional check that the request has been started before
completing it.
blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request, which
can be fatal.
How did you get to this one? Did the controller send a completion for
a comp
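A hedged sketch of the safeguard being discussed (the surrounding context variables - tagset, cqe, ctrl - are assumptions, not a literal quote of the driver): only complete a request that the block layer has actually started, and treat anything else as a protocol error from a misbehaving controller.

	rq = blk_mq_tag_to_rq(tagset, cqe->command_id);
	if (!rq || !blk_mq_request_started(rq)) {
		dev_err(ctrl->device,
			"got completion for id %#x, but no such request is in flight\n",
			cqe->command_id);
		return -EPROTO;	/* drop the spurious completion */
	}
	/* normal completion path continues here */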
Reviewed-by: Sagi Grimberg
You can't see exactly where it dies but I followed the assembly to
nvme_round_robin_path(). Maybe it's not the initial nvme_next_ns(head,
old) which returns NULL but nvme_next_ns() is returning NULL eventually
(list_next_or_null_rcu()).
So there is another bug causing nvme_next_ns() to behave abnormally.
I revie
But this only catches a physical block size < 512 for NVMe, not any other block
device.
Please fix it for the general case in blk_queue_physical_block_size().
We actually call that later; it would probably be better to check here.
We had a series for that a short while ago that got lost.
FWIW,
Reviewed-by: Sagi Grimberg
The nvme spec (1.4a, figure 248) says:
"A value smaller than 9 (i.e., 512 bytes) is not supported."
Signed-off-by: Li Feng
---
drivers/nvme/host/core.c | 6 ++
1 file changed, 6 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f320273fc672..1f02e6e49
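A hedged sketch of the proposed check (the names follow nvme core conventions, but the exact placement in core.c and the error code are assumptions, not the submitted patch): since NVMe 1.4a figure 248 forbids an LBA data size below 2^9 = 512 bytes, refuse to set up a namespace that reports one.

	if (id->lbaf[lbaf].ds < 9) {
		dev_warn(ctrl->device,
			 "unsupported LBA data size 2^%u, rejecting namespace\n",
			 id->lbaf[lbaf].ds);
		return -EINVAL;
	}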
Dear all:
Thanks, again, for the very constructive discussions.
I am writing back with quite a few updates:
1. We have now included a detailed comparison of i10 scheduler with
Kyber with NVMe-over-TCP
(https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf).
In a nutshe
But if you think this has a better home, I'm assuming that the guys
will be open to that.
Also see the reply from Ming. It's a balancing act - don't want to add
extra overhead to the core, but also don't want to carry an extra
scheduler if the main change is really just variable dispatch batc
I haven't taken a close look at the code yet so far, but one quick note
that patches like this should be against the branches for 5.11. In fact,
this one doesn't even compile against current -git, as
blk_mq_bio_list_merge is now called blk_bio_list_merge.
Ugh, I guess that Jaehyun had this pa
blk-mq actually has a built-in batching (or sort of) mechanism, which is
enabled if the hw queue is busy (hctx->dispatch_busy > 0). We use an EWMA
to compute hctx->dispatch_busy, and it is adaptive, even though the
implementation is quite coarse. But there should be much room to improve, IMO.
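For reference, a simplified sketch of how such an EWMA busy estimate can be maintained (the constants and names approximate the upstream blk-mq helper but may differ between kernel versions):

#define DISPATCH_BUSY_EWMA_WEIGHT	8
#define DISPATCH_BUSY_EWMA_FACTOR	4

static void update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
{
	unsigned int ewma = hctx->dispatch_busy;

	if (!ewma && !busy)
		return;

	/* weighted moving average: the old value decays, busy events add weight */
	ewma *= DISPATCH_BUSY_EWMA_WEIGHT - 1;
	if (busy)
		ewma += 1 << DISPATCH_BUSY_EWMA_FACTOR;
	ewma /= DISPATCH_BUSY_EWMA_WEIGHT;

	hctx->dispatch_busy = ewma;
}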
Yo
There really aren't any rules for this, and it's perfectly legit to
complete from process context. Maybe you're a kthread driven driver and
that's how you handle completions. The block completion path has always
been hard IRQ safe, but possible to call from anywhere.
I'm not trying to put res
Reviewed-by: Sagi Grimberg
Well, usb-storage obviously seems to do it, and the block layer
does not prohibit it.
Also loop, nvme-tcp and then I stopped looking.
Any objections to adding local_bh_disable() around it?
To me it seems like the whole IPI plus potentially softirq dance is
a little pointless when complet
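A minimal sketch of that suggestion (the placement is illustrative, not a specific driver): bracket a process-context completion with local_bh_disable()/local_bh_enable() so code in the completion path that assumes BH context keeps working.

	local_bh_disable();
	blk_mq_complete_request(rq);	/* completion issued from process context */
	local_bh_enable();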
On 10/27/20 5:15 AM, zhenwei pi wrote:
In the zero KATO scenario, if the initiator crashes without a
transport-layer disconnection, the target side would never reclaim
resources. A target could start a transport-layer keep-alive to detect
a dead connection for the admin queue.
Not sure why we should worry
Reviewed-by: Sagi Grimberg
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 9e378d0a0c01..2ecadd309f4a 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1767,6 +1767,21 @@ static void nvme_rdma_recv_done(struct ib_cq *cq, struct
ib_wc *wc)
return;
}
Fixes:
Don't run keep alive work with zero kato.
"Fixes" tags need to have a git commit id followed by the commit
subject. I can't find any commit with that subject, though.
Fixes: 0d3b6a8d213a ("nvmet: Disable keep-alive timer when kato is
cleared to 0h")
Hit a warning:
WARNING: CPU: 1 PID: 241 at kernel/workqueue.c:1627
__queue_delayed_work+0x6d/0x90
with trace:
mod_delayed_work_on+0x59/0x90
nvmet_update_cc+0xee/0x100 [nvmet]
nvmet_execute_prop_set+0x72/0x80 [nvmet]
nvmet_tcp_try_recv_pdu+0x2f7/0x770 [nvmet_tcp]
nvmet_tcp_io_work+0
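A hedged sketch of the idea behind "Don't run keep alive work with zero kato" (the placement is an assumption, not the merged fix): when kato is zero the keep-alive machinery is never set up, so any path that would (re)arm it must bail out first.

	if (!ctrl->kato)
		return;		/* keep-alive disabled, nothing to (re)arm */

	mod_delayed_work(system_wq, &ctrl->ka_work, ctrl->kato * HZ);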
Yes, basically the usage of managed affinity caused people to report
regressions about not being able to change irq affinity from procfs.
Well, why would they change it? The whole point of the infrastructure
is that there is a single sane affinity setting for a given setup. Now
that setting needed som
1606.ga25...@ziepe.ca/
commit 759ace7832802eaefbca821b2b43a44ab896b449
Author: Sagi Grimberg
Date: Thu Nov 1 13:08:07 2018 -0700
i40iw: remove support for ib_get_vector_affinity
commit 9afc97c29b032af9a4112c2f4a02d5313b4dc71f
Author: Sagi Grimberg
Date: Thu Nov 1 09:13:12 2018 -0700
Reviewed-by: Sagi Grimberg
This causes a regression and was reverted upstream, just FYI.
On 9/17/20 7:06 PM, Sasha Levin wrote:
From: Israel Rukshin
[ Upstream commit ce1518139e6976cf19c133b555083354fdb629b8 ]
Calling nvme_sysfs_delete() when the controller is in the middle of
creation may cause several bugs. If the co
Reviewed-by: Sagi Grimberg
Added to nvme-5.9-rc
The way 'spin_lock()' and 'spin_lock_irqsave()' are used is not consistent
in this function.
Use 'spin_lock_irqsave()' also here, as there is no guarantee that
interrupts are disabled at that point, according to the surrounding code.
Fixes: a97ec51b37ef ("nvmet_fc: Rework target side abort handli
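An illustrative fragment of the pattern being requested (the lock name is a placeholder, not necessarily the actual nvmet-fc field): when the same function can be entered with or without interrupts disabled, the irqsave variant records and restores the previous interrupt state.

	unsigned long flags;

	spin_lock_irqsave(&queue->qlock, flags);
	/* critical section also reachable from interrupt context */
	spin_unlock_irqrestore(&queue->qlock, flags);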
Reviewed-by: Sagi Grimberg
Reviewed-by: Sagi Grimberg
Reviewed-by: Sagi Grimberg
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ba725ae47305..c4f1ce0ee1e3 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1249,8 +1249,8 @@ static enum blk_eh_timer_return nvme_timeout(struct
request *req, bool reserved)
dev_warn_rate
queued to nvme-5.9-rc
There's an unrelated whitespace change in nvme_init_identify().
Otherwise, looks fine.
Oops, sorry. can this be fixed up when it's merged?
Fixed and queued.
This patch adds a check in nvmet_execute_passthru_cmd() to prevent the
following oops :-
Why LKML and not linux-nvme?
My bad (+linux-nvme).
It doesn't have the patch. Can you resend?
Reviewed-by: Sagi Grimberg
Greetings,
FYI, we noticed the following commit (built with gcc-9):
commit: c804af2c1d3152c0cf877eeb50d60c2d49ac0cf0 ("IB/srpt: use new
shared CQ mechanism")
https://git.kernel.org/cgit/linux/kernel/git/rdma/rdma.git for-next
in testcase: blktests
with following parameters:
test: srp-
Thanks for the review Christoph. I think I should be able to make all
the requested changes in the next week or two.
On 2020-07-20 1:35 p.m., Sagi Grimberg wrote:
I'm still not so happy about having to look up the namespace and still
wonder if we should generalize the connect_q
On Mon, Jul 20, 2020 at 05:01:19PM -0600, Logan Gunthorpe wrote:
On 2020-07-20 4:35 p.m., Sagi Grimberg wrote:
passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which
means that the driver shouldn't need the ns at all. So if you have a
dedicated request queue (mapped to th
On 7/20/20 4:17 PM, Keith Busch wrote:
On Mon, Jul 20, 2020 at 05:01:19PM -0600, Logan Gunthorpe wrote:
On 2020-07-20 4:35 p.m., Sagi Grimberg wrote:
passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which
means that the driver shouldn't need the ns at all. So if you h
Add passthru command handling capability for the NVMeOF target and
export passthru APIs which are used to integrate passthru
code with nvmet-core.
The new file passthru.c handles passthru cmd parsing and execution.
In the passthru mode, we create a block layer request from the nvmet
request an
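A hedged sketch of the flow described above (error handling trimmed; nvmet_passthru_req_done is a placeholder completion callback, and the helper signatures are the ~5.9-era host core ones): wrap the NVMe command carried by the nvmet request in a block layer request and execute it against the underlying controller's queue.

	struct request *rq;

	rq = nvme_alloc_request(q, req->cmd, 0, NVME_QID_ANY);
	if (IS_ERR(rq))
		return NVME_SC_INTERNAL;	/* real code would complete the nvmet req */

	/* fire and forget; the callback reports the result back to the nvmet req */
	blk_execute_rq_nowait(q, NULL, rq, 0, nvmet_passthru_req_done);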
Reviewed-by: Sagi Grimberg
Reviewed-by: Sagi Grimberg
Reviewed-by: Sagi Grimberg
Reviewed-by: Sagi Grimberg
Hi!
From: Sagi Grimberg
[ Upstream commit 3b4b19721ec652ad2c4fe51dfbe5124212b5f581 ]
Revert fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk
in nvme_validate_ns")
When adding a new namespace to the head disk (via nvme_mpath_set_live)
we will see partition
Reviewed-by: Sagi Grimberg
From the NVMe spec, "In order to make efficient use of the non-volatile
memory, it is often advantageous to execute multiple commands from a
Submission Queue in parallel. For Submission Queues that are using
weighted round robin with urgent priority class or round robin
arbitration, host softw
Reviewed-by: Sagi Grimberg
Reviewed-by: Sagi Grimberg
Reviewed-by: Sagi Grimberg
Reviewed-by: Sagi Grimberg
Acked-by: Sagi Grimberg
looks quite promising:
42 files changed, 721 insertions(+), 799 deletions(-)
For the nvme-tcp bits,
Acked-by: Sagi Grimberg
devices will benefit from the batching so maybe the flag needs to be
inverted? BLK_MQ_F_DONT_BATCHING_SUBMISSION?
Actually I'd rather not add any flag, and we may use some algorithm
(maybe EWMA or other intelligent stuff, or use hctx->dispatch_busy directly)
to figure out one dynamic batch
Basically, my idea is to dequeue requests one by one, and for each
dequeued request:
- we try to get a budget and a driver tag; if both succeed, add the
request to a per-task list which can be stored in a stack variable,
then continue to dequeue more requests
- if either the budget or the driver tag can'
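A rough pseudocode sketch of that idea (all helper names are illustrative, not an existing blk-mq API): collect requests that already own a budget and a driver tag on a local list, then hand the whole batch to the driver in one go.

	LIST_HEAD(batch);
	struct request *rq;

	while ((rq = dequeue_request(hctx))) {
		if (!get_budget(hctx) || !get_driver_tag(rq)) {
			/* out of resources: stop batching, requeue this one */
			requeue_request(hctx, rq);
			break;
		}
		list_add_tail(&rq->queuelist, &batch);
	}

	/* the driver commits (rings the doorbell) once for the whole batch */
	dispatch_batch_to_driver(hctx, &batch);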
You're mostly correct. This is exactly why an I/O scheduler may be
applicable here IMO. Mostly because I/O schedulers tend to optimize for
something specific and always present tradeoffs. Users need to
understand what they are optimizing for.
Hence I'd say this functionality can definitely be
Hey Ming,
Would it make sense to elevate this flag to a request_queue flag
(QUEUE_FLAG_ALWAYS_COMMIT)?
A request queue flag is usually writable; however, this case just needs
one read-only flag, so I think it may be better to make it a
tagset/hctx flag.
I actually intended it to be writable.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f389d7c724bd..6a20f8e8eb85 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -391,6 +391,7 @@ struct blk_mq_ops {
enum {
BLK_MQ_F_SHOULD_MERGE = 1 << 0,
BLK_MQ_F_TAG_SHARED = 1 << 1,
+
This was already fixed and merged (by Dan)
On 10/2/19 5:43 AM, Colin King wrote:
From: Colin Ian King
Currently when the call to sysfs_create_link fails the error exit
path returns an uninitialized value in variable ret. Fix this by
returning the error code returned from the failed call to
sys
Looks fine to me,
Reviewed-by: Sagi Grimberg
Keith, Christoph?
Sagi,
Sorry it took a while to bring my system back online.
With the patch, the IOPS drop is about the same as with the 1st patch. I think
the excessive context switches are causing the drop in IOPS.
The following are captured by "perf sched record" for 30 seconds during
tests.
"perf sched
Keith, can we get a review from you for this?
Hey Ming,
Ok, so the real problem is per-cpu bounded tasks.
I share Thomas' opinion about a NAPI-like approach.
We already have that, it's irq_poll, but it seems that for this
use-case, we get lower performance for some reason. I'm not entirely
sure why that is, maybe it's because we need to
It seems like we're attempting to stay in irq context for as long as we
can instead of scheduling to softirq/thread context if we have more than
a minimal amount of work to do. Without at least understanding why
softirq/thread degrades us so much, this code seems like the wrong
approach to me. I
This does not apply on nvme-5.4; can you please respin a patch
that cleanly applies?
Reviewed-by: Sagi Grimberg
Reviewed-by: Sagi Grimberg
applied to nvme-5.4
Looks good,
Reviewed-by: Sagi Grimberg
Applied to nvme-5.4
Applied to nvme-5.4
Still don't understand how this is ok...
I have /dev/nvme0 representing a network endpoint that I would discover
from; it raises an event telling me to do a discovery operation (namely to
issue an ioctl to it), so my udev code calls a systemd script.
By the time I actually get to do that, /dev/nvme0
Yes we do, userspace should use it to order events. Does udev not
handle that properly today?
The problem is not ordering of events, it's really about the fact that
the chardev can be removed and reallocated for a different controller
(could be a completely different discovery controller) by
You are correct that this information can be derived from sysfs, but the
main reason why we add these here is that in a udev rule we can't
just go ahead and start looking these up and parsing them.
We could send the discovery aen with NVME_CTRL_NAME and
then have systemd run something
I'll fix it. Note that I'm going to take it out of the tree soon
because it will have conflicts with Jens' for-5.4/block, so we
will send it to Jens after the initial merge window, after he
rebases off of Linus.
Conflicts too hard to fix up at merge time? Otherwise I could just
rebase on top o
wrote:
+#define NVME_NVM_ADMSQES 6
#define NVME_NVM_IOSQES 6
#define NVME_NVM_IOCQES 4
The NVM in the two defines here stands for the NVM command set,
so this should just be named NVME_ADM_SQES or so. But except for
this
the patch looks good:
Reviewed-
I don't understand why we don't limit a regular ctrl to single access
while we do it for the PT ctrl.
I guess the block layer helps to synchronize multiple parallel accesses,
but we can do that as well.
Also, let's say you limit the access to this subsystem to 1 user, the
bdev is still acces
Sagi,
Here are the test results.
Benchmark command:
fio --bs=4k --ioengine=libaio --iodepth=64
--filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1
--direct=1 --runtime=90 --numjobs=80 --rw=randread --na
From: Long Li
When an NVMe hardware queue is mapped to several CPU queues, it is possible
that the CPU this hardware queue is bound to is flooded by returning I/O for
other CPUs.
For example, consider the following scenario:
1. CPU 0, 1, 2 and 3 share the same hardware queue
2. the hardware q