Hi James,
Thanks for your kind and detailed response.
It's really appreciated!
And sorry for my unclear description.
On 01/18/2018 05:01 AM, James Smart wrote:
> I'm having a hard time following why this patch is being requested. Help me
> catch on.....
>
Actually, this patchset fixes an issue in nvme_timeout.
Please consider the following scenario.
nvme_reset_ctrl
    -> set state to RESETTING
    -> queue reset_work
       (scheduling)
nvme_reset_work
    -> nvme_dev_disable
        -> quiesce queues
        -> nvme_cancel_request on outstanding requests
    -------------------------------------------------- _boundary_
    -> nvme initializing (may issue request on adminq)
Before the _boundary_, we not only quiesce the queues but also cancel all the
outstanding requests.
A request could expire while the ctrl state is RESETTING.
- If the timeout occurs before the _boundary_, the expired requests are from
  the previous work.
- Otherwise, the expired requests are from the controller initializing
  procedure, such as the cq/sq create commands sent to the adminq to set up
  the io queues.
In the current implementation, nvme_timeout cannot identify the _boundary_,
so it only handles the second case above.
This patchset identifies the _boundary_ above by introducing a RESET_PREPARE
state. Please refer to the following diagram.
nvme_reset_ctrl
    -> set state to RESET_PREPARE
    -> queue reset_work
       (scheduling)
nvme_reset_work
    -> nvme_dev_disable
        -> quiesce queues
        -> nvme_cancel_request on outstanding requests
    -> set state to RESETTING  ----------> _boundary_
    -> nvme initializing (may issue request on adminq)
Then we can identify the boundary by checking ctrl->state.
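To illustrate, here is a minimal standalone sketch (not the actual kernel
code; the enum values are a subset and the helper name is made up) of the
decision nvme_timeout would then be able to make:

```c
#include <stdbool.h>

/* Hypothetical subset of the ctrl states discussed in this thread. */
enum nvme_ctrl_state {
	NVME_CTRL_LIVE,
	NVME_CTRL_RESET_PREPARE,	/* scheduling gap + disable procedure */
	NVME_CTRL_RESETTING,		/* initializing, after the _boundary_ */
};

/*
 * A request that expires while the state is RESET_PREPARE is from the
 * previous work and will be cancelled by nvme_dev_disable; one that
 * expires in RESETTING is from the initializing procedure itself and
 * needs real handling.
 */
static bool expired_req_is_from_previous_work(enum nvme_ctrl_state state)
{
	return state == NVME_CTRL_RESET_PREPARE;
}
```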
After Sagi introduced d5bf4b7 (nvme-rdma: fix concurrent reset and reconnect),
the same thing is actually done in both nvme-fc and nvme-rdma:
- nvme_rdma_error_recovery_work
- nvme_rdma_reset_ctrl_work
- nvme_fc_reset_ctrl_work
But there the state goes from RESETTING to RECONNECTING.
So in this patch, RESETTING in nvme-fc/rdma is changed to RESET_PREPARE.
Then we get:
nvme-fc/rdma  RESET_PREPARE -> RECONNECTING -> LIVE
nvme-pci      RESET_PREPARE -> RESETTING    -> LIVE
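The happy-path transitions above could be sketched like this (a standalone
illustration, not the patch itself; the function name and the is_fabrics flag
are made up, and only the transitions named above are covered):

```c
#include <stdbool.h>

enum state { RESET_PREPARE, RESETTING, RECONNECTING, LIVE };

/*
 * Happy-path transitions discussed above:
 *   nvme-fc/rdma: RESET_PREPARE -> RECONNECTING -> LIVE
 *   nvme-pci:     RESET_PREPARE -> RESETTING    -> LIVE
 */
static bool transition_ok(enum state old, enum state new, bool is_fabrics)
{
	switch (new) {
	case RECONNECTING:	/* nvme-fc/rdma only */
		return is_fabrics && old == RESET_PREPARE;
	case RESETTING:		/* nvme-pci, nvmet-loop */
		return !is_fabrics && old == RESET_PREPARE;
	case LIVE:
		return old == (is_fabrics ? RECONNECTING : RESETTING);
	default:
		return false;
	}
}
```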
> On 1/16/2018 8:54 PM, Jianchao Wang wrote:
>> Currently, the ctrl->state will be changed to NVME_CTRL_RESETTING
>> before queue the reset work. This is not so strict. There could be
>> a big gap before the reset_work callback is invoked.
> ok.... so what you're saying is you want something to know that you've
> transitioned to RESETTING but not yet performed an action for the reset.
> What is that something and what is it to do ?
> guessing - I assume it's in the transport xxx_is_ready() and/or
> nvmf_check_init_req(), which is called at the start of queue_rq(), that wants
> to do something ?
>
This should be explained above. :)
>
>> In addition,
>> there is some disable work in the reset_work callback, strictly
>> speaking, not part of reset work, and could lead to some confusion.
>
> Can you explain this ? what's the confusion ?
>
> I assume by "disable" you mean it is quiescing queues ?
Sorry for my unclear description. It not only quiesces the queues, but also
handles the outstanding requests.
>
>
>>
>> In addition, after set state to RESETTING and disable procedure,
>> nvme-rdma/fc use NVME_CTRL_RECONNECTING to mark the setup and
>> reconnect procedure. The RESETTING state has been narrowed.
>
> I still don't follow. Yes RECONNECTING is where we repetitively: try to
> create a link-side association again: if it fails, delay and try again; or if
> success, reinit the controller, and unquiesce all queues - allowing full
> operation again, at which time we transition to LIVE.
>
> by "narrowed" what do you mean ? what "narrowed" ?
Currently, in nvme-pci, the RESETTING state includes both "shutdown" and
"setup". But in nvme-fc/rdma, it includes only "shutdown"; "setup" is in
RECONNECTING.
shutdown - tear down queues/connections, quiesce queues, cancel requests, ...
setup - set up new IO queues or connections and some other initializing
work, ...
>
> In FC, as we have a lot of work that must occur to terminate io as part of
> the reset, it can be a fairly long window. I don't know that any
> functionally in this path, regardless of time window, has narrowed.
>
>
>>
>> This patch add NVME_CTRL_RESET_PREPARE state to mark the reset_work
>> or error recovery work, scheduling gap and disable procedure.
>> After that,
>> - For nvme-pci, nvmet-loop, set state to RESETTING, start
>> initialization.
>> - For nvme-rdma, nvme-fc, set state to RECONNECTING, start
>> initialization or reconnect.
>
> So I'm lost - so you've effectively renamed RESETTING to RESET_PREPARE for
> fc/rdma. What do you define are the actions in RESETTING that went away and
> why is that different between pci and the other transports ?
As Sagi said, we could merge RECONNECTING and RESETTING into a single state.
> Why doesn't nvme-pci need to go through RESET_PREPARE ? doesn't it have the
> same scheduling window for a reset_work thread ?
Sure, nvme-pci also needs to go through RESET_PREPARE.
>
>
> On 1/17/2018 1:06 AM, Max Gurtovoy wrote:
>>
>>> +
>>> + case NVME_CTRL_RESETTING:
>>> + switch (old_state) {
>>> + case NVME_CTRL_RESET_PREPARE:
>>> + changed = true;
>>> + /* FALLTHRU */
>>> + default:
>>> + break;
>>> + }
>>> + break;
>>> case NVME_CTRL_RECONNECTING:
>>> switch (old_state) {
>>> case NVME_CTRL_LIVE:
>>> - case NVME_CTRL_RESETTING:
>>> + case NVME_CTRL_RESET_PREPARE:
>>
>> As I suggested in V3, please don't allow this transition.
>> We'll move to NVME_CTRL_RECONNECTING from NVME_CTRL_RESETTING.
>>
>> I look on it like that:
>>
>> NVME_CTRL_RESET_PREPARE - "suspend" state
>> NVME_CTRL_RESETTING - "resume" state
>>
>> you don't reconnect from "suspend" state, you must "resume" before you
>> reconnect.
>
> This makes no sense to me.
>
> I could use a definition of what "suspend" and "resume" mean to you....
>
> from what I've seen so far:
> NVME_CTRL_RESET_PREPARE: means I've decided to reset, changed state, but
> the actual work for reset hasn't started yet. As we haven't commonized who
> does the quiescing of the queues, the queues are still live at this state,
> although some nvme check routine may bounce them. In truth, the queues should
> be quiesced here.
>
> NVME_CTRL_RESETTING: I'm resetting the controller, tearing down
> queues/connections, the link side association. AFAIK - pci and all the other
> transports have to do things things. Now is when the blk-mq queues get
> stopped. We have a variance on whether the queues are unquiesced or left
> quiesced (I think this is what you meant by "resume", where resume means
> unquiesce) at the end of this. The admin_q is unquiesced, meaning new admin
> cmds should fail. rdma also has io queues unquiesced meaning new ios fail,
> while fc leaves them quiesced while background timers run - meaning no new
> ios issued, nor any fail back to a multipather. With the agreement that we
> would patch all of the transports to leave them quiesced with
> fast-fail-timeouts occuring to unquiesce them and start failing ios.
>
> NVME_RECONNECTING: transitioned to after the link-side association is
> terminated and the transport will now attempt to reconnect (perhaps several
> attempts) to create a new link-side association. Stays in this state until
> the controller is fully reconnected and it transitions to NVME_LIVE. Until
> the link side association is active, queues do what they do (as left by
> RESETTING and/or updated per timeouts) excepting that after an active
> assocation, they queues will be unquiesced at the time of the LIVE
> transition. Note: we grandfathered PCI into not needing this state: As
> you (almost) can't fail the establishment of the link-side association as
> it's a PCI function and registers so unless the device has dropped off the
> bus, you can immediately talk to registers and start to reinit the
> controller. Given it was almost impossible not to succeed, and as the code
> path already existed, we didn't see a reason to make pci change.
Thanks for the guidance here. That's really appreciated.
>
>
> On 1/17/2018 1:19 AM, jianchao.wang wrote:
>>
>> I used to respond you in the V3 and wait for your feedback.
>> Please refer to:
>> After Sagi's nvme-rdma: fix concurrent reset and reconnect, the rdma ctrl
>> state is changed to RECONNECTING state
>> after some clearing and shutdown work, then some initializing procedure, no
>> matter reset work path or error recovery path.
>> The fc reset work also does the same thing.
>> So if we define the range that RESET_PREPARE includes scheduling gap and
>> disable and clear work, RESETTING includes initializing
>> procedure, RECONNECTING is very similar with RESETTING.
>
> I get the impression we have very different understandings of what occurs
> under what states. I'd like to see a summary of what you think that is as
> this is what we must agree upon.
>
> Of course, if you move the gap, disable, and clear work into RESET_PREPARE,
> you've taken most of the RESETTING state away. What was left ?
This should be made clear by the explanation at the top. :)
> I don't believe RESETTING and RECONNECTING are that similar - unless - you
> are looking at that "reinit" part that we left grandfathered into PCI.
Yes, I have got the point. Thanks for the guidance.
Both nvme-pci and nvme-fc/rdma have a "shutdown" part that tears down the
queues/connections, quiesces the queues, and cancels outstanding requests...
We could call this part "shutdowning", together with the scheduling gap.
Because of the difference in initialization between nvme-pci and
nvme-fc/rdma, we could call it "reiniting" for nvme-pci and "reconnecting"
for nvme-fc/rdma.
>
>
> On 1/17/2018 1:24 AM, jianchao.wang wrote:
>>
>>> Maybe we could do like this;
>>> In nvme fc/rdma
>>> - set state to RESET_PREPARE, queue reset_work/err_work
>>> - clear/shutdown works, set state to RECONNECTING
>>> - initialization, set state to LIVE
>>>
>>> In nvme pci
>>> - set state to RESET_PREPARE, queue reset_work
>>> - clear/shutdown works, set state to RESETTING
>>> - initialization, set state to LIVE
>>> Currently, RECONNECTING has overlapped with RESETTING.
>>> So I suggest to use RESET_PREPARE to mark the "suspend" part as you said.
>>> And use RECONNECTING to mark the "resume" part in nvme-rdma/fc
>>> use RESETTING part to mark the "resume" part in nvme-pci, nvmt-loop.
>> I have added a comment above the definition of nvme_ctrl_state as below:
>> +/*
>> + * RESET_PREPARE - mark the state of scheduling gap and disable procedure
>> + * RESETTING - nvme-pci, nvmet loop use it to mark the state of setup
>> + * procedure
>> + * RECONNECTING - nvme-fc, nvme-rdma use it to mark the state of setup
>> + * and reconnect procedure
>> + */
>>
>> Consequently, nvme-fc/rdma don't use RESETTING state any more, but only
>> RESET_PREPARE.
>
> So now I'm confused again... The purpose of the new state was to have a
> state indicator to cover the transition into the state while the reset_work
> item was yet to be scheduled. If all you've done is change the state name on
> fc/rdma - don't you still have that window ?
This should be made clear by the explanation at the top. :)
Many thanks again.
>
> -- james