Hi James,
Thanks for your kind and detailed response.
It's really appreciated!
And sorry for my unclear description.
On 01/18/2018 05:01 AM, James Smart wrote:
> I'm having a hard time following why this patch is being requested. Help me
> catch on.....
>
Actually, this patchset fixes an issue in nvme_timeout.
Please consider the following scenario.
nvme_reset_ctrl
    -> set state to RESETTING
    -> queue reset_work
       (scheduling)
nvme_reset_work
    -> nvme_dev_disable
        -> quiesce queues
        -> nvme_cancel_request on outstanding requests
    -------------------------------------------------- _boundary_
    -> nvme initializing (may issue request on adminq)
Before the _boundary_, we not only quiesce the queues but also cancel all the
outstanding requests.
A request could expire while the ctrl state is RESETTING.
- If the timeout occurs before the _boundary_, the expired requests are from
  the previous work.
- Otherwise, the expired requests are from the controller initializing
  procedure, such as the cq/sq create commands sent to the adminq to set up
  the io queues.
In the current implementation, nvme_timeout cannot identify the _boundary_,
so it only handles the second case above.
This patchset identifies the _boundary_ above by introducing a RESET_PREPARE
state. Please refer to the following diagram.
nvme_reset_ctrl
    -> set state to RESET_PREPARE
    -> queue reset_work
       (scheduling)
nvme_reset_work
    -> nvme_dev_disable
        -> quiesce queues
        -> nvme_cancel_request on outstanding requests
    -> set state to RESETTING  ----------> _boundary_
    -> nvme initializing (may issue request on adminq)
Then we can identify the boundary by checking ctrl->state.
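To illustrate, here is a minimal standalone sketch (not the actual kernel
code; the enum values are a subset and the helper name is made up) of the
decision nvme_timeout would then be able to make:

```c
#include <stdbool.h>

/* Hypothetical subset of the ctrl states discussed in this thread. */
enum nvme_ctrl_state {
	NVME_CTRL_LIVE,
	NVME_CTRL_RESET_PREPARE,	/* scheduling gap + disable procedure */
	NVME_CTRL_RESETTING,		/* initializing, after the _boundary_ */
};

/*
 * A request that expires while the state is RESET_PREPARE is from the
 * previous work and will be cancelled by nvme_dev_disable; one that
 * expires in RESETTING is from the initializing procedure itself and
 * needs real handling.
 */
static bool expired_req_is_from_previous_work(enum nvme_ctrl_state state)
{
	return state == NVME_CTRL_RESET_PREPARE;
}
```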
After Sagi introduced d5bf4b7 (nvme-rdma: fix concurrent reset and reconnect),
the same thing is actually done in both nvme-fc and nvme-rdma:
- nvme_rdma_error_recovery_work
- nvme_rdma_reset_ctrl_work
- nvme_fc_reset_ctrl_work
But there the state goes from RESETTING to RECONNECTING.
So in this patch, RESETTING in nvme-fc/rdma is changed to RESET_PREPARE.
Then we get:
nvme-fc/rdma  RESET_PREPARE -> RECONNECTING -> LIVE
nvme-pci      RESET_PREPARE -> RESETTING    -> LIVE
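The happy-path transitions above could be sketched like this (a standalone
illustration, not the patch itself; the function name and the is_fabrics flag
are made up, and only the transitions named above are covered):

```c
#include <stdbool.h>

enum state { RESET_PREPARE, RESETTING, RECONNECTING, LIVE };

/*
 * Happy-path transitions discussed above:
 *   nvme-fc/rdma: RESET_PREPARE -> RECONNECTING -> LIVE
 *   nvme-pci:     RESET_PREPARE -> RESETTING    -> LIVE
 */
static bool transition_ok(enum state old, enum state new, bool is_fabrics)
{
	switch (new) {
	case RECONNECTING:	/* nvme-fc/rdma only */
		return is_fabrics && old == RESET_PREPARE;
	case RESETTING:		/* nvme-pci, nvmet-loop */
		return !is_fabrics && old == RESET_PREPARE;
	case LIVE:
		return old == (is_fabrics ? RECONNECTING : RESETTING);
	default:
		return false;
	}
}
```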
> On 1/16/2018 8:54 PM, Jianchao Wang wrote:
>> Currently, the ctrl->state will be changed to NVME_CTRL_RESETTING
>> before queue the reset work. This is not so strict. There could be
>> a big gap before the reset_work callback is invoked.
> ok.... so what you're saying is you want something to know that you've
> transitioned to RESETTING but not yet performed an action for the reset.
> What is that something and what is it to do ?
> guessing - I assume it's in the transport xxx_is_ready() and/or
> nvmf_check_init_req(), which is called at the start of queue_rq(), that wants
> to do something ?
>
This should be explained above. :)
>
>> In addition,
>> there is some disable work in the reset_work callback, strictly
>> speaking, not part of reset work, and could lead to some confusion.
>
> Can you explain this ? what's the confusion ?
>
> I assume by "disable" you mean it is quiescing queues ?
Sorry for my unclear description. It not only quiesces the queues, but also
handles the outstanding requests.
>
>
>>
>> In addition, after set state to RESETTING and disable procedure,
>> nvme-rdma/fc use NVME_CTRL_RECONNECTING to mark the setup and
>> reconnect procedure. The RESETTING state has been narrowed.
>
> I still don't follow. Yes RECONNECTING is where we repetitively: try to
> create a link-side association again: if it fails, delay and try again; or if
> success, reinit the controller, and unquiesce all queues - allowing full
> operation again, at which time we transition to LIVE.
>
> by "narrowed" what do you mean ? what "narrowed" ?
Currently, in nvme-pci, the RESETTING state includes both "shutdown" and
"setup". But in nvme-fc/rdma, it includes only "shutdown"; "setup" is in
RECONNECTING.
shutdown - tear down queues/connections, quiesce queues, cancel requests, ...
setup - set up new IO queues or connections and some other initializing
work, ...
>
> In FC, as we have a lot of work that must occur to terminate io as part of
> the reset, it can be a fairly long window. I don't know that any
> functionally in this path, regardless of time window, has narrowed.
>
>
>>
>> This patch add NVME_CTRL_RESET_PREPARE state to mark the reset_work
>> or error recovery work, scheduling gap and disable procedure.
>> After that,
>> - For nvme-pci, nvmet-loop, set state to RESETTING, start
>> initialization.
>> - For nvme-rdma, nvme-fc, set state to RECONNECTING, start
>> initialization or reconnect.
>
> So I'm lost - so you've effectively renamed RESETTING to RESET_PREPARE for
> fc/rdma. What do you define are the actions in RESETTING that went away and
> why is that different between pci and the other transports ?
As Sagi said, we could merge RECONNECTING and RESETTING into a single state.
> Why doesn't nvme-pci need to go through RESET_PREPARE ? doesn't it have the
> same scheduling window for a reset_work thread ?
Sure, nvme-pci also needs to go through RESET_PREPARE.
>
>
> On 1/17/2018 1:06 AM, Max Gurtovoy wrote:
>>
>>> +
>>> + case NVME_CTRL_RESETTING:
>>> + switch (old_state) {
>>> + case NVME_CTRL_RESET_PREPARE:
>>> + changed = true;
>>> + /* FALLTHRU */
>>> + default:
>>> + break;
>>> + }
>>> + break;
>>> case NVME_CTRL_RECONNECTING:
>>> switch (old_state) {
>>> case NVME_CTRL_LIVE:
>>> - case NVME_CTRL_RESETTING:
>>> + case NVME_CTRL_RESET_PREPARE:
>>
>> As I suggested in V3, please don't allow this transition.
>> We'll move to NVME_CTRL_RECONNECTING from NVME_CTRL_RESETTING.
>>
>> I look on it like that:
>>
>> NVME_CTRL_RESET_PREPARE - "suspend" state
>> NVME_CTRL_RESETTING - "resume" state
>>
>> you don't reconnect from "suspend" state, you must "resume" before you
>> reconnect.
>
> This makes no sense to me.
>
> I could use a definition of what "suspend" and "resume" mean to you....
>
> from what I've seen so far:
> NVME_CTRL_RESET_PREPARE: means I've decided to reset, changed state, but
> the actual work for reset hasn't started yet. As we haven't commonized who
> does the quiescing of the queues, the queues are still live at this state,
> although some nvme check routine may bounce them. In truth, the queues should
> be quiesced here.
>
> NVME_CTRL_RESETTING: I'm resetting the controller, tearing down
> queues/connections, the link side association. AFAIK - pci and all the other
> transports have to do things things. Now is when the blk-mq queues get
> stopped. We have a variance on whether the queues are unquiesced or left
> quiesced (I think this is what you meant by "resume", where resume means
> unquiesce) at the end of this. The admin_q is unquiesced, meaning new admin
> cmds should fail. rdma also has io queues unquiesced meaning new ios fail,
> while fc leaves them quiesced while background timers run - meaning no new
> ios issued, nor any fail back to a multipather. With the agreement that we
> would patch all of the transports to leave them quiesced with
> fast-fail-timeouts occuring to unquiesce them and start failing ios.
>
> NVME_RECONNECTING: transitioned to after the link-side association is
> terminated and the transport will now attempt to reconnect (perhaps several
> attempts) to create a new link-side association. Stays in this state until
> the controller is fully reconnected and it transitions to NVME_LIVE. Until
> the link side association is active, queues do what they do (as left by
> RESETTING and/or updated per timeouts) excepting that after an active
> assocation, they queues will be unquiesced at the time of the LIVE
> transition. Note: we grandfathered PCI into not needing this state: As
> you (almost) can't fail the establishment of the link-side association as
> it's a PCI function and registers so unless the device has dropped off the
> bus, you can immediately talk to registers and start to reinit the
> controller. Given it was almost impossible not to succeed, and as the code
> path already existed, we didn't see a reason to make pci change.
Thanks for the guidance here. That's really appreciated.
>
>
> On 1/17/2018 1:19 AM, jianchao.wang wrote:
>>
>> I used to respond you in the V3 and wait for your feedback.
>> Please refer to:
>> After Sagi's nvme-rdma: fix concurrent reset and reconnect, the rdma ctrl
>> state is changed to RECONNECTING state
>> after some clearing and shutdown work, then some initializing procedure, no
>> matter reset work path or error recovery path.
>> The fc reset work also does the same thing.
>> So if we define the range that RESET_PREPARE includes scheduling gap and
>> disable and clear work, RESETTING includes initializing
>> procedure, RECONNECTING is very similar with RESETTING.
>
> I get the impression we have very different understandings of what occurs
> under what states. I'd like to see a summary of what you think that is as
> this is what we must agree upon.
>
> Of course, if you move the gap, disable, and clear work into RESET_PREPARE,
> you've taken most of the RESETTING state away. What was left ?
This should be made clear by the explanation at the top. :)
> I don't believe RESETTING and RECONNECTING are that similar - unless - you
> are looking at that "reinit" part that we left grandfathered into PCI.
Yes, I have got the point. Thanks for the guidance.
Both nvme-pci and nvme-fc/rdma have a "shutdown" part that tears down the
queues/connections, quiesces the queues, and cancels outstanding requests...
We could call this part "shutdowning", together with the scheduling gap.
Because of the difference in initialization between nvme-pci and
nvme-fc/rdma, we could call it "reiniting" for nvme-pci and "reconnecting"
for nvme-fc/rdma.
>
>
> On 1/17/2018 1:24 AM, jianchao.wang wrote:
>>
>>> Maybe we could do like this;
>>> In nvme fc/rdma
>>> - set state to RESET_PREPARE, queue reset_work/err_work
>>> - clear/shutdown works, set state to RECONNECTING
>>> - initialization, set state to LIVE
>>>
>>> In nvme pci
>>> - set state to RESET_PREPARE, queue reset_work
>>> - clear/shutdown works, set state to RESETTING
>>> - initialization, set state to LIVE
>>> Currently, RECONNECTING has overlapped with RESETTING.
>>> So I suggest to use RESET_PREPARE to mark the "suspend" part as you said.
>>> And use RECONNECTING to mark the "resume" part in nvme-rdma/fc
>>> use RESETTING part to mark the "resume" part in nvme-pci, nvmt-loop.
>> I have added a comment above the definition of nvme_ctrl_state as below:
>> +/*
>> + * RESET_PREPARE - mark the state of scheduling gap and disable procedure
>> + * RESETTING - nvme-pci, nvmet loop use it to mark the state of setup
>> + * procedure
>> + * RECONNECTING - nvme-fc, nvme-rdma use it to mark the state of setup
>> + * and reconnect procedure
>> + */
>>
>> Consequently, nvme-fc/rdma don't use RESETTING state any more, but only
>> RESET_PREPARE.
>
> So now I'm confused again... The purpose of the new state was to have a
> state indicator to cover the transition into the state while the reset_work
> item was yet to be scheduled. If all you've done is change the state name on
> fc/rdma - don't you still have that window ?
This should be made clear by the explanation at the top. :)
Many thanks again.
>
> -- james