On 2/25/2026 5:42 PM, Bjorn Andersson wrote:
> On Mon, Feb 23, 2026 at 03:43:56PM -0600, Shah, Tanmay wrote:
>> Hello,
>>
>> Thank you for the reviews. My response below:
>>
>> On 2/23/2026 1:27 PM, Bjorn Andersson wrote:
>>> On Mon, Feb 23, 2026 at 10:50:05AM -0800, Tanmay Shah wrote:
>>>> Current attach on recovery mechanism loads the clean resource table
>>>> during recovery, but doesn't re-allocate the resources. RPMsg
>>>> communication will fail after recovery due to this. Fix this
>>>> incorrect behavior by doing the full detach and attach of remote
>>>> processor during the recovery. This will load the clean resource table
>>>> and re-allocate all the resources, which will set up correct vring
>>>> information in the resource table.
>>>>
>>>> Signed-off-by: Tanmay Shah <[email protected]>
>>>> ---
>>>>
>>>> Changes in v3:
>>>>  - both rproc_attach_recovery() and
>>>>    rproc_boot_recovery() are called the same way.
>>>>  - remove unrelated changes
>>>>
>>>> Changes in v2:
>>>>  - use rproc_boot instead of rproc_attach
>>>>  - move debug message early in the function
>>>>
>>>>  drivers/remoteproc/remoteproc_core.c | 33 +++++++++++-----------------
>>>>  1 file changed, 13 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/drivers/remoteproc/remoteproc_core.c 
>>>> b/drivers/remoteproc/remoteproc_core.c
>>>> index aada2780b343..790ad7c6d12e 100644
>>>> --- a/drivers/remoteproc/remoteproc_core.c
>>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>>> @@ -1777,11 +1777,11 @@ static int rproc_attach_recovery(struct rproc 
>>>> *rproc)
>>>>  {
>>>>    int ret;
>>>>  
>>>> -  ret = __rproc_detach(rproc);
>>>> +  ret = rproc_detach(rproc);
>>>>    if (ret)
>>>>            return ret;
>>>>  
>>>> -  return __rproc_attach(rproc);
>>>> +  return rproc_boot(rproc);
>>>>  }
>>>>  
>>>>  static int rproc_boot_recovery(struct rproc *rproc)
>>>> @@ -1790,10 +1790,14 @@ static int rproc_boot_recovery(struct rproc *rproc)
>>>>    struct device *dev = &rproc->dev;
>>>>    int ret;
>>>>  
>>>> -  ret = rproc_stop(rproc, true);
>>>> +  ret = mutex_lock_interruptible(&rproc->lock);
>>>>    if (ret)
>>>>            return ret;
>>>>  
>>>> +  ret = rproc_stop(rproc, true);
>>>> +  if (ret)
>>>> +          goto unlock_mutex;
>>>> +
>>>>    /* generate coredump */
>>>>    rproc->ops->coredump(rproc);
>>>>  
>>>> @@ -1801,7 +1805,7 @@ static int rproc_boot_recovery(struct rproc *rproc)
>>>>    ret = request_firmware(&firmware_p, rproc->firmware, dev);
>>>>    if (ret < 0) {
>>>>            dev_err(dev, "request_firmware failed: %d\n", ret);
>>>> -          return ret;
>>>> +          goto unlock_mutex;
>>>>    }
>>>>  
>>>>    /* boot the remote processor up again */
>>>> @@ -1809,6 +1813,8 @@ static int rproc_boot_recovery(struct rproc *rproc)
>>>>  
>>>>    release_firmware(firmware_p);
>>>>  
>>>> +unlock_mutex:
>>>> +  mutex_unlock(&rproc->lock);
>>>>    return ret;
>>>>  }
>>>>  
>>>> @@ -1827,26 +1833,13 @@ static int rproc_boot_recovery(struct rproc *rproc)
>>>>  int rproc_trigger_recovery(struct rproc *rproc)
>>>>  {
>>>>    struct device *dev = &rproc->dev;
>>>> -  int ret;
>>>> -
>>>> -  ret = mutex_lock_interruptible(&rproc->lock);
>>>> -  if (ret)
>>>> -          return ret;
>>>> -
>>>> -  /* State could have changed before we got the mutex */
>>>> -  if (rproc->state != RPROC_CRASHED)
>>>> -          goto unlock_mutex;
>>>>  
>>>>    dev_err(dev, "recovering %s\n", rproc->name);
>>>>  
>>>>    if (rproc_has_feature(rproc, RPROC_FEAT_ATTACH_ON_RECOVERY))
>>>> -          ret = rproc_attach_recovery(rproc);
>>>> +          return rproc_attach_recovery(rproc);
>>>
>>> rproc_trigger_recovery() can be called either from scheduled work or
>>> directly from the debugfs/sysfs interface, it doesn't seem safe to me to
>>> call rproc_attach_recovery() without ensuring mutual exclusion between
>>> multiple parallel callers.
>>>
>>
>> I think mutual exclusion is still maintained.
>>
>>> In fact, I can see the relationship between the commit message and the
>>> changes in rproc_attach_recovery() and rproc_detach(), but I'm not sure
>>> why you need to change rproc_boot_recovery() and
>>> rproc_trigger_recovery(). Perhaps you're just missing some explanation
>>> in the commit message?
>>>
>>
>> Here, I am refactoring how lock is used and that is why I have to modify
>> rproc_trigger_recovery() and rproc_boot_recovery().
>>
>> Before:
>>
>> rproc_trigger_recovery() -> lock() -> __rproc_detach() /
>> rproc_boot_recovery() -> unlock()
>>
>> Now, __rproc_detach is replaced with rproc_detach(), which already has
>> mutual exclusion implemented within the call.
>>
>> After:
>>
>> 1) for attach recovery
>> rproc_trigger_recovery() -> rproc_attach_recovery() -> rproc_detach() ->
>> lock() -> ... -> unlock() -> rproc_boot() -> lock() ... -> unlock()
>>
> 
> The concern I had was that we're letting others execute inbetween
> rproc_detach() and rproc_boot(). But perhaps that's okay?
> 
>> 2) To call rproc_attach_recovery() and rproc_boot_recovery() in the same
>> manner, I modified rproc_boot_recovery() and introduced mutual exclusion
>> around it.
>>
> 
> No, in rproc_boot_recovery() you're locking around stop and start. In
> rproc_attach_recovery() you're locking within each part.
> 
> Looking more at this, I dislike the asymmetry between rproc_detach() vs
> rproc_boot(), and how I presume this then relies on rproc_boot() mostly
> just calling rproc_attach().
> 
> This seems to come from the fact that rproc_attach() and rproc_start()
> aren't symmetrical and rproc_detach() and rproc_stop() aren't.
> 
> It would be good if we could get this lined up, so that it's not so hard
> to reason about the different code paths through the core.
> 
> 
> Regardless of this though, the removal of the check for state !=
> RPROC_CRASHED (under a lock) means that inbetween
> rproc_crash_handler_work() and rproc_trigger_recovery() someone can
> write "recover" into the "recovery" debugfs entry and we will recover
> twice.
> 

Okay, to address this concern I will maintain original code in the
rproc_trigger_reocvery().

In, rproc_attach_recovery() I won't be calling rproc_detach() and
rproc_boot(). Instead I will programming sequence needed from
rproc_detach(), and then call rproc_attach() directly. This will help
maintain mutual exclusion same as before during the recovery process.

This will duplicate some code, but that can be refactored later, when
asymmetry of rproc_detach() vs rproc_stop() will be fixed.

I hope that's fine. I am open to any other proposal as well.

If you prefer to see v4, before further reviews let me know.

Thanks,
Tanmay


> Regards,
> Bjorn
> 
>> If you prefer, I can add commit message explaining this change. This is
>> only refactoring of the code and no new feature though.
>> Let me know if something is still missing in the implementation or in
>> the above explanation.
>>
>> Thank You,
>> Tanmay
>>
>>> Regards,
>>> Bjorn
>>>
>>>>    else
>>>> -          ret = rproc_boot_recovery(rproc);
>>>> -
>>>> -unlock_mutex:
>>>> -  mutex_unlock(&rproc->lock);
>>>> -  return ret;
>>>> +          return rproc_boot_recovery(rproc);
>>>>  }
>>>>  
>>>>  /**
>>>> @@ -2057,7 +2050,7 @@ int rproc_detach(struct rproc *rproc)
>>>>            return ret;
>>>>    }
>>>>  
>>>> -  if (rproc->state != RPROC_ATTACHED) {
>>>> +  if (rproc->state != RPROC_ATTACHED && rproc->state != RPROC_CRASHED) {
>>>>            ret = -EINVAL;
>>>>            goto out;
>>>>    }
>>>> -- 
>>>> 2.34.1
>>>>
>>


Reply via email to