Michael Ellerman <m...@ellerman.id.au> writes:
> Nathan Lynch via B4 Relay <devnull+nathanl.linux.ibm....@kernel.org> writes:
>> From: Nathan Lynch <nath...@linux.ibm.com>
>>
>> The kernel can handle retrying RTAS function calls in response to
>> -2/990x in the sys_rtas() handler instead of relaying the intermediate
>> status to user space.
>
> This looks good in general.
>
> One query ...
>
>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>> index 47a2aa43d7d4..c330a22ccc70 100644
>> --- a/arch/powerpc/kernel/rtas.c
>> +++ b/arch/powerpc/kernel/rtas.c
>> @@ -1798,7 +1798,6 @@ static bool block_rtas_call(int token, int nargs,
>>  /* We assume to be passed big endian arguments */
>>  SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
>>  {
>> -	struct pin_cookie cookie;
>>  	struct rtas_args args;
>>  	unsigned long flags;
>>  	char *buff_copy, *errbuf = NULL;
>> @@ -1866,20 +1865,25 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
>>
>>  	buff_copy = get_errorlog_buffer();
>>
>> -	raw_spin_lock_irqsave(&rtas_lock, flags);
>> -	cookie = lockdep_pin_lock(&rtas_lock);
>> +	do {
>> +		struct pin_cookie cookie;
>>
>> -	rtas_args = args;
>> -	do_enter_rtas(&rtas_args);
>> -	args = rtas_args;
>> +		raw_spin_lock_irqsave(&rtas_lock, flags);
>> +		cookie = lockdep_pin_lock(&rtas_lock);
>>
>> -	/* A -1 return code indicates that the last command couldn't
>> -	   be completed due to a hardware error. */
>> -	if (be32_to_cpu(args.rets[0]) == -1)
>> -		errbuf = __fetch_rtas_last_error(buff_copy);
>> +		rtas_args = args;
>> +		do_enter_rtas(&rtas_args);
>> +		args = rtas_args;
>>
>> -	lockdep_unpin_lock(&rtas_lock, cookie);
>> -	raw_spin_unlock_irqrestore(&rtas_lock, flags);
>> +		/*
>> +		 * Handle error record retrieval before releasing the lock.
>> +		 */
>> +		if (be32_to_cpu(args.rets[0]) == -1)
>> +			errbuf = __fetch_rtas_last_error(buff_copy);
>> +
>> +		lockdep_unpin_lock(&rtas_lock, cookie);
>> +		raw_spin_unlock_irqrestore(&rtas_lock, flags);
>> +	} while (rtas_busy_delay(be32_to_cpu(args.rets[0])));
>
> rtas_busy_delay_early() has the successive_ext_delays case that will
> break out eventually. But if we keep getting plain RTAS_BUSY back from
> RTAS I *think* this loop will never terminate?

Yes, but if this happens, then there is a serious bug in Linux or
RTAS. The only time I've seen something like that on PowerVM is when
Linux corrupted internal RTAS state by not serializing calls correctly.

rtas_busy_delay_early() has a bail-out heuristic, not for RTAS_BUSY, but
for extended delay statuses (990x), which I suspect happen rarely (if
ever) that early. That's there in order to allow boot to proceed and
hopefully get useful messages out in a truly unexpected circumstance.

That said...

> To avoid that, and just as good manners, I think we should have a
> fatal_signal_pending() check, and if that returns true we bail out of
> the syscall with -EINTR ?

That probably makes sense. In its current state, I could see this patch
preventing or delaying OS shutdown in situations where that wouldn't
have happened before.

I think I would want the bailout condition in this case to be
(fatal_signal_pending() && retries > some_threshold), to reduce the
likelihood of non-"stuck" operations being left unfinished. And it
should dump a stack trace.
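
For illustration only, the bailout condition described above could be
modelled in user space roughly as follows. This is a sketch under stated
assumptions, not the eventual kernel change: struct sim, sim_call(),
status_wants_retry(), call_with_bailout() and RETRY_THRESHOLD are all
hypothetical names, the threshold value is arbitrary (the thread leaves
"some_threshold" open), and the "fatal" flag merely stands in for
fatal_signal_pending(). Locking, the delay between retries, and the
stack trace are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* RTAS statuses discussed in the thread: -2 (busy) and 990x (extended delay). */
#define RTAS_BUSY		(-2)
#define RTAS_EXT_DELAY_MIN	9900
#define RTAS_EXT_DELAY_MAX	9905

/* Hypothetical value; the thread does not pick a threshold. */
#define RETRY_THRESHOLD		3

/* Hypothetical stand-in for firmware plus the calling task's signal state. */
struct sim {
	int busy_left;		/* RTAS_BUSY replies before success; < 0 means forever */
	bool fatal;		/* models the result of fatal_signal_pending() */
	unsigned int calls;	/* how many times the simulated call was entered */
};

/* Simulated firmware entry: returns RTAS_BUSY until busy_left runs out, then 0. */
static int sim_call(struct sim *s)
{
	s->calls++;
	if (s->busy_left < 0)
		return RTAS_BUSY;
	if (s->busy_left > 0) {
		s->busy_left--;
		return RTAS_BUSY;
	}
	return 0;
}

/* True for the statuses that ask the caller to try again. */
static bool status_wants_retry(int status)
{
	return status == RTAS_BUSY ||
	       (status >= RTAS_EXT_DELAY_MIN && status <= RTAS_EXT_DELAY_MAX);
}

/*
 * Retry until the call stops asking for a retry. Give up with -EINTR only
 * when a fatal signal is pending AND the retry count has passed the
 * threshold, so that operations which are merely slow still complete.
 * A kernel version would also sleep between attempts and dump a stack
 * trace on bailout; both are left out of this model.
 */
static int call_with_bailout(struct sim *s)
{
	unsigned int retries = 0;
	int status;

	assert(s != NULL);

	for (;;) {
		status = sim_call(s);
		if (!status_wants_retry(status))
			return status;
		if (s->fatal && retries > RETRY_THRESHOLD)
			return -EINTR;
		retries++;
	}
}
```

With these definitions, a call that reports busy twice and then succeeds
still completes even when the fatal flag is set, while a call that
reports busy forever is abandoned once the threshold is passed. Without
the fatal flag, the "busy forever" case still loops indefinitely, which
matches the behavior of the patch as posted.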