Michael Ellerman <m...@ellerman.id.au> writes:
> Michael Roth <mdr...@linux.vnet.ibm.com> writes:
>> Quoting Nathan Lynch (2020-08-07 02:05:09)
> ...
>>> wait_for_cpu_stopped() should be able to accommodate a time-based
>>> warning if necessary, but speaking as a likely recipient of any bug
>>> reports that would arise here, I'm not convinced of the need and I
>>> don't know what a good value would be. It's relatively easy to sample
>>> the stack of a task that's apparently failing to make progress, plus I
>>> probably would use 'perf probe' or similar to report the inputs and
>>> outputs for the RTAS call.
>>
>> I think if we make the timeout sufficiently high like 2 minutes or so
>> it wouldn't hurt and if we did see them it would probably point to an
>> actual bug. But I don't have a strong feeling either way.
>
> I think we should print a warning after 2 minutes.
>
> It's true that there are fairly easy mechanisms to work out where the
> thread is stuck, but customers are unlikely to use them. They're just
> going to report that it's stuck with no further info, and probably
> reboot the machine before we get a chance to get any further info.
>
> Whereas if the kernel prints a warning with a stack trace we at least
> have that to go on in an initial bug report.
>
>>> I'm happy to make this a proper submission after I can clean it up and
>>> retest it, or Michael R. is welcome to appropriate it, assuming it's
>>> acceptable.
>>>
>>
>> I've given it a shot with this patch and it seems to be holding up in
>> testing. If we don't think the ~2 minutes warning message is needed I
>> can clean it up to post:
>>
>> https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e
>>
>> I'd likely break the refactoring patches out to a separate patch under
>> Nathan's name since it fixes a separate bug potentially.
>
> While I like Nathan's refactoring, we probably want to do the minimal
> fix first to ease backporting.
>
> Then do the refactoring on top of that.

Fair enough, thanks.
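
For reference, the "keep waiting, but warn once after a threshold" behaviour discussed above could look roughly like the user-space sketch below. This is not the code from the linked commit. The function name wait_for_cpu_stopped_sketch(), the probe callback, and the poll-count threshold are all hypothetical stand-ins: a poll count replaces the ~2 minute wall-clock limit so the example stays deterministic, and fprintf() stands in for a kernel warning with a stack trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical probe: returns true once the given CPU reports "stopped". */
typedef bool (*cpu_stopped_fn)(unsigned int cpu, void *ctx);

/*
 * Poll until probe() reports the CPU as stopped. The loop never gives up;
 * it only emits a single warning once warn_after_polls unsuccessful polls
 * have elapsed (an assumed stand-in for a ~2 minute time limit).
 * Returns the number of unsuccessful polls; *warned records whether the
 * warning fired.
 */
static unsigned long wait_for_cpu_stopped_sketch(unsigned int cpu,
						 cpu_stopped_fn probe,
						 void *ctx,
						 unsigned long warn_after_polls,
						 bool *warned)
{
	unsigned long polls = 0;

	assert(probe != NULL && warned != NULL);
	*warned = false;

	while (!probe(cpu, ctx)) {
		polls++;
		if (!*warned && polls >= warn_after_polls) {
			/* Stand-in for a kernel warning plus stack trace. */
			fprintf(stderr,
				"CPU %u not stopped after %lu polls, still waiting\n",
				cpu, polls);
			*warned = true;
		}
	}

	return polls;
}

/* Example probe: reports "stopped" once the counter in ctx reaches zero. */
static bool stopped_after_n(unsigned int cpu, void *ctx)
{
	unsigned long *remaining = ctx;

	(void)cpu;
	if (*remaining == 0)
		return true;
	(*remaining)--;
	return false;
}
```

Calling wait_for_cpu_stopped_sketch(0, stopped_after_n, &n, 2, &warned) with n = 3 polls three times unsuccessfully and warns on the second poll. The point is that the warning is purely diagnostic: the wait itself is unchanged, which should keep a minimal fix easy to backport.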