On Tue, Mar 22, 2016 at 06:59:32PM -0700, Paul E. McKenney wrote: > On Tue, Mar 22, 2016 at 04:53:26PM -0700, Bart Van Assche wrote: > > On 03/22/2016 01:45 PM, Paul E. McKenney wrote: > > >You are getting a soft lockup as well as an RCU CPU stall warning, so > > >it looks like something is taking a very long time in blk_done_softirq(). > > > > > >You have multiple occurrences at different times, so it looks to be > > >a long time as opposed to an infinite time. Are you perhaps doing > > >something that would make a huge amount of work for blk_done_softirq()? > > > > > >See Documentation/RCU/stallwarn.txt in the kernel source tree for more > > >info on how to debug this sort of thing. > > > > Hello Paul, > > > > None of the drivers involved in the test I ran contain RCU code that > > has been changed recently. The block and SCSI subsystems processes > > I/O completions in softirq context but until last week I hadn't seen > > any RCU lockup complaints when I ran an SRP test against a kernel > > with lockdep and several other kernel debugging options enabled. > > This is why I sent an e-mail to you. I have read > > Documentation/RCU/stallwarn.txt after I received your reply but this > > didn't provide me any clue about where to look for the root cause. > > Any further help would be appreciated. > > My suggestion would be to check the block/SCSI softirq handler for > event traces. If there are some, enable them and see what the loop > is doing. Documentation/trace/ftrace.txt describes how to enable > existing event tracing. > > If there is no event tracing, consider adding some in your local > view. Failing that, there is always printk(). ;-) > > Or perhaps you have some sort of debug setup. > > Either way, the next step is to work out why that CPU is spending > so much time in that loop.
And the dmesg leading up to the stall might have some clues. Note that a soft lockup triggered at 10509.568010, well before the RCU CPU stall warning.. And you have a second soft lockup at 10537.567212, with the same funtion scsi_request_fn() at the top of the stack in both stack traces. That function has a nice big "for (;;)" loop that does not appear to have any iteration-limiting mechanism. (Though perhaps there is such a mechanism implemented in one of the called functions, but that would be something for you to look into.) As you saw when reading stallwarn.txt, having a too-long loop in the kernel is a good way to get RCU CPU stall warnings. Also, before the soft lockups, you have a bunch of FAIL indications and other nasty-looking error messages. Might you have some sort of configuration or hardware problem? Thanx, Paul