On Wed 16-01-13 14:50:05, Andrew Morton wrote:
> > > > We fix the issue by printing at most 1 KB of messages (unless we are in
> > > > an early boot stage or oops is happening) in one console_unlock() call.
> > > > The rest of the buffer will be printed either by further callers to
> > > > printk() or by a queued work.
> > >
> > > Complex.  Did you try just putting a touch_nmi_watchdog() in the loop?
> >   I didn't try that. I suppose touch_nmi_watchdog() +
> > rcu_cpu_stall_reset() would make the messages go away (yes, RCU eventually
> > freaks out as well). But is it really sane that we keep a single CPU busy,
> > unusable for anything else, for such a long time? There can be no RCU
> > callbacks processed, no IPIs are processed (which is what triggers
> > softlockup reports), etc.
>
> What's not sane is doing large amounts of printk over a slow device.
  Yeah, but serial consoles are handy... These sysadmins are pretty
religious about them and when the machine doesn't boot with serial console
enabled they complain. Nitpickers!
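
[Illustration: the bounded-printing idea in the quoted patch description can be modelled in user space roughly as below. This is only a sketch under stated assumptions, not the patch itself: `drain_once`, the `deque` standing in for the kernel log buffer, and the choice to check the budget before each message (so a call may overshoot by up to one message) are all hypothetical. The early-boot and oops exceptions are not modelled.]

```python
from collections import deque

BUDGET = 1024  # "at most 1 KB" per call, as in the quoted patch description


def drain_once(log: deque, budget: int = BUDGET) -> list:
    """Emit queued messages until roughly `budget` bytes have been written.

    Returns the messages emitted by this call. Whatever is still in `log`
    afterwards is left for a later caller (or deferred work) to pick up.
    The budget is checked before each message, so one call can overshoot
    by at most one message; that detail is an assumption of this sketch.
    """
    emitted, written = [], 0
    while log and written < budget:
        msg = log.popleft()
        emitted.append(msg)
        written += len(msg)
    return emitted


# Hypothetical workload: 10 queued messages of 400 bytes each.
log = deque("x" * 400 for _ in range(10))
calls = 0
while log:
    drain_once(log)
    calls += 1
print(f"buffer drained in {calls} bounded calls")
```

[The point of the bound is that no single caller is stuck emptying the whole buffer; the cost is that something else has to come back for the remainder.]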

> > I agree that if we silence all the warnings, everything will eventually
> > hang waiting for the stalled CPU, that will finish the printing and things
> > start from the beginning (we tried silencing RCU with rcu_cpu_stall_reset()
> > and that makes the machine boot eventually). But it seems like papering
> > over a real problem?
>
> Well not really - we're doing what the printk() caller asked us to do -
> to synchronously print stuff.  And simply sitting there pumping out the
> characters is the simplest, most straightforward thing to do.  And
> printk() should be simple and straightforward.
  Except that printk() isn't always synchronous. It writes a message into
the kernel buffer. If there is no one else pumping characters from the
buffer into the console at that moment, the printk() caller starts doing so.
But otherwise printk() just returns and the pumping of characters is left to
whoever started doing that thankless job. So that poor guy ends up doing the
pumping for all the others...

> If this is all a problem then the calling code should stop doing so
> much printing!
  It's mostly device discovery that triggers the issues in practice. They
have over a thousand SCSI disks attached (multipath in use) and when each
disk prints ~400 bytes of messages (just check your dmesg) you end up with
~30 s worth of printing over a 115200 baud console.

> And punting the operation to a kernel thread is a pretty radical change
> - it surely adds significant risk that output will be lost.
  I agree there is a higher chance the output will be lost.

> So hrm, I dunno.  Can we just put the touch_nmi_watchdog() in there
> initially, see if it fixes things?  If people continue to hit problems
> then we can take a second look?
  OK, I'll see if I can get this tested on one of those machines...
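
[Illustration: the ~30 s figure above can be checked with a quick back-of-the-envelope calculation. The sketch below assumes 8N1 framing (10 bits on the wire per byte) and exactly 1000 disks; neither number is stated in the mail, and `drain_seconds` is a hypothetical helper name.]

```python
def drain_seconds(total_bytes: int, baud: int, bits_per_byte: int = 10) -> float:
    """Seconds needed to push `total_bytes` through a serial line at `baud`.

    bits_per_byte defaults to 10 (1 start + 8 data + 1 stop bit), which is
    an assumption about the console framing.
    """
    bytes_per_second = baud / bits_per_byte
    return total_bytes / bytes_per_second


disks = 1000          # "over a thousand" in the mail; 1000 taken as a floor
bytes_per_disk = 400  # "~400 bytes of messages" per disk
total = disks * bytes_per_disk

print(f"{total} bytes at 115200 baud: {drain_seconds(total, 115200):.1f} s")
```

[With these assumptions the result is about 35 s, the same order as the ~30 s quoted, and all of it is spent by whichever CPU happens to be doing the pumping.]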

								Honza
-- 
Jan Kara <j...@suse.cz>
SUSE Labs, CR