Hi!

> > > A CPU can be caught in console_unlock() for a long time (tens of seconds
> > > are reported by our customers) when other CPUs are using printk heavily
> > > and serial console makes printing slow. Although serial console drivers
> > > call touch_nmi_watchdog(), this triggers softlockup warnings
> > > because interrupts are disabled for the whole time console_unlock() runs
> > > (e.g. vprintk() calls console_unlock() with interrupts disabled). Thus
> > > IPIs cannot be processed and other CPUs get stuck spinning in calls like
> > > smp_call_function_many(). Also RCU eventually starts reporting lockups.
> > >
> > > In my artificial testing I can also easily trigger a situation where a disk
> > > disappears from the system, apparently because an interrupt from it wasn't
> > > served for too long. This is why just silencing watchdogs isn't a
> > > reliable solution to the problem and we simply have to avoid spending
> > > too long in console_unlock() with interrupts disabled.
> > >
> > > The solution this patch works toward is to postpone printing to a later
> > > moment / different CPU when we already printed over X characters in
> > > current console_unlock() invocation. This is a crude heuristic but
> > > measuring time we spent printing doesn't seem to be really viable - we
> > > cannot rely on high resolution time being available and with interrupts
> > > disabled jiffies are not updated. User can tune the value X via
> > > printk.offload_chars kernel parameter.
> > >
> > > Reviewed-by: Steven Rostedt <rost...@goodmis.org>
> > > Signed-off-by: Jan Kara <j...@suse.cz>
> >
> > When a message takes tens of seconds to be printed, it usually means
> > we are in trouble somehow :)
> > I wonder what printk source can trigger such a high volume.
>
> Machines with tens of processors and thousands of scsi devices. When
> device discovery happens on boot, all processors are busily reporting new
> scsi devices and one poor loser is bound to do the printing for ever and
> ever until the machine dies...

Dunno. In these cases, would it make sense to:

1) reduce the amount of text printed
2) just print [XXX characters lost] on overruns?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
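
To make option 2 concrete, here is a rough user-space sketch in C. It is not kernel code and not taken from the patch: struct bounded_log, log_put(), log_flush() and LOG_CAPACITY are all made-up names, and the tiny capacity is only there to make the overrun easy to see. The idea is simply that a message which does not fit is counted instead of queued, and the flush appends a single overrun marker.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
#define LOG_CAPACITY 16	/* deliberately tiny so overruns are easy to trigger */

struct bounded_log {
	char buf[LOG_CAPACITY];	/* queued text, not NUL-terminated */
	size_t used;		/* bytes currently queued */
	size_t lost;		/* characters dropped since the last flush */
};

/* Queue msg, or count its length as lost if it does not fit entirely. */
static void log_put(struct bounded_log *log, const char *msg)
{
	size_t len = strlen(msg);

	if (len > LOG_CAPACITY - log->used) {
		log->lost += len;
		return;
	}
	memcpy(log->buf + log->used, msg, len);
	log->used += len;
}

/*
 * Write the queued text into out (always NUL-terminated when out_size > 0),
 * followed by one overrun marker if anything was dropped, then reset the log.
 */
static void log_flush(struct bounded_log *log, char *out, size_t out_size)
{
	if (log->lost)
		snprintf(out, out_size, "%.*s[%zu characters lost]\n",
			 (int)log->used, log->buf, log->lost);
	else
		snprintf(out, out_size, "%.*s", (int)log->used, log->buf);

	log->used = 0;
	log->lost = 0;
}
```

The sketch drops whole messages rather than truncating them, which keeps the queued output readable; whether that is the right trade-off for a real console path is a separate question.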