On Thu, Nov 07, 2013 at 11:57:33PM +0100, Jan Kara wrote:
> On Thu 07-11-13 23:43:52, Frederic Weisbecker wrote:
> > 2013/11/7 Jan Kara <j...@suse.cz>:
> > > A CPU can be caught in console_unlock() for a long time (tens of seconds
> > > are reported by our customers) when other CPUs are using printk heavily
> > > and the serial console makes printing slow. Even though serial console
> > > drivers call touch_nmi_watchdog(), this still triggers softlockup
> > > warnings, because interrupts are disabled for the whole time
> > > console_unlock() runs (e.g. vprintk() calls console_unlock() with
> > > interrupts disabled). Thus IPIs cannot be processed and other CPUs get
> > > stuck spinning in calls like smp_call_function_many(). Also RCU
> > > eventually starts reporting lockups.
> > >
> > > In my artificial testing I can also easily trigger a situation where a
> > > disk disappears from the system, apparently because its interrupt wasn't
> > > served for too long. This is why just silencing the watchdogs isn't a
> > > reliable solution to the problem and we simply have to avoid spending
> > > too long in console_unlock() with interrupts disabled.
> > >
> > > The solution this patch works toward is to postpone printing to a later
> > > moment / different CPU once we have already printed over X characters in
> > > the current console_unlock() invocation. This is a crude heuristic, but
> > > measuring the time we spent printing doesn't seem to be really viable -
> > > we cannot rely on high-resolution time being available, and with
> > > interrupts disabled jiffies are not updated. The user can tune the value
> > > X via the printk.offload_chars kernel parameter.
> > >
> > > Reviewed-by: Steven Rostedt <rost...@goodmis.org>
> > > Signed-off-by: Jan Kara <j...@suse.cz>
> >
> > When a message takes tens of seconds to be printed, it usually means
> > we are in trouble somehow :)
> > I wonder what printk source can trigger such a high volume.
> Machines with tens of processors and thousands of SCSI devices. When
> device discovery happens on boot, all processors are busily reporting new
> SCSI devices and one poor loser is bound to do the printing for ever and
> ever until the machine dies...
>
> Or try running sysrq-t on a large machine with a serial console connected.
> The machine will die because of lockups (although in this case I agree it
> is more of a problem of sysrq-t doing lots of printing in
> interrupt-disabled context).
>
> > Maybe cutting some huge message into smaller chunks could help? That
> > would re-enable interrupts between each call.
> >
> > It's hard to tell without the context, but using other CPUs for
> > rescuing doesn't look like a good solution. What if the issue happens
> > on UP to begin with?
> The real trouble in practice is that while one CPU is doing the printing,
> other CPUs keep appending to the printk buffer, so that CPU can be
> printing for a *long* time. So offloading the work to the other CPUs,
> which are also appending messages, seems a fair thing to do.
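
[To make the character-count heuristic described above concrete, here is a
minimal sketch of the kind of check the patch description implies. It is not
the actual patch; emit_one_record(), pending_records() and
wake_up_offload_worker() are hypothetical placeholders for the real
console_unlock() internals, and 1000 is an arbitrary default.]

	/*
	 * Sketch only: count how many characters the current
	 * console_unlock()-style print loop has pushed to the console and
	 * bail out once a tunable threshold is crossed, leaving the rest
	 * of the buffered messages for another CPU / a later moment.
	 */
	extern unsigned long emit_one_record(void);	/* returns chars written */
	extern int pending_records(void);
	extern void wake_up_offload_worker(void);

	static unsigned int offload_chars = 1000;	/* printk.offload_chars */

	static void print_until_threshold(void)
	{
		unsigned long printed = 0;

		while (pending_records()) {
			printed += emit_one_record();

			/* Printed too much with interrupts off: hand off. */
			if (offload_chars && printed > offload_chars) {
				wake_up_offload_worker();
				break;
			}
		}
	}
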
Ok, I see now. But then this irq_work-based solution won't work if, say, you
run in full dynticks mode. Also, the hook on the timer interrupt is something
I wish we could get rid of on archs that can trigger self-IPIs. In any case,
it's going to have scalability issues, since irq_work then converges to a
single list for unbound works.

Offloading to a workqueue would perhaps be better, and writing to the serial
console could then be done with interrupts enabled, in preemptible context,
etc...
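
[A rough sketch of what the workqueue-based offload suggested above could
look like, using the standard kernel workqueue API. This is not a proposed
patch; printk_drain_console() is a hypothetical stand-in for the loop that
drains the log buffer to the consoles.]

	#include <linux/workqueue.h>

	extern void printk_drain_console(void);	/* hypothetical: drains log buffer */

	static void printk_offload_fn(struct work_struct *work)
	{
		/* Runs in process context: interrupts on, preemption possible. */
		printk_drain_console();
	}

	static DECLARE_WORK(printk_offload_work, printk_offload_fn);

	/* Called when the current CPU decides it has printed enough. */
	static void printk_offload(void)
	{
		schedule_work(&printk_offload_work);
	}
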