From: Jan Kara <j...@suse.cz>

Hello,
here is another posting of the revived patch set to avoid lockups during heavy printing. Lately there were several attempts at dealing with softlockups due to heavy printk traffic [1] [2], and I've also been privately pinged by a couple of people about the state of the patch set, so I've decided to revive it. Older versions of the patches have been present in SUSE enterprise kernels for several years and we haven't observed any issues with them. The patch set passes my stress testing, where the serial console is slowed down to print ~1000 chars per second, 100 delayed works together print some 64k of text, and in parallel modules are inserted, which generates quite a few additional messages, stop_machine() calls, etc.

Changes since v1:
* printing kthreads now check for kthread_should_stop()
* printing kthreads are now bound to CPUs so that the scheduler cannot decide to schedule both kthreads on one CPU, which effectively makes it impossible to hand over printing between them. This happened relatively frequently in virtual machines.
* use printk buffer draining code in panic() to force all messages out when the system is dying
* better naming of logbuf flushing functions suggested by AKPM
* fixed irq safety of printing lock as pointed out by AKPM
* fixed various smaller issues pointed out by AKPM

Changes since the old patch set [3]:
* I have replaced the state machine to pass printing and spinning on console_sem with a simple spinlock, which makes the code somewhat easier to read and verify.
* Some of the patches were merged so I dropped them.

To remind the original problem:

Currently, console_unlock() prints messages from the kernel printk buffer to the console while the buffer is non-empty. When a serial console is attached, printing is slow and thus other CPUs in the system have plenty of time to append new messages to the buffer while one CPU is printing. Thus the CPU can spend an unbounded amount of time printing in console_unlock().
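
To make the unbounded loop concrete, here is a toy user-space model in Python (not kernel code). The function name, the one-tick granularity, and the append rates are assumptions made up for illustration; only the ~1000 chars per second console speed is taken from the stress test above. It is a sketch of the drain-while-non-empty behaviour, not of the real console_unlock() implementation.

```python
def ticks_stuck_printing(initial_chars, print_rate, append_rate, burst_ticks):
    """Count the ticks one CPU spends draining the buffer.

    Hypothetical model: each tick the printing CPU emits up to
    print_rate chars, while other CPUs append append_rate chars
    per tick for the first burst_ticks ticks.
    """
    pending = initial_chars
    ticks = 0
    while pending > 0:  # mirrors "print while the buffer is non-empty"
        if ticks < burst_ticks:
            pending += append_rate  # other CPUs keep logging
        pending -= min(pending, print_rate)
        ticks += 1
    return ticks


# With a 1000 chars/tick console and no other writers, 1000 chars take 1 tick.
quiet = ticks_stuck_printing(1000, 1000, 0, 0)
# If other CPUs append as fast as we print, the printing CPU stays in the
# loop for as long as the burst lasts, plus one tick to finish.
short_burst = ticks_stuck_printing(1000, 1000, 1000, 10)
long_burst = ticks_stuck_printing(1000, 1000, 1000, 10000)
```

In this model the time spent printing is bounded only by how long the other CPUs keep appending, which is the behaviour described above.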
The problem is especially serious when printk() gets called under some critical spinlock or with interrupts disabled.

In practice, users have observed that a CPU can spend tens of seconds printing in console_unlock() (usually during boot when hundreds of SCSI devices are discovered), resulting in RCU stalls (the CPU doing the printing doesn't reach a quiescent state for a long time), softlockup reports (IPIs for the printing CPU don't get served and thus other CPUs are spinning waiting for the printing CPU to process IPIs), and eventually machine death (as messages from stalls and lockups are appended to the printk buffer faster than we are able to print them). So these machines are unable to boot with a serial console attached. Also, during artificial stress testing a SATA disk disappears from the system because its interrupts aren't served for too long.

This series addresses the problem in the following way: If a CPU has printed more than printk_offload (defaults to 1000) characters, it wakes up one of the dedicated printk kthreads (we don't use a workqueue because that has deadlock potential if printk was called from workqueue code). Once we find out the kthread is spinning on a lock, we stop printing, drop console_sem, and let the kthread continue printing. Since there are two printing kthreads, they will pass printing between them and thus no CPU gets hogged by printing.

								Honza

[1] https://lkml.org/lkml/2015/7/8/215
[2] http://marc.info/?l=linux-kernel&m=143929238407816&w=2
[3] https://lkml.org/lkml/2014/3/17/68
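
As an illustration of the hand-over scheme in this series, here is a toy user-space model in Python (not the patch code). The function name and the printer labels "caller", "kthread0" and "kthread1" are hypothetical. The model also assumes that the hand-over happens exactly at the printk_offload limit and always succeeds, whereas the real code only stops printing once it sees a kthread spinning on the lock. The default of 1000 is the printk_offload default mentioned in this posting.

```python
def drain_with_handover(total_chars, printk_offload=1000):
    """Split the draining of total_chars into per-printer stints.

    Hypothetical model: the CPU that called printk() prints first;
    after printk_offload chars it hands over to a printing kthread,
    and the two kthreads then keep passing printing between them.
    """
    stints = []
    remaining = total_chars
    printer = "caller"
    next_kthread = 0
    while remaining > 0:
        chunk = min(remaining, printk_offload)
        stints.append((printer, chunk))
        remaining -= chunk
        # Hand over: the current printer drops out and the next
        # kthread (alternating between the two) continues.
        printer = f"kthread{next_kthread}"
        next_kthread = 1 - next_kthread
    return stints


small = drain_with_handover(2500)
# 64000 chars, roughly the "64k of text" from the stress test
big = drain_with_handover(64000)
longest_stint = max(chars for _, chars in big)
```

However much text is queued, no single printer in this model emits more than printk_offload characters in one stretch, which is the property the series is after.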