On Wed, Nov 19, 2014 at 09:59:02AM -0500, Dave Jones wrote:
> On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
> > On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones <da...@redhat.com> wrote:
> > >
> > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
> > > CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
> > > RIP: 0010:[<ffffffff8a1798b4>]  [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
> > > Call Trace:
> > >  [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
> > >  [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d
> >
> > Hmm, if we are getting soft-lockups here, maybe it suggests too much exit-work.
> >
> > Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?
> >
> > That makes me wonder: does the problem go away if you disable NOHZ?
>
> Apparently not.
>
> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175]
> CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff8800364e44d0 ti: ffff880192d2c000 task.ti: ffff880192d2c000
> RIP: 0010:[<ffffffff94175be7>]  [<ffffffff94175be7>] context_tracking_user_exit+0x57/0x120
> RSP: 0018:ffff880192d2fee8  EFLAGS: 00000246
> RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47
> RDX: 0000000000000001 RSI: ffffffff94ac1e84 RDI: ffffffff94a93725
> RBP: ffff880192d2fef8 R08: 00007f9b74d0b740 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: ffffffff940d8503
> R13: ffff880192d2fe98 R14: ffffffff943884e7 R15: ffff880192d2fe48
> FS:  00007f9b74d0b740(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000336f1b7740 CR3: 0000000229a95000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
>  ffff880192d30000 0000000000080000 ffff880192d2ff78 ffffffff94012c25
>  00007f9b747a5000 00007f9b747a5068 0000000000000000 0000000000000000
>  0000000000000000 ffffffff9437b3be 0000000000000000 0000000000000000
> Call Trace:
>  [<ffffffff94012c25>] syscall_trace_enter_phase1+0x125/0x1a0
>  [<ffffffff9437b3be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>  [<ffffffff947d41bf>] tracesys+0x14/0x4a
> Code: 42 fd ff 48 c7 c7 7a 1e ac 94 e8 25 29 21 00 65 8b 04 25 34 f7 1c 00 83 f8 01 74 28 f6 c7 02 74 13 0f 1f 00 e8 bb 43 fd ff 53 9d <5b> 41 5c 5d c3 0f 1f 40 00 53 9d e8 89 42 fd ff eb ee 0f 1f 80
> sending NMI to other CPUs:
> NMI backtrace for cpu 1
> CPU: 1 PID: 25164 Comm: trinity-c64 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff88011600dbc0 ti: ffff8801a99a4000 task.ti: ffff8801a99a4000
> RIP: 0010:[<ffffffff940fb71e>]  [<ffffffff940fb71e>] generic_exec_single+0xee/0x1a0
> RSP: 0018:ffff8801a99a7d18  EFLAGS: 00000202
> RAX: 0000000000000000 RBX: ffff8801a99a7d20 RCX: 0000000000000038
> RDX: 00000000000000ff RSI: 0000000000000008 RDI: 0000000000000000
> RBP: ffff8801a99a7d78 R08: ffff880242b57ce0 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
> R13: 0000000000000001 R14: ffff880083c28948 R15: ffffffff94166aa0
> FS:  00007f9b74d0b740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000001 CR3: 00000001d8611000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
>  ffff8801a99a7d28 0000000000000000 ffffffff94166aa0 ffff880083c28948
>  0000000000000003 00000000e38f9aac ffff880083c28948 00000000ffffffff
>  0000000000000003 ffffffff94166aa0 ffff880083c28948 0000000000000001
> Call Trace:
>  [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
>  [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
>  [<ffffffff940fb89a>] smp_call_function_single+0x6a/0xe0

One thing that happens a lot in your crashes is a CPU sending IPIs. Maybe it's stuck polling on csd->lock or something (rough sketch of that wait loop at the end of this mail). But it's not the CPU that soft locks up. At least not the first one that gets reported.

>  [<ffffffff940a172b>] ? preempt_count_sub+0x7b/0x100
>  [<ffffffff941671aa>] perf_event_read+0xca/0xd0
>  [<ffffffff94167240>] perf_event_read_value+0x90/0xe0
>  [<ffffffff941689c6>] perf_read+0x226/0x370
>  [<ffffffff942fbfb7>] ? security_file_permission+0x87/0xa0
>  [<ffffffff941eafff>] vfs_read+0x9f/0x180
>  [<ffffffff941ebbd8>] SyS_read+0x58/0xd0
>  [<ffffffff947d42c9>] tracesys_phase2+0xd4/0xd9
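
For the csd->lock theory above, here is a minimal sketch of the wait side of
smp_call_function_single(), reconstructed from memory of kernel/smp.c around
v3.18. Treat the exact flag and field names as assumptions and check the real
tree; the point is only that the sender spins with no timeout:

	/* Fragment of kernel/smp.c (approximate, v3.18-era). */
	static void csd_lock_wait(struct call_single_data *csd)
	{
		/*
		 * Busy-wait until the target CPU has run the queued function
		 * and cleared CSD_FLAG_LOCK.  If the target never handles
		 * the IPI, the sender spins here forever and the watchdog
		 * ends up flagging the *sender*, which would be consistent
		 * with generic_exec_single() showing up in cpu 1's NMI
		 * backtrace above rather than in the soft-lockup report.
		 */
		while (csd->flags & CSD_FLAG_LOCK)
			cpu_relax();
	}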