On Wed, 2022-11-30 at 11:35 +0100, Pierre Muller via cfarm-users wrote:
> Just got this:
> Message from syslogd@gcc102 at Nov 30 04:31:20 ...
>  kernel:[47393.509723] watchdog: BUG: soft lockup - CPU#2 stuck for 48s! [ppc2:203070]
>
> Can I do anything to help figuring out the problem?

Not sure. Ideas are certainly welcome. Had you used this machine much
before a few days ago? This issue had occurred maybe twice in the last
two years.

Here is what is being printed, over and over again:

[60140.204902] watchdog: BUG: soft lockup - CPU#2 stuck for 11919s! [ppc2:203070]
[60140.608860] watchdog: BUG: soft lockup - CPU#136 stuck for 10914s! [in:imklog:2885]
[60148.356060] watchdog: BUG: soft lockup - CPU#54 stuck for 11700s! [sshd:103658]
[60152.803603] watchdog: BUG: soft lockup - CPU#195 stuck for 10530s! [exim4:3249]
[60160.346825] watchdog: BUG: soft lockup - CPU#51 stuck for 11830s! [kworker/u512:1:103663]
[60164.202428] watchdog: BUG: soft lockup - CPU#2 stuck for 11942s! [ppc2:203070]
[60164.398409] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[60164.410066] rcu: 65-...0: (1 GPs behind) idle=279c/1/0x4000000000000000 softirq=26048/26049 fqs=1067360
[60164.428982] rcu: 175-...0: (6 ticks this GP) idle=31ec/1/0x4000000000000000 softirq=75868/75868 fqs=1067361
[60164.606387] watchdog: BUG: soft lockup - CPU#136 stuck for 10936s! [in:imklog:2885]
[60172.353588] watchdog: BUG: soft lockup - CPU#54 stuck for 11722s! [sshd:103658]
[60176.801129] watchdog: BUG: soft lockup - CPU#195 stuck for 10553s! [exim4:3249]
[60184.344352] watchdog: BUG: soft lockup - CPU#51 stuck for 11853s! [kworker/u512:1:103663]
[60188.199955] watchdog: BUG: soft lockup - CPU#2 stuck for 11964s! [ppc2:203070]
[60188.603913] watchdog: BUG: soft lockup - CPU#136 stuck for 10959s! [in:imklog:2885]
[60196.351113] watchdog: BUG: soft lockup - CPU#54 stuck for 11745s! [sshd:103658]
[60200.798657] watchdog: BUG: soft lockup - CPU#195 stuck for 10575s! [exim4:3249]
[60208.341880] watchdog: BUG: soft lockup - CPU#51 stuck for 11875s! [kworker/u512:1:103663]
[60212.197482] watchdog: BUG: soft lockup - CPU#2 stuck for 11986s! [ppc2:203070]
[60212.601441] watchdog: BUG: soft lockup - CPU#136 stuck for 10981s! [in:imklog:2885]
[60220.348642] watchdog: BUG: soft lockup - CPU#54 stuck for 11767s! [sshd:103658]
[60224.796184] watchdog: BUG: soft lockup - CPU#195 stuck for 10597s! [exim4:3249]
[60227.463909] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[60227.475543] rcu: 65-...0: (1 GPs behind) idle=279c/1/0x4000000000000000 softirq=26048/26049 fqs=1072610
[60227.494469] rcu: 175-...0: (6 ticks this GP) idle=31ec/1/0x4000000000000000 softirq=75868/75868 fqs=1072611
[60232.339409] watchdog: BUG: soft lockup - CPU#51 stuck for 11897s! [kworker/u512:1:103663]
[60236.195010] watchdog: BUG: soft lockup - CPU#2 stuck for 12009s! [ppc2:203070]
[60236.598968] watchdog: BUG: soft lockup - CPU#136 stuck for 11003s! [in:imklog:2885]
[60244.346169] watchdog: BUG: soft lockup - CPU#54 stuck for 11789s! [sshd:103658]
[60248.793715] watchdog: BUG: soft lockup - CPU#195 stuck for 10620s! [exim4:3249]
[60256.336936] watchdog: BUG: soft lockup - CPU#51 stuck for 11920s! [kworker/u512:1:103663]
[60260.192538] watchdog: BUG: soft lockup - CPU#2 stuck for 12031s! [ppc2:203070]
[60260.596497] watchdog: BUG: soft lockup - CPU#136 stuck for 11026s! [in:imklog:2885]
[60268.343699] watchdog: BUG: soft lockup - CPU#54 stuck for 11812s! [sshd:103658]
[60272.791241] watchdog: BUG: soft lockup - CPU#195 stuck for 10642s! [exim4:3249]
[60280.334465] watchdog: BUG: soft lockup - CPU#51 stuck for 11942s! [kworker/u512:1:103663]
[60284.190067] watchdog: BUG: soft lockup - CPU#2 stuck for 12054s! [ppc2:203070]
[60284.594026] watchdog: BUG: soft lockup - CPU#136 stuck for 11048s! [in:imklog:2885]
[60290.525418] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[60290.537054] rcu: 65-...0: (1 GPs behind) idle=279c/1/0x4000000000000000 softirq=26048/26049 fqs=1077860
[60290.555978] rcu: 175-...0: (6 ticks this GP) idle=31ec/1/0x4000000000000000 softirq=75868/75868 fqs=1077861
[60292.341228] watchdog: BUG: soft lockup - CPU#54 stuck for 11834s! [sshd:103658]
[60296.788771] watchdog: BUG: soft lockup - CPU#195 stuck for 10664s! [exim4:3249]

Serial console stopped.

0-> set /HOST send_break_action=break
Set 'send_break_action' to 'break'

0-> start /host/console
Are you sure you want to start /HOST/console (y/n)? y

Serial console started.  To stop, type #.

[60395.353420] sysrq: HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) show-blocked-tasks(w) global-pmu(x) global-regs(y) dump-ftrace-buffer(z)

But I can't get it to respond to any of these. It's usually dead too
soon for me to try to get anything from 'dmesg'. That usually happens
within a few minutes of the first stall.

From a few days ago, 'fpmake' is the task that was implicated:

https://paste.debian.net/plainh/1162e193

Hoping to see if there are any kernel traces so I can report it to the
relevant kernel lists.

It's also a huge waste of time trying to reset the machine since it (a)
takes forever to boot, and (b) only boots less than 1/2 the time, and
(c) requires interaction to perform the resets. So resetting the
machine could turn into a 30-minute ordeal:

https://paste.debian.net/plainh/fb20bd17

I don't want to jump to the conclusion that it is a hardware issue, but
having other hardware to test would be helpful, should anyone have
spare memory for it or want to donate $ for research.

ZV

_______________________________________________
cfarm-users mailing list
cfarm-users@lists.tetaneutral.net
https://lists.tetaneutral.net/listinfo/cfarm-users
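
For anyone who wants to condense console output like the above before
forwarding it to a kernel list, here is a minimal sketch in Python. It
assumes the lines follow exactly the "watchdog: BUG: soft lockup" format
shown in this message. The names LOCKUP_RE, SAMPLE and summarize_lockups
are made up for illustration and are not part of any existing tool.

```python
import re

# Assumed line format, copied from the console output in this thread:
# [60140.204902] watchdog: BUG: soft lockup - CPU#2 stuck for 11919s! [ppc2:203070]
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+):(?P<pid>\d+)\]"
)

# Three lines taken verbatim from the log above, used as sample input.
SAMPLE = """\
[60140.204902] watchdog: BUG: soft lockup - CPU#2 stuck for 11919s! [ppc2:203070]
[60160.346825] watchdog: BUG: soft lockup - CPU#51 stuck for 11830s! [kworker/u512:1:103663]
[60164.202428] watchdog: BUG: soft lockup - CPU#2 stuck for 11942s! [ppc2:203070]
"""


def summarize_lockups(log_text):
    """Group soft-lockup reports by CPU number.

    Returns a dict mapping CPU number to a dict with the task name, PID,
    number of reports seen, and the largest "stuck for" value in seconds.
    Lines that do not match the assumed format (e.g. rcu stalls) are ignored.
    """
    summary = {}
    for m in LOCKUP_RE.finditer(log_text):
        cpu = int(m["cpu"])
        entry = summary.setdefault(
            cpu,
            {"task": m["task"], "pid": int(m["pid"]), "reports": 0, "max_stuck_s": 0},
        )
        entry["reports"] += 1
        entry["max_stuck_s"] = max(entry["max_stuck_s"], int(m["secs"]))
    return summary


if __name__ == "__main__":
    # One summary line per CPU, sorted by CPU number.
    for cpu, info in sorted(summarize_lockups(SAMPLE).items()):
        print(f"CPU#{cpu}: {info['task']} (pid {info['pid']}), "
              f"{info['reports']} reports, stuck up to {info['max_stuck_s']}s")
```

The task name is split from the PID at the last colon, so names that
contain colons themselves (such as in:imklog or kworker/u512:1) are kept
whole.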