Zach van Rijn via cfarm-users wrote:
> On Wed, 2022-11-30 at 11:35 +0100, Pierre Muller via cfarm-users
> wrote:
> > Just got this:
> > Message from syslogd@gcc102 at Nov 30 04:31:20 ...
> >   kernel:[47393.509723] watchdog: BUG: soft lockup - CPU#2
> > stuck for 48s! [ppc2:203070]
> >
> > Can I do anything to help figuring out the problem?
>
> Not sure. Ideas are certainly welcome. Had you used this machine
> much before a few days ago? This issue had occurred maybe twice
> in the last two years.

Do you have logs farther back? Pierre Muller reported a message broadcast from syslogd at 47393.509723; at that time CPU#2 had already been stuck (nominally) for 48 seconds, so the previous minute or so of the kernel logs might be helpful.

After stitching the split log lines back together, a quick analysis (with `awk '$2 == "watchdog:" { $2 = $1; sub(/\[/,"",$2); sub(/\.[0-9]+\]/,"",$2); sub(/CPU#/, "", $7); sub(/s!/, "", $10); printf "%s %3d %6d -> %6d (%s)\n",$1,$7,$10,$2-$10,$11 }' log | sort -k 2`) yields:

8<------
[60140.204902]   2  11919 ->  48221 ([ppc2:203070])
[60164.202428]   2  11942 ->  48222 ([ppc2:203070])
[60188.199955]   2  11964 ->  48224 ([ppc2:203070])
[60212.197482]   2  11986 ->  48226 ([ppc2:203070])
[60236.195010]   2  12009 ->  48227 ([ppc2:203070])
[60260.192538]   2  12031 ->  48229 ([ppc2:203070])
[60284.190067]   2  12054 ->  48230 ([ppc2:203070])
[60160.346825]  51  11830 ->  48330 ([kworker/u512:1:103663])
[60184.344352]  51  11853 ->  48331 ([kworker/u512:1:103663])
[60208.341880]  51  11875 ->  48333 ([kworker/u512:1:103663])
[60232.339409]  51  11897 ->  48335 ([kworker/u512:1:103663])
[60256.336936]  51  11920 ->  48336 ([kworker/u512:1:103663])
[60280.334465]  51  11942 ->  48338 ([kworker/u512:1:103663])
[60148.356060]  54  11700 ->  48448 ([sshd:103658])
[60172.353588]  54  11722 ->  48450 ([sshd:103658])
[60196.351113]  54  11745 ->  48451 ([sshd:103658])
[60220.348642]  54  11767 ->  48453 ([sshd:103658])
[60244.346169]  54  11789 ->  48455 ([sshd:103658])
[60268.343699]  54  11812 ->  48456 ([sshd:103658])
[60292.341228]  54  11834 ->  48458 ([sshd:103658])
[60140.608860] 136  10914 ->  49226 ([in:imklog:2885])
[60164.606387] 136  10936 ->  49228 ([in:imklog:2885])
[60188.603913] 136  10959 ->  49229 ([in:imklog:2885])
[60212.601441] 136  10981 ->  49231 ([in:imklog:2885])
[60236.598968] 136  11003 ->  49233 ([in:imklog:2885])
[60260.596497] 136  11026 ->  49234 ([in:imklog:2885])
[60284.594026] 136  11048 ->  49236 ([in:imklog:2885])
[60152.803603] 195  10530 ->  49622 ([exim4:3249])
[60176.801129] 195  10553 ->  49623 ([exim4:3249])
[60200.798657] 195  10575 ->  49625 ([exim4:3249])
[60224.796184] 195  10597 ->  49627 ([exim4:3249])
[60248.793715] 195  10620 ->  49628 ([exim4:3249])
[60272.791241] 195  10642 ->  49630 ([exim4:3249])
[60296.788771] 195  10664 ->  49632 ([exim4:3249])
8<------
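
In case the one-liner above is hard to read, here is the same program spelled out as a standalone awk script. It is only a sketch of what the one-liner does; the field numbers assume the rejoined log lines look exactly like the broadcast message quoted at the top.

8<------
#!/usr/bin/awk -f
# Same analysis as the one-liner: for each watchdog line, print the log
# timestamp, the CPU number, the reported stuck time, the projected time
# the CPU became stuck (timestamp minus stuck time), and the stuck task.
# Field numbers assume rejoined lines of the form:
#   [60140.204902] watchdog: BUG: soft lockup - CPU#2 stuck for 11919s! [ppc2:203070]
$2 == "watchdog:" {
    ts = $1                      # "[60140.204902]"
    sub(/\[/, "", ts)
    sub(/\.[0-9]+\]/, "", ts)    # -> "60140" (whole seconds)
    cpu = $7                     # "CPU#2"
    sub(/CPU#/, "", cpu)         # -> "2"
    stuck = $10                  # "11919s!"
    sub(/s!/, "", stuck)         # -> "11919"
    printf "%s %3d %6d -> %6d (%s)\n", $1, cpu, stuck, ts - stuck, $11
}
8<------

Saved as, say, lockups.awk (the name is just an example), "awk -f lockups.awk log | sort -k 2" reproduces the table above.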


The columns in that table are the timestamp from the log, the CPU number, the number of seconds that CPU is reported stuck, a projected time at which the processor became stuck, and the reported stuck process. This looks like kernel bugs are involved: if the watchdog's stuck counter tracked wall time exactly, the projected start would stay constant, but instead it slips forward by a second or two with each report. Interestingly, sorting this output on column 2 (CPU number, for reporting convenience) and on column 5 (projected lockup time) produces the same order. I have no idea whether this is a coincidence, or whether the problem starts on lower-numbered CPUs and spreads "upward" through the machine.
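
To put a number on that slip, the table can be fed back through something like this (a rough sketch; it reads the columns exactly as printed above, so the projected start is field 5 because of the "->"):

8<------
# For each CPU, print how much the projected lockup start moved since the
# previous report for that CPU.  Input: the output of the analysis above.
{
    cpu = $2; proj = $5
    if (cpu in prev)
        printf "CPU %3d: projected start drifted %+d s since the previous report\n",
               cpu, proj - prev[cpu]
    prev[cpu] = proj
}
8<------

For the data above, that prints drifts of +1 or +2 seconds per roughly 24-second report interval, on every CPU.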

Further extrapolation from this analysis could probably locate the actual lockup time for each processor, and we already know from Pierre's report that CPU#2 first locked up close to 47345 (47393 minus the reported 48 seconds). The kernel log from around that time might be very interesting, especially if there is an "oops" in there.
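
Something along these lines should pull that window out of the raw log, assuming the file still reaches back that far and keeps the bracketed [seconds.microseconds] timestamps (the file name "log" and the window edges are just examples):

8<------
# Kernel messages stamped within a few minutes of the projected CPU#2
# lockup around 47345; the timestamp is field 2 when splitting on [ and ].
awk -F'[][]' '$2 + 0 >= 47100 && $2 + 0 <= 47600' log

# Plus anything anywhere in the log that looks like an oops or a backtrace:
grep -nE 'Oops|BUG:|Call Trace' log
8<------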

Speculation (based on a bug I encountered long ago in the ext3 driver): ppc2 crashed its processor while holding some kernel resource (that this is possible at all is itself a kernel bug). Some time later a kworker deadlocked waiting for that resource to be released, and later still the other affected processes deadlocked, either on that same resource or on something held by a previously deadlocked process, while themselves holding kernel resources. As more kernel resources are caught up in deadlocks, the odds that the entire system hangs steadily increase.

The ext3 bug I mentioned would cause the process that triggered it to take a kernel oops, but locks relevant to the filesystem mount were not released, so all further accesses to the affected filesystem would hang. Since an orderly shutdown or reboot attempts to sync filesystems, it too would hang. The only way to recover control of the system was SysRq; on that machine at that time, that meant "echo b >/proc/sysrq-trigger" to force a reboot. Note that the SysRq emergency sync would /also/ hang with that bug, and the affected machine was run as a network server with no local console attached.
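
For reference, the /proc side of that, in case gcc102 wedges the same way again (this assumes magic SysRq is compiled in and not disabled by the distribution):

8<------
# 0 disables magic SysRq, 1 enables everything, larger values are a
# bitmask of allowed functions:
cat /proc/sys/kernel/sysrq
echo 1 > /proc/sys/kernel/sysrq    # enable it if necessary (as root)

# Emergency sync, then immediate reboot without syncing or unmounting.
# With a bug like the one described above the sync may never complete,
# so the "b" is the one that matters:
echo s > /proc/sysrq-trigger
echo b > /proc/sysrq-trigger
8<------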

If PID 103658 was the master sshd process and not a user's (dead) connection, CPU#54 locking up killed further SSH access to the machine. It is possible that the "failed" reboots you mentioned are simply sshd getting caught on a locked-up CPU before you are able to log in.
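
If it is worth checking next time, the listening sshd's PID can be compared against the one named in the lockup messages. The path and unit name below assume a stock Debian setup, so treat them as examples:

8<------
# PID of the master (listening) sshd:
cat /run/sshd.pid

# Or, where systemd is in charge of it:
systemctl show -p MainPID ssh.service
8<------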


-- Jacob