Zach van Rijn via cfarm-users wrote:
> On Wed, 2022-11-30 at 11:35 +0100, Pierre Muller via cfarm-users
> wrote:
> > Just got this:
> > Message from syslogd@gcc102 at Nov 30 04:31:20 ...
> >   kernel:[47393.509723] watchdog: BUG: soft lockup - CPU#2
> > stuck for 48s! [ppc2:203070]
> >
> > Can I do anything to help figuring out the problem?
>
> Not sure. Ideas are certainly welcome. Had you used this machine
> much before a few days ago? This issue had occurred maybe twice
> in the last two years.

Do you have logs farther back? Pierre Muller reported a message broadcast from syslogd at 47393.509723; at that time CPU#2 had already been stuck (nominally) for 48 seconds, so the previous minute or so of the kernel logs might be helpful.

After stitching the split log lines back together, a quick analysis (with `awk '$2 == "watchdog:" { $2 = $1; sub(/\[/,"",$2); sub(/\.[0-9]+\]/,"",$2); sub(/CPU#/, "", $7); sub(/s!/, "", $10); printf "%s %3d %6d -> %6d (%s)\n",$1,$7,$10,$2-$10,$11 }' log | sort -k 2`) yields:

8<------
[60140.204902]   2  11919 ->  48221 ([ppc2:203070])
[60164.202428]   2  11942 ->  48222 ([ppc2:203070])
[60188.199955]   2  11964 ->  48224 ([ppc2:203070])
[60212.197482]   2  11986 ->  48226 ([ppc2:203070])
[60236.195010]   2  12009 ->  48227 ([ppc2:203070])
[60260.192538]   2  12031 ->  48229 ([ppc2:203070])
[60284.190067]   2  12054 ->  48230 ([ppc2:203070])
[60160.346825]  51  11830 ->  48330 ([kworker/u512:1:103663])
[60184.344352]  51  11853 ->  48331 ([kworker/u512:1:103663])
[60208.341880]  51  11875 ->  48333 ([kworker/u512:1:103663])
[60232.339409]  51  11897 ->  48335 ([kworker/u512:1:103663])
[60256.336936]  51  11920 ->  48336 ([kworker/u512:1:103663])
[60280.334465]  51  11942 ->  48338 ([kworker/u512:1:103663])
[60148.356060]  54  11700 ->  48448 ([sshd:103658])
[60172.353588]  54  11722 ->  48450 ([sshd:103658])
[60196.351113]  54  11745 ->  48451 ([sshd:103658])
[60220.348642]  54  11767 ->  48453 ([sshd:103658])
[60244.346169]  54  11789 ->  48455 ([sshd:103658])
[60268.343699]  54  11812 ->  48456 ([sshd:103658])
[60292.341228]  54  11834 ->  48458 ([sshd:103658])
[60140.608860] 136  10914 ->  49226 ([in:imklog:2885])
[60164.606387] 136  10936 ->  49228 ([in:imklog:2885])
[60188.603913] 136  10959 ->  49229 ([in:imklog:2885])
[60212.601441] 136  10981 ->  49231 ([in:imklog:2885])
[60236.598968] 136  11003 ->  49233 ([in:imklog:2885])
[60260.596497] 136  11026 ->  49234 ([in:imklog:2885])
[60284.594026] 136  11048 ->  49236 ([in:imklog:2885])
[60152.803603] 195  10530 ->  49622 ([exim4:3249])
[60176.801129] 195  10553 ->  49623 ([exim4:3249])
[60200.798657] 195  10575 ->  49625 ([exim4:3249])
[60224.796184] 195  10597 ->  49627 ([exim4:3249])
[60248.793715] 195  10620 ->  49628 ([exim4:3249])
[60272.791241] 195  10642 ->  49630 ([exim4:3249])
[60296.788771] 195  10664 ->  49632 ([exim4:3249])
8<------
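
In case the one-liner above is hard to read, here is the same program spelled out as a standalone awk script. It is only a sketch of what the one-liner does; the field numbers assume the rejoined log lines look exactly like the broadcast message quoted at the top.

8<------
#!/usr/bin/awk -f
# Same analysis as the one-liner: for each watchdog line, print the log
# timestamp, the CPU number, the reported stuck time, the projected time
# the CPU became stuck (timestamp minus stuck time), and the stuck task.
# Field numbers assume rejoined lines of the form:
#   [60140.204902] watchdog: BUG: soft lockup - CPU#2 stuck for 11919s! [ppc2:203070]
$2 == "watchdog:" {
    ts = $1                      # "[60140.204902]"
    sub(/\[/, "", ts)
    sub(/\.[0-9]+\]/, "", ts)    # -> "60140" (whole seconds)
    cpu = $7                     # "CPU#2"
    sub(/CPU#/, "", cpu)         # -> "2"
    stuck = $10                  # "11919s!"
    sub(/s!/, "", stuck)         # -> "11919"
    printf "%s %3d %6d -> %6d (%s)\n", $1, cpu, stuck, ts - stuck, $11
}
8<------

Saved as, say, lockups.awk (the name is just an example), "awk -f lockups.awk log | sort -k 2" reproduces the table above.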


The columns in that table are the timestamp from the log, the CPU number, the number of seconds that CPU is reported stuck, a projected time at which the processor became stuck, and the reported stuck process. This looks like kernel bugs are involved: if the watchdog's stuck counter tracked wall time exactly, the projected start would stay constant, but instead it slips forward by a second or two with each report. Interestingly, sorting this output on column 2 (CPU number, for reporting convenience) and on column 5 (projected lockup time) produces the same order. I have no idea whether this is a coincidence, or whether the problem starts on lower-numbered CPUs and spreads "upward" through the machine.
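
To put a number on that slip, the table can be fed back through something like this (a rough sketch; it reads the columns exactly as printed above, so the projected start is field 5 because of the "->"):

8<------
# For each CPU, print how much the projected lockup start moved since the
# previous report for that CPU.  Input: the output of the analysis above.
{
    cpu = $2; proj = $5
    if (cpu in prev)
        printf "CPU %3d: projected start drifted %+d s since the previous report\n",
               cpu, proj - prev[cpu]
    prev[cpu] = proj
}
8<------

For the data above, that prints drifts of +1 or +2 seconds per roughly 24-second report interval, on every CPU.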

Further extrapolation from this analysis could probably locate the actual lockup time for each processor, and we already know from Pierre's report that CPU#2 first locked up close to 47345 (47393 minus the reported 48 seconds). The kernel log from around that time might be very interesting, especially if there is an "oops" in there.
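
Something along these lines should pull that window out of the raw log, assuming the file still reaches back that far and keeps the bracketed [seconds.microseconds] timestamps (the file name "log" and the window edges are just examples):

8<------
# Kernel messages stamped within a few minutes of the projected CPU#2
# lockup around 47345; the timestamp is field 2 when splitting on [ and ].
awk -F'[][]' '$2 + 0 >= 47100 && $2 + 0 <= 47600' log

# Plus anything anywhere in the log that looks like an oops or a backtrace:
grep -nE 'Oops|BUG:|Call Trace' log
8<------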

Speculation (based on a bug I encountered long ago in the ext3 driver): ppc2 crashed its processor while holding some kernel resource (that this is possible at all is itself a kernel bug). Some time later a kworker deadlocked waiting for that resource to be released, and later still the other affected processes deadlocked, either on that same resource or on something held by a previously deadlocked process, while themselves holding kernel resources. As more kernel resources are caught up in deadlocks, the odds that the entire system hangs steadily increase.

The ext3 bug I mentioned would cause the process that triggered it to take a kernel oops, but locks relevant to the filesystem mount were not released, so all further accesses to the affected filesystem would hang. Since an orderly shutdown or reboot attempts to sync filesystems, it too would hang. The only way to recover control of the system was SysRq; on that machine at that time, that meant "echo b >/proc/sysrq-trigger" to force a reboot. Note that the SysRq emergency sync would /also/ hang with that bug, and the affected machine was run as a network server with no local console attached.
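
For reference, the /proc side of that, in case gcc102 wedges the same way again (this assumes magic SysRq is compiled in and not disabled by the distribution):

8<------
# 0 disables magic SysRq, 1 enables everything, larger values are a
# bitmask of allowed functions:
cat /proc/sys/kernel/sysrq
echo 1 > /proc/sys/kernel/sysrq    # enable it if necessary (as root)

# Emergency sync, then immediate reboot without syncing or unmounting.
# With a bug like the one described above the sync may never complete,
# so the "b" is the one that matters:
echo s > /proc/sysrq-trigger
echo b > /proc/sysrq-trigger
8<------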

If PID 103658 was the master sshd process and not a user's (dead) connection, CPU#54 locking up killed further SSH access to the machine. It is possible that the "failed" reboots you mentioned are simply sshd getting caught on a locked-up CPU before you are able to log in.
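
If it is worth checking next time, the listening sshd's PID can be compared against the one named in the lockup messages. The path and unit name below assume a stock Debian setup, so treat them as examples:

8<------
# PID of the master (listening) sshd:
cat /run/sshd.pid

# Or, where systemd is in charge of it:
systemctl show -p MainPID ssh.service
8<------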


-- Jacob