At least 'ppc2' and 'fpmake' are most probably executables in
my user account, generated by my cron jobs.
Maybe it would be wise to check if the machine is stable if my cron jobs are
disabled.
I am currently unable to log in to gcc102.
If you restart the machine, please also disable the cron jobs of user muller
(myself),
and let's check if the machine is stable without my jobs.
It could be that my jobs are generating some illegal instructions...
Of course, on a stable kernel, this should never lead to instabilities of the
system itself...
If no lockup appears within a few days, we could try to re-enable
my jobs and see if this correlates with the appearance of lockups.
Pierre
On 30/11/2022 at 15:25, Zach van Rijn wrote:
On Wed, 2022-11-30 at 11:35 +0100, Pierre Muller via cfarm-users
wrote:
Just got this:
Message from syslogd@gcc102 at Nov 30 04:31:20 ...
kernel:[47393.509723] watchdog: BUG: soft lockup - CPU#2 stuck for 48s! [ppc2:203070]
Can I do anything to help figure out the problem?
Not sure. Ideas are certainly welcome. Had you used this machine
much before a few days ago? This issue had occurred maybe twice
in the last two years.
Here is what is being printed, over and over again:
[60140.204902] watchdog: BUG: soft lockup - CPU#2 stuck for 11919s! [ppc2:203070]
[60140.608860] watchdog: BUG: soft lockup - CPU#136 stuck for 10914s! [in:imklog:2885]
[60148.356060] watchdog: BUG: soft lockup - CPU#54 stuck for 11700s! [sshd:103658]
[60152.803603] watchdog: BUG: soft lockup - CPU#195 stuck for 10530s! [exim4:3249]
[60160.346825] watchdog: BUG: soft lockup - CPU#51 stuck for 11830s! [kworker/u512:1:103663]
[60164.202428] watchdog: BUG: soft lockup - CPU#2 stuck for 11942s! [ppc2:203070]
[60164.398409] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[60164.410066] rcu: 65-...0: (1 GPs behind) idle=279c/1/0x4000000000000000 softirq=26048/26049 fqs=1067360
[60164.428982] rcu: 175-...0: (6 ticks this GP) idle=31ec/1/0x4000000000000000 softirq=75868/75868 fqs=1067361
[60164.606387] watchdog: BUG: soft lockup - CPU#136 stuck for 10936s! [in:imklog:2885]
[60172.353588] watchdog: BUG: soft lockup - CPU#54 stuck for 11722s! [sshd:103658]
[60176.801129] watchdog: BUG: soft lockup - CPU#195 stuck for 10553s! [exim4:3249]
[60184.344352] watchdog: BUG: soft lockup - CPU#51 stuck for 11853s! [kworker/u512:1:103663]
[60188.199955] watchdog: BUG: soft lockup - CPU#2 stuck for 11964s! [ppc2:203070]
[60188.603913] watchdog: BUG: soft lockup - CPU#136 stuck for 10959s! [in:imklog:2885]
[60196.351113] watchdog: BUG: soft lockup - CPU#54 stuck for 11745s! [sshd:103658]
[60200.798657] watchdog: BUG: soft lockup - CPU#195 stuck for 10575s! [exim4:3249]
[60208.341880] watchdog: BUG: soft lockup - CPU#51 stuck for 11875s! [kworker/u512:1:103663]
[60212.197482] watchdog: BUG: soft lockup - CPU#2 stuck for 11986s! [ppc2:203070]
[60212.601441] watchdog: BUG: soft lockup - CPU#136 stuck for 10981s! [in:imklog:2885]
[60220.348642] watchdog: BUG: soft lockup - CPU#54 stuck for 11767s! [sshd:103658]
[60224.796184] watchdog: BUG: soft lockup - CPU#195 stuck for 10597s! [exim4:3249]
[60227.463909] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[60227.475543] rcu: 65-...0: (1 GPs behind) idle=279c/1/0x4000000000000000 softirq=26048/26049 fqs=1072610
[60227.494469] rcu: 175-...0: (6 ticks this GP) idle=31ec/1/0x4000000000000000 softirq=75868/75868 fqs=1072611
[60232.339409] watchdog: BUG: soft lockup - CPU#51 stuck for 11897s! [kworker/u512:1:103663]
[60236.195010] watchdog: BUG: soft lockup - CPU#2 stuck for 12009s! [ppc2:203070]
[60236.598968] watchdog: BUG: soft lockup - CPU#136 stuck for 11003s! [in:imklog:2885]
[60244.346169] watchdog: BUG: soft lockup - CPU#54 stuck for 11789s! [sshd:103658]
[60248.793715] watchdog: BUG: soft lockup - CPU#195 stuck for 10620s! [exim4:3249]
[60256.336936] watchdog: BUG: soft lockup - CPU#51 stuck for 11920s! [kworker/u512:1:103663]
[60260.192538] watchdog: BUG: soft lockup - CPU#2 stuck for 12031s! [ppc2:203070]
[60260.596497] watchdog: BUG: soft lockup - CPU#136 stuck for 11026s! [in:imklog:2885]
[60268.343699] watchdog: BUG: soft lockup - CPU#54 stuck for 11812s! [sshd:103658]
[60272.791241] watchdog: BUG: soft lockup - CPU#195 stuck for 10642s! [exim4:3249]
[60280.334465] watchdog: BUG: soft lockup - CPU#51 stuck for 11942s! [kworker/u512:1:103663]
[60284.190067] watchdog: BUG: soft lockup - CPU#2 stuck for 12054s! [ppc2:203070]
[60284.594026] watchdog: BUG: soft lockup - CPU#136 stuck for 11048s! [in:imklog:2885]
[60290.525418] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[60290.537054] rcu: 65-...0: (1 GPs behind) idle=279c/1/0x4000000000000000 softirq=26048/26049 fqs=1077860
[60290.555978] rcu: 175-...0: (6 ticks this GP) idle=31ec/1/0x4000000000000000 softirq=75868/75868 fqs=1077861
[60292.341228] watchdog: BUG: soft lockup - CPU#54 stuck for 11834s! [sshd:103658]
[60296.788771] watchdog: BUG: soft lockup - CPU#195 stuck for 10664s! [exim4:3249]
Serial console stopped.
0-> set /HOST send_break_action=break
Set 'send_break_action' to 'break'
0-> start /host/console
Are you sure you want to start /HOST/console (y/n)? y
Serial console started. To stop, type #.
[60395.353420] sysrq: HELP : loglevel(0-9) reboot(b) crash(c)
terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i)
thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l)
show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o)
show-registers(p) show-all-timers(q) unraw(r) sync(s)
show-task-states(t) unmount(u) show-blocked-tasks(w) global-pmu(x)
global-regs(y) dump-ftrace-buffer(z)
But I can't get it to respond to any of these. It's usually dead
too soon for me to try to get anything from 'dmesg'. That usually
happens within a few minutes of the first stall.
From a few days ago, 'fpmake' is the task that was implicated:
https://paste.debian.net/plainh/1162e193
Hoping to see if there are any kernel traces so I can report it
to the relevant kernel lists.
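
For a report to the kernel lists, the repeated watchdog lines above
could be condensed into one row per stuck CPU and task. What follows is
only a minimal sketch in Python, not something that was run on gcc102:
the names LOCKUP_RE, summarize_lockups and SAMPLE are hypothetical, and
the pattern assumes the message layout seen in this thread (other
kernel versions may word it differently). Whitespace is flattened
first, so copies of the log wrapped by a mail client still match.

```python
import re

# One soft-lockup report, as worded in the console output in this thread.
# The task name itself may contain ':' (e.g. "kworker/u512:1"), so the PID
# is taken to be the final run of digits before the closing bracket.
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+?):(?P<pid>\d+)\]"
)


def summarize_lockups(text):
    """Return {(cpu, task, pid): stats} for every soft-lockup report in text."""
    # Collapse all whitespace so that reports split across lines still match.
    flat = " ".join(text.split())
    summary = {}
    for m in LOCKUP_RE.finditer(flat):
        key = (int(m["cpu"]), m["task"], int(m["pid"]))
        ts, secs = float(m["ts"]), int(m["secs"])
        entry = summary.setdefault(
            key, {"reports": 0, "first_ts": ts, "last_ts": ts, "max_stuck_s": 0}
        )
        entry["reports"] += 1
        entry["last_ts"] = ts
        entry["max_stuck_s"] = max(entry["max_stuck_s"], secs)
    return summary


# A few lines copied from the console output above, wrapped as in the mail.
SAMPLE = """\
[60140.204902] watchdog: BUG: soft lockup - CPU#2 stuck for
11919s! [ppc2:203070]
[60160.346825] watchdog: BUG: soft lockup - CPU#51 stuck for
11830s! [kworker/u512:1:103663]
[60164.202428] watchdog: BUG: soft lockup - CPU#2 stuck for
11942s! [ppc2:203070]
"""

if __name__ == "__main__":
    for (cpu, task, pid), e in sorted(summarize_lockups(SAMPLE).items()):
        print(f"CPU#{cpu:<4} {task} (pid {pid}): {e['reports']} reports, "
              f"max stuck {e['max_stuck_s']}s")
```

Feeding it a saved copy of the full serial console log instead of
SAMPLE would give the same kind of table for all affected CPUs.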
It's also a huge waste of time trying to reset the machine since
it (a) takes forever to boot, (b) boots successfully less than half
the time, and (c) requires interaction to perform the resets.
So resetting the machine could turn into a 30-minute ordeal:
https://paste.debian.net/plainh/fb20bd17
I don't want to jump to the conclusion that it is a hardware
issue, but having other hardware to test would be helpful, should
anyone have spare memory for it or want to donate $ for research.
ZV