Hello!
I am very happy to have found this bug report, as it is possible that
the NMI watchdog mechanism has been giving me serious headaches ever
since Debian kernel 2.6.38 was released! I cannot say so for certain
yet: in my case it is an intermittent error which may take up to a week
to appear, and I disabled the NMI watchdog mechanism (by adding the
"nowatchdog" boot parameter) only a few hours ago, when I came across
this bug report.
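For anyone who wants to double-check that the parameter really made it
onto the kernel command line, here is a minimal Python sketch. It is my
own illustration, not something from the kernel sources: the function
name is made up, and it only inspects a command line string (such as the
contents of /proc/cmdline) for "nowatchdog" or "nmi_watchdog=0". It does
not query the running kernel.

```python
def watchdog_disabled_by_cmdline(cmdline: str) -> bool:
    """Return True if the command line asks the kernel to turn the watchdog off.

    Hypothetical helper for illustration: it only looks at the text of the
    command line and does not check what the running kernel actually did.
    """
    params = cmdline.split()
    # "nowatchdog" disables the lockup detectors altogether, while
    # "nmi_watchdog=0" switches off only the NMI-based one.
    return "nowatchdog" in params or "nmi_watchdog=0" in params
```

On a running system the string could be obtained with
open("/proc/cmdline").read() and passed to the function.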
A short summary of my problem:
- besides several uniprocessor systems with Debian and Ubuntu, I am
running several older multiprocessor servers (IBM Netfinity 5000
(Dual-P3), IBM Netfinity 7000 M10 (Quad-P3-Xeon) and IBM xSeries 232
(Dual P3-Tualatin)) with Debian (using testing as a "rolling release"
after a long time with lenny)
- the systems were running rock-solid up to and including the
Debian-packaged kernel 2.6.32
- when the Debian-packaged kernel 2.6.38 came out, my problem started.
It appeared mainly on the Netfinity 5000 (but, less often, also on the
other systems): after running continuously for one to eight days, the
system suddenly locked up hard. In most cases it was just idle when
this happened
- this lockup was a classic livelock, which can be diagnosed nicely on
these IBM machines because they have an activity LED for each CPU: the
LEDs glowed with identical brightness and without any modulation, so
both CPUs were alternating with each other in short cycles
- when comparing the basic system data and properties, I noticed a
difference between kernels 2.6.32 and 2.6.38: the latter showed a
continuously rising NMI count on each CPU, which was not the case with
2.6.32! Today I know where these NMIs come from: it is the watchdog
mechanism that is also causing your laptop problem
- I hoped that the problem might disappear with kernel 3.4, as there
were a few discussions on LKML about several livelocks/deadlocks related
to timers and the like (I had not noticed the config change that enabled
the "lockup detector" between 2.6.32 and 2.6.38)
- as you see on your laptop, this lockup NEVER lets any message get out
via the debugging mechanisms, not even when a serial cable is attached
and the console output is logged on a second machine
- now, with kernel 3.4.2, the problem still exists, but its consequences
have changed a bit: instead of a livelock, it is a deadlock in most
cases and activity stays on a single CPU. Sometimes the machine even
reboots instead of staying locked up
- I described the problem on a German forum, but nobody could point me
to this lockup-detector change in the kernel config, even though I
mentioned the significant change from "no NMIs" to "continuous NMIs".
Here we see again how poorly the documentation of open-source projects
is sometimes maintained... even when configuring a kernel, the config
help says that the NMI watchdog has to be enabled deliberately by a boot
parameter. In fact it seems to be activated by default as soon as SMP
code is loaded and/or an APIC is detected (but despite the presence of
an APIC, I have not seen those NMIs on my uniprocessor P3 machines yet).
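The "continuously rising NMI count" mentioned above can be watched by
comparing two snapshots of /proc/interrupts. The following Python
sketch shows one way to do that. It is only an illustration under my own
assumptions: the function names are invented, and it expects the usual
layout in which the line starting with "NMI:" carries one counter per
CPU followed by a textual description.

```python
from typing import List


def parse_nmi_counts(interrupts_text: str) -> List[int]:
    """Extract the per-CPU NMI counters from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the textual description of the line
                counts.append(int(field))
            return counts
    return []  # no NMI line found


def nmi_increase(before: str, after: str) -> List[int]:
    """Return how much each CPU's NMI counter grew between two snapshots."""
    old = parse_nmi_counts(before)
    new = parse_nmi_counts(after)
    return [n - o for o, n in zip(old, new)]
```

Reading /proc/interrupts twice, a few seconds apart, and passing both
texts to nmi_increase() should give only zeros if no NMIs arrive, and
positive numbers on every CPU if the watchdog is generating them.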
Here is a link to my description on the German "debianforum":
http://debianforum.de/forum/viewtopic.php?f=33&t=134210
I would like to report the bug to http://bugzilla.kernel.org if nobody
else has done so yet. It would therefore be great if you could drop me a
short note if you have already reported it.
Basically, I think this mechanism has bugs and/or makes wrong
assumptions on some machines and should undergo a critical review. I
wonder whether there are more people in the world who are troubled by
strange lockups of their machines which are wrongly diagnosed as
"hardware errors" etc.
Hope to hear from you soon!
Thanks and best regards,
Hans-Juergen