On 08.09.20 09:11, Alexandre DERUMIER wrote:
>>> It would really help if we can reproduce the bug somehow. Do you have an
>>> idea how to trigger the bug?
> 
> I really don't know. I'm currently trying to reproduce it on the same cluster,
> with softdog && noboot=1, and rebooting a node.
> 
> 
> Maybe it's related to the number of VMs, or the number of nodes, I don't have
> any clue ...

I checked the watchdog code a bit, both our user-space mux and the kernel
drivers, and am just noting a few things here (thinking out loud):

The /dev/watchdog itself is always active, else we could lose it to some
other program and not be able to activate HA dynamically.
But, as long as no HA service got active, it's a simple dummy "wake up every
second and do an ioctl keep-alive update".
This is really simple and efficiently written, so if that fails for over 10s
the system is really loaded, probably barely responding to anything.
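
For illustration, the idle behaviour described above could look roughly like
the sketch below. This is not the actual watchdog-mux code: the helper names
keepalive_once and idle_loop are made up for this example, and the only kernel
interface used is the standard WDIOC_KEEPALIVE ioctl from <linux/watchdog.h>.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/* Hypothetical helper: send one keep-alive ping to an already opened
 * watchdog device. Returns 0 on success, -1 on error (errno is set). */
int keepalive_once(int watchdog_fd)
{
    return ioctl(watchdog_fd, WDIOC_KEEPALIVE, 0) == -1 ? -1 : 0;
}

/* Hypothetical idle loop: ping once per second for `rounds` iterations.
 * Returns the number of pings that failed. */
int idle_loop(int watchdog_fd, int rounds)
{
    int failed = 0;

    assert(rounds >= 0);
    for (int i = 0; i < rounds; i++) {
        if (keepalive_once(watchdog_fd) == -1)
            failed++;
        sleep(1);
    }
    return failed;
}
```

Note that opening a real watchdog device normally arms it, so this should only
be tried against softdog on a test machine.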

Currently the watchdog-mux runs as a normal process, no re-nice, no real-time
scheduling. This is IMO wrong, as it is a critical process which needs to
run with high priority. I have a patch here which sets it to the highest RR
realtime-scheduling priority available, effectively the same as what corosync does.


diff --git a/src/watchdog-mux.c b/src/watchdog-mux.c
index 818ae00..71981d7 100644
--- a/src/watchdog-mux.c
+++ b/src/watchdog-mux.c
@@ -8,2 +8,3 @@
 #include <time.h>
+#include <sched.h>
 #include <sys/ioctl.h>
@@ -151,2 +177,15 @@ main(void)
 
+    int sched_priority = sched_get_priority_max (SCHED_RR);
+    if (sched_priority != -1) {
+        struct sched_param global_sched_param;
+        global_sched_param.sched_priority = sched_priority;
+        int res = sched_setscheduler (0, SCHED_RR, &global_sched_param);
+        if (res == -1) {
+            fprintf(stderr, "Could not set SCHED_RR at priority %d\n", sched_priority);
+        } else {
+            fprintf(stderr, "set SCHED_RR at priority %d\n", sched_priority);
+        }
+    }
+
+
     if ((watchdog_fd = open(WATCHDOG_DEV, O_WRONLY)) == -1) {

The scheduling change alone should already go a long way towards avoiding
watchdog resets on a massively overloaded system with no HA active.
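
To check whether such a scheduling change actually took effect, the process can
query its own policy and priority. The following is only a sketch and not part
of the patch; policy_name and report_scheduling are hypothetical helpers, while
sched_getscheduler() and sched_getparam() are the standard calls.

```c
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/* Map a scheduling policy constant to a printable name. */
const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "unknown";
    }
}

/* Hypothetical helper: print policy and priority of the calling process.
 * Returns the policy value, or -1 if it could not be queried. */
int report_scheduling(FILE *out)
{
    struct sched_param param;
    int policy;

    assert(out != NULL);
    policy = sched_getscheduler(0);
    if (policy == -1 || sched_getparam(0, &param) == -1)
        return -1;

    fprintf(out, "policy=%s priority=%d\n",
            policy_name(policy), param.sched_priority);
    return policy;
}
```

From the outside, `chrt -p <pid>` shows the same information for a running
process.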

Interesting, IMO, is that lots of nodes rebooted at the same time, with no HA
active.
This *could* come from a side effect like ceph rebalancing kicking off and
producing a load spike for >10s, hindering the scheduling of the watchdog-mux.
This is a theory, but with HA off it needs to be something like that, as in the
HA-off case there's *no* direct or indirect connection between corosync/pmxcfs
and the watchdog-mux. It simply does not care about, or notice, quorum partition
changes at all.


There may be an approach to reserve the watchdog for the mux, but avoid having it
as a "ticking time bomb":
Theoretically one could open it, then disable it with an ioctl (it can be
queried whether a driver supports that) and only enable it for real once the
first client connects to the MUX. This may not work for all watchdog modules,
and if we do this, we may want to make it configurable, as some people actually
want a reset if a (future) real-time process cannot be scheduled for >= 10
seconds.

With HA active, well, then there could be something off, either in corosync/knet
or in how we interface with it in pmxcfs. That could well be, but it won't
explain the non-HA issues.

Speaking of pmxcfs, that one also runs with standard priority. We may want to
change that to a RT scheduler too, so that it's ensured it can process all
corosync events.

I also have a few other small watchdog-mux patches around: it should nowadays
actually be able to tell us why a reset happened (can also be over/under voltage,
temperature, ...) and I'll retry the keep-alive ioctl a few times if it fails,
we can only win with that after all.
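
To make the last two points a bit more concrete, here is a sketch (not the
actual patches) of decoding the status bits a driver reports and of retrying
the keep-alive ioctl. The helper names and the message strings are made up for
this example; the WDIOF_* flags and the WDIOC_KEEPALIVE ioctl come from
<linux/watchdog.h>, and which bits are actually reported depends on the driver.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Hypothetical helper: turn WDIOF_* status bits (as returned by
 * WDIOC_GETBOOTSTATUS) into a comma separated list in buf. Returns buf. */
const char *describe_bootstatus(int flags, char *buf, size_t len)
{
    static const struct { int bit; const char *text; } reasons[] = {
        { WDIOF_CARDRESET,  "watchdog reset" },
        { WDIOF_OVERHEAT,   "overheat" },
        { WDIOF_FANFAULT,   "fan fault" },
        { WDIOF_POWERUNDER, "power under-voltage" },
        { WDIOF_POWEROVER,  "power over-voltage" },
    };

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(reasons) / sizeof(reasons[0]); i++) {
        if (!(flags & reasons[i].bit))
            continue;
        size_t used = strlen(buf);
        snprintf(buf + used, len - used, "%s%s",
                 used ? ", " : "", reasons[i].text);
    }
    if (buf[0] == '\0')
        snprintf(buf, len, "none reported");
    return buf;
}

/* Hypothetical helper: try the keep-alive ioctl up to `attempts` times.
 * Returns 0 as soon as one attempt succeeds, -1 if all of them fail. */
int keepalive_with_retry(int watchdog_fd, int attempts)
{
    for (int i = 0; i < attempts; i++) {
        if (ioctl(watchdog_fd, WDIOC_KEEPALIVE, 0) == 0)
            return 0;
    }
    return -1;
}
```

The flags would be read once at startup with
ioctl(fd, WDIOC_GETBOOTSTATUS, &flags) and logged via describe_bootstatus().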



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
