I wonder if something like pacemaker sbd could be implemented in Proxmox as an extra layer of protection?
http://manpages.ubuntu.com/manpages/bionic/man8/sbd.8.html (shared-disk heartbeat).
Something like an independent daemon (not using corosync/pmxcfs/...), also connected to the watchdog muxer. (A rough sketch of the idea follows at the end of this mail.)

----- Original Message -----
From: "Thomas Lamprecht" <t.lampre...@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "aderumier" <aderum...@odiso.com>
Sent: Thursday, September 10, 2020 20:21:14
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 10.09.20 13:34, Alexandre DERUMIER wrote:
>>> as said, if the other nodes were not using HA, the watchdog-mux had no
>>> client which could expire.
>
> sorry, maybe I explained it wrongly,
> but all my nodes had HA enabled.
>
> I double-checked the lrm_status JSON files from my morning backup, 2h before
> the problem; they were all in "active" state ("state":"active","mode":"active").
>

OK, so all had a connection to the watchdog-mux open. This shifts the
suspicion again over to pmxcfs and/or corosync.

> I don't know why node7 didn't reboot; the only difference is that it was the
> crm master.
> (I think the crm also resets the watchdog counter? Maybe its behaviour is
> different than the lrm's?)

The watchdog-mux stops updating the real watchdog as soon as any client
disconnects or times out. It does not know which client (daemon) that was.

>>> above lines also indicate very high load.
>>> Do you have some monitoring which shows the CPU/IO load before/during this
>>> event?
>
> load (1,5,15) was: 6 (for 48 cores), cpu usage: 23%
> no iowait on disk (VMs are on a remote ceph; only proxmox services are
> running on the local ssd disk)
>
> so nothing strange here :/

Hmm, the long loop times could then be the effect of a pmxcfs read or write
operation being (temporarily) stuck.
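
To make the sbd idea from the top of this mail a bit more concrete, here is a
very rough sketch of such an independent shared-disk heartbeat daemon. It is
purely hypothetical: the device path, slot layout, timeouts and the choice of
feeding /dev/watchdog directly (instead of the watchdog-mux socket) are all
made-up assumptions, not an existing Proxmox interface.

/*
 * Hypothetical sketch only -- not an existing Proxmox daemon.
 * Idea: an independent process (no corosync/pmxcfs dependency) that
 *   1. writes a heartbeat timestamp into this node's slot on a shared disk,
 *   2. keeps petting a watchdog endpoint only while those writes succeed.
 * Paths, slot layout and timeouts below are made-up assumptions.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define HB_DEVICE   "/dev/disk/by-id/shared-sbd-disk"  /* assumed shared LUN */
#define WD_DEVICE   "/dev/watchdog"                    /* or the watchdog-mux */
#define NODE_SLOT   3                                  /* this node's slot index */
#define SLOT_SIZE   512                                /* one sector per node */
#define INTERVAL_S  5

int main(void)
{
    int hb = open(HB_DEVICE, O_RDWR | O_DSYNC);
    int wd = open(WD_DEVICE, O_WRONLY);
    if (hb < 0 || wd < 0) {
        perror("open");
        return 1;
    }

    char slot[SLOT_SIZE];

    for (;;) {
        /* write our heartbeat (a timestamp) into our own slot */
        memset(slot, 0, sizeof(slot));
        snprintf(slot, sizeof(slot), "node-slot=%d ts=%lld",
                 NODE_SLOT, (long long)time(NULL));

        ssize_t w = pwrite(hb, slot, sizeof(slot),
                           (off_t)NODE_SLOT * SLOT_SIZE);

        if (w == (ssize_t)sizeof(slot)) {
            /* shared-disk write succeeded: keep the watchdog happy */
            if (write(wd, "\0", 1) != 1)
                perror("watchdog write");
        } else {
            /* stop petting the watchdog; the node will self-fence */
            fprintf(stderr, "heartbeat write failed, fencing expected\n");
        }

        sleep(INTERVAL_S);
    }
}

The real sbd does quite a bit more (it also watches its own slot for "poison
pill" fence messages and has carefully tuned timeouts), but the core loop is
just: heartbeat to the shared disk, and only feed the watchdog while that
still works.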