I wonder if something like pacemaker sbd could be implemented in Proxmox as an extra layer of protection?
http://manpages.ubuntu.com/manpages/bionic/man8/sbd.8.html (shared-disk heartbeat).
Something like an independent daemon (not using corosync/pmxcfs/...), also connected to the watchdog muxer. (A rough sketch of the idea follows at the end of this mail.)

----- Original Message -----
From: "Thomas Lamprecht" <t.lampre...@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "aderumier" <aderum...@odiso.com>
Sent: Thursday, September 10, 2020 20:21:14
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 10.09.20 13:34, Alexandre DERUMIER wrote:
>>> as said, if the other nodes were not using HA, the watchdog-mux had no
>>> client which could expire.
>
> sorry, maybe I explained it wrongly,
> but all my nodes had HA enabled.
>
> I double-checked the lrm_status JSON files from my morning backup, 2h before
> the problem; they were all in "active" state ("state":"active","mode":"active").
>

OK, so all had a connection to the watchdog-mux open. This shifts the
suspicion again over to pmxcfs and/or corosync.

> I don't know why node7 didn't reboot; the only difference is that it was the
> crm master.
> (I think the crm also resets the watchdog counter? Maybe its behaviour is
> different than the lrm's?)

The watchdog-mux stops updating the real watchdog as soon as any client
disconnects or times out. It does not know which client (daemon) that was.

>>> above lines also indicate very high load.
>>> Do you have some monitoring which shows the CPU/IO load before/during this
>>> event?
>
> load (1,5,15) was: 6 (for 48 cores), cpu usage: 23%
> no iowait on disk (VMs are on a remote ceph; only proxmox services are
> running on the local ssd disk)
>
> so nothing strange here :/

Hmm, the long loop times could then be the effect of a pmxcfs read or write
operation being (temporarily) stuck.
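
To make the sbd idea from the top of this mail a bit more concrete, here is a
very rough sketch of such an independent shared-disk heartbeat daemon. It is
purely hypothetical: the device path, slot layout, timeouts and the choice of
feeding /dev/watchdog directly (instead of the watchdog-mux socket) are all
made-up assumptions, not an existing Proxmox interface.

/*
 * Hypothetical sketch only -- not an existing Proxmox daemon.
 * Idea: an independent process (no corosync/pmxcfs dependency) that
 *   1. writes a heartbeat timestamp into this node's slot on a shared disk,
 *   2. keeps petting a watchdog endpoint only while those writes succeed.
 * Paths, slot layout and timeouts below are made-up assumptions.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define HB_DEVICE   "/dev/disk/by-id/shared-sbd-disk"  /* assumed shared LUN */
#define WD_DEVICE   "/dev/watchdog"                    /* or the watchdog-mux */
#define NODE_SLOT   3                                  /* this node's slot index */
#define SLOT_SIZE   512                                /* one sector per node */
#define INTERVAL_S  5

int main(void)
{
    int hb = open(HB_DEVICE, O_RDWR | O_DSYNC);
    int wd = open(WD_DEVICE, O_WRONLY);
    if (hb < 0 || wd < 0) {
        perror("open");
        return 1;
    }

    char slot[SLOT_SIZE];

    for (;;) {
        /* write our heartbeat (a timestamp) into our own slot */
        memset(slot, 0, sizeof(slot));
        snprintf(slot, sizeof(slot), "node-slot=%d ts=%lld",
                 NODE_SLOT, (long long)time(NULL));

        ssize_t w = pwrite(hb, slot, sizeof(slot),
                           (off_t)NODE_SLOT * SLOT_SIZE);

        if (w == (ssize_t)sizeof(slot)) {
            /* shared-disk write succeeded: keep the watchdog happy */
            if (write(wd, "\0", 1) != 1)
                perror("watchdog write");
        } else {
            /* stop petting the watchdog; the node will self-fence */
            fprintf(stderr, "heartbeat write failed, fencing expected\n");
        }

        sleep(INTERVAL_S);
    }
}

The real sbd does quite a bit more (it also watches its own slot for "poison
pill" fence messages and has carefully tuned timeouts), but the core loop is
just: heartbeat to the shared disk, and only feed the watchdog while that
still works.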