Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Alexandre DERUMIER Mon, 21 Sep 2020 22:44:41 -0700

I have run a test with "kill -9 <pidofcorosync>": the other nodes hang for
around 20s, but after that they are available again.



So it really is something that happens while corosync is in its shutdown phase
and pmxcfs is still running.

So, for now, as a workaround, I have changed

/lib/systemd/system/pve-cluster.service

#Wants=corosync.service
#Before=corosync.service
Requires=corosync.service
After=corosync.service


This way, pve-cluster is stopped before corosync at shutdown, and if I
restart corosync, pve-cluster is stopped first.
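
A minimal sketch of applying the same change to a copy of the unit under
/etc/systemd/system, which systemd prefers over /lib/systemd/system, so that a
package upgrade does not overwrite it. The helper name swap_corosync_ordering
is hypothetical, and the sketch assumes the two directives sit on their own
lines exactly as quoted above; it replaces them instead of commenting them out.

```shell
#!/bin/bash
# Sketch: rewrite a unit file read on stdin so that the unit requires
# corosync.service and is ordered after it instead of before it.
# swap_corosync_ordering is a hypothetical helper name.
swap_corosync_ordering() {
    sed -e 's/^Wants=corosync\.service$/Requires=corosync.service/' \
        -e 's/^Before=corosync\.service$/After=corosync.service/'
}

# Assumed usage (not run here):
#   swap_corosync_ordering < /lib/systemd/system/pve-cluster.service \
#       > /etc/systemd/system/pve-cluster.service
#   systemctl daemon-reload
```

All other lines of the unit pass through unchanged, so the result can be
compared against the packaged file with diff before reloading systemd.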




----- Original Message -----
From: "aderumier" <aderum...@odiso.com>
To: "Thomas Lamprecht" <t.lampre...@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Monday, 21 September 2020 01:54:59
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Hi, 

I have done a new test, this time with "systemctl stop corosync", wait 15s,
"systemctl start corosync", wait 15s.

I was able to reproduce it when stopping corosync on node1: 1 second later,
/etc/pve was locked on all other nodes.


I started corosync on node1 again about 10 minutes later, and /etc/pve became
writable again on all nodes.
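
A minimal sketch of how the "locked" state could be detected on a node without
hanging the shell that runs the check: attempt a small write with a time limit.
The helper name probe_write and the 5 second default are assumptions for
illustration; whether a write stuck in pmxcfs can always be killed this way was
not tested in this thread.

```shell
#!/bin/bash
# Sketch: report whether a small write to a directory completes in time.
# probe_write and its arguments are hypothetical names for illustration.
probe_write() {
    local dir=$1 limit=${2:-5}
    # timeout(1) kills the writer if it is still blocked after $limit seconds.
    if timeout -s KILL "$limit" sh -c 'echo test > "$1/test"' sh "$dir" 2>/dev/null; then
        echo "writable"
    else
        echo "blocked or not writable"
    fi
}

# Assumed usage on a cluster node:
#   probe_write /etc/pve 5
```

The write runs in a child shell, so a blocked write only stalls that child
until the limit expires.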



node1: corosync stop: 01:26:50
node2: /etc/pve locked: 01:26:51

http://odisoweb1.odiso.net/corosync-stop.log 


pmxcfs: full backtrace ("bt full") of all threads:

https://gist.github.com/aderumier/c45af4ee73b80330367e416af858bc65 

pmxcfs coredump: http://odisoweb1.odiso.net/core.17995.gz


node1: corosync start: 01:35:36
http://odisoweb1.odiso.net/corosync-start.log 





BTW, a user following this mailing thread contacted me by private message on
the forum; he recently had exactly the same problem on a 7-node cluster
(after shutting down 1 node, /etc/pve was locked until the node was restarted).



----- Original Message -----
From: "Thomas Lamprecht" <t.lampre...@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>,
"aderumier" <aderum...@odiso.com>
Sent: Thursday, 17 September 2020 13:35:55
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/17/20 12:02 PM, Alexandre DERUMIER wrote: 
> if needed, here my test script to reproduce it 

Thanks, I'm now using this specific one. I had a similar test (but with all
nodes writing) running here for about two hours without luck so far; let's see
how this one behaves.

> 
> node1 (restart corosync until node2 stops sending the timestamp)
> ----- 
> 
> #!/bin/bash
> # Restart corosync repeatedly; stop once node2's timestamp goes stale.
> for i in $(seq 10000); do
>     now=$(date +"%T")
>     echo "restart corosync : $now"
>     systemctl restart corosync
>     # Watch the timestamp for up to 59 seconds before the next restart.
>     for j in {1..59}; do
>         last=$(cat /tmp/timestamp)
>         curr=$(date '+%s')
>         diff=$((curr - last))
>         if [ "$diff" -gt 20 ]; then
>             echo "too old"
>             exit 0
>         fi
>         sleep 1
>     done
> done
> 
> 
> 
> node2 (write to /etc/pve/test each second, then send the last timestamp to 
> node1) 
> ----- 
> #!/bin/bash
> # Push the current epoch time to node1, then write to /etc/pve, once per second.
> for i in {1..10000}; do
>     now=$(date +"%T")
>     echo "Current time : $now"
>     curr=$(date '+%s')
>     ssh root@node1 "echo $curr > /tmp/timestamp"
>     echo "test" > /etc/pve/test
>     sleep 1
> done
> 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 

