I ran a test with "kill -9 <pidofcorosync>": the other nodes hung for around 20s, then became available again.
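For anyone who wants to put a number on a hang like this, here is a minimal sketch. It assumes that a write to a file under /etc/pve blocks while pmxcfs is hung (rather than failing immediately). The helper names time_write and is_hang, and the 5s default threshold, are hypothetical and not taken from this thread.

```shell
#!/bin/bash

# Hypothetical helper: perform one write to the given file and print how many
# whole seconds the write took. On a cluster node the target would be a file
# under /etc/pve (for example /etc/pve/test); a blocked write shows up as a
# large elapsed value.
time_write() {
    local target=$1
    local start end
    start=$(date +%s)
    echo "test" > "$target"
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical helper: succeed (exit status 0) if the elapsed time is above
# the threshold. The default threshold of 5s is an assumption.
is_hang() {
    local elapsed=$1
    local threshold=${2:-5}
    [ "$elapsed" -gt "$threshold" ]
}
```

On a test node this could be used as: elapsed=$(time_write /etc/pve/test); is_hang "$elapsed" && echo "write blocked for ${elapsed}s". The resolution is one second, which should be enough for hangs in the 20s range.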
So the problem really seems to occur when corosync is in its shutdown phase while pmxcfs is still running.

So for now, as a workaround, I have changed

/lib/systemd/system/pve-cluster.service

#Wants=corosync.service
#Before=corosync.service
Requires=corosync.service
After=corosync.service

This way, at shutdown, pve-cluster is stopped before corosync, and if I restart corosync, pve-cluster is stopped first.

----- Original message -----
From: "aderumier" <aderum...@odiso.com>
To: "Thomas Lamprecht" <t.lampre...@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Monday, 21 September 2020 01:54:59
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Hi,

I have done a new test, this time with "systemctl stop corosync", wait 15s, "systemctl start corosync", wait 15s.

I was able to reproduce it at corosync stop on node1; 1 second later, /etc/pve was locked on all other nodes.

I started corosync 10 min later on node1, and /etc/pve became writable again on all nodes.

node1: corosync stop: 01:26:50
node2: /etc/pve locked: 01:26:51

http://odisoweb1.odiso.net/corosync-stop.log

pmxcfs: bt full all threads:
https://gist.github.com/aderumier/c45af4ee73b80330367e416af858bc65

pmxcfs: coredump: http://odisoweb1.odiso.net/core.17995.gz

node1: corosync start: 01:35:36

http://odisoweb1.odiso.net/corosync-start.log

BTW, I have been contacted by PM on the forum by a user following this mailing thread, and he recently had exactly the same problem with a 7-node cluster.
(Shutting down 1 node, /etc/pve was locked until the node was restarted.)

----- Original message -----
From: "Thomas Lamprecht" <t.lampre...@proxmox.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>, "aderumier" <aderum...@odiso.com>
Sent: Thursday, 17 September 2020 13:35:55
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/17/20 12:02 PM, Alexandre DERUMIER wrote:
> if needed, here is my test script to reproduce it

Thanks, I'm now using this specific one. I had a similar one (but with all nodes writing) running here for ~two hours without luck so far; let's see how this one behaves.

>
> node1 (restart corosync until node2 stops sending the timestamp)
> -----
>
> #!/bin/bash
>
> for i in `seq 10000`; do
>     now=$(date +"%T")
>     echo "restart corosync : $now"
>     systemctl restart corosync
>     for j in {1..59}; do
>         last=$(cat /tmp/timestamp)
>         curr=`date '+%s'`
>         diff=$(($curr - $last))
>         if [ $diff -gt 20 ]; then
>             echo "too old"
>             exit 0
>         fi
>         sleep 1
>     done
> done
>
>
>
> node2 (write to /etc/pve/test each second, then send the last timestamp to node1)
> -----
> #!/bin/bash
> for i in {1..10000};
> do
>     now=$(date +"%T")
>     echo "Current time : $now"
>     curr=`date '+%s'`
>     ssh root@node1 "echo $curr > /tmp/timestamp"
>     echo "test" > /etc/pve/test
>     sleep 1
> done
>

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel