Looking at these logs:

Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied

in PVE/HA/Env/PVE2.pm:

"
    my $ctime = time();
    my $last_lock_time = $last->{lock_time} // 0;
    my $last_got_lock = $last->{got_lock};

    my $retry_timeout = 120; # hardcoded lock lifetime limit from pmxcfs

    eval {

        mkdir $lockdir;

        # pve cluster filesystem not online
        die "can't create '$lockdir' (pmxcfs not mounted?)\n" if ! -d $lockdir;

        if (($ctime - $last_lock_time) < $retry_timeout) {
            # try cfs lock update request (utime)
            if (utime(0, $ctime, $filename)) {
                $got_lock = 1;
                return;
            }
            die "cfs lock update failed - $!\n";
        }
"

If retry_timeout is 120, could that explain why I don't have any logs on the other nodes, given that the watchdog triggers after 60s?

I don't know much about how locks work in pmxcfs, but when a corosync member leaves or joins and a new cluster membership is formed, could some locks be lost or hang?

----- Original message -----
From: "aderumier" <aderum...@odiso.com>
To: "dietmar" <diet...@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Monday 7 September 2020 11:32:13
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111
>>
>>No HA involved...

I had already helped this user a few weeks ago:

https://forum.proxmox.com/threads/proxmox-6-2-4-cluster-die-node-auto-reboot-need-help.74643/#post-333093

HA was active at that time. (Maybe the watchdog was still running. I'm not sure: if you disable HA on all VMs, does the LRM disable the watchdog?)

----- Original message -----
From: "dietmar" <diet...@proxmox.com>
To: "aderumier" <aderum...@odiso.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Monday 7 September 2020 10:18:42
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

There is a similar report in the forum:

https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111

No HA involved...

> On 09/07/2020 9:19 AM Alexandre DERUMIER <aderum...@odiso.com> wrote:
>
> >>Indeed, this should not happen. Do you use a separate network for corosync?
>
> No, I use a 2x40Gb LACP link.
>
> >>was there high traffic on the network?
>
> But I'm far from saturating it (in pps or throughput); I'm at around 3-4 Gbps.
>
> The cluster is 14 nodes, with around 1000 VMs (with HA enabled on all VMs).
>
> From my understanding, watchdog-mux was still running, as the watchdog
> reset only after 1 min and not 10s,
> so it looks like the LRM was blocked and not sending watchdog timer resets to
> watchdog-mux.
>
> I'll do tests with softdog + soft_noboot=1, so if that happens again, I'll be
> able to debug.
>
> >>What kind of maintenance was the reason for the shutdown?
>
> RAM upgrade. (The server was running OK before the shutdown, no hardware problem.)
> (I had just shut down the server and had not started it again yet when the
> problem occurred.)
>
> >>Do you use the default corosync timeout values, or do you have a special
> >>setup?
>
> No special tuning, default values. (I haven't had any retransmits in the logs
> for months.)
>
> >>Can you please post the full corosync config?
>
> (I have verified: the running version of corosync was 3.0.3, with libknet
> 1.15.)
>
> Here is the config:
>
> "
> logging {
>   debug: off
>   to_syslog: yes
> }
>
> nodelist {
>   node {
>     name: m6kvm1
>     nodeid: 1
>     quorum_votes: 1
>     ring0_addr: m6kvm1
>   }
>   node {
>     name: m6kvm10
>     nodeid: 10
>     quorum_votes: 1
>     ring0_addr: m6kvm10
>   }
>   node {
>     name: m6kvm11
>     nodeid: 11
>     quorum_votes: 1
>     ring0_addr: m6kvm11
>   }
>   node {
>     name: m6kvm12
>     nodeid: 12
>     quorum_votes: 1
>     ring0_addr: m6kvm12
>   }
>   node {
>     name: m6kvm13
>     nodeid: 13
>     quorum_votes: 1
>     ring0_addr: m6kvm13
>   }
>   node {
>     name: m6kvm14
>     nodeid: 14
>     quorum_votes: 1
>     ring0_addr: m6kvm14
>   }
>   node {
>     name: m6kvm2
>     nodeid: 2
>     quorum_votes: 1
>     ring0_addr: m6kvm2
>   }
>   node {
>     name: m6kvm3
>     nodeid: 3
>     quorum_votes: 1
>     ring0_addr: m6kvm3
>   }
>   node {
>     name: m6kvm4
>     nodeid: 4
>     quorum_votes: 1
>     ring0_addr: m6kvm4
>   }
>   node {
>     name: m6kvm5
>     nodeid: 5
>     quorum_votes: 1
>     ring0_addr: m6kvm5
>   }
>   node {
>     name: m6kvm6
>     nodeid: 6
>     quorum_votes: 1
>     ring0_addr: m6kvm6
>   }
>   node {
>     name: m6kvm7
>     nodeid: 7
>     quorum_votes: 1
>     ring0_addr: m6kvm7
>   }
>
>   node {
>     name: m6kvm8
>     nodeid: 8
>     quorum_votes: 1
>     ring0_addr: m6kvm8
>   }
>   node {
>     name: m6kvm9
>     nodeid: 9
>     quorum_votes: 1
>     ring0_addr: m6kvm9
>   }
> }
>
> quorum {
>   provider: corosync_votequorum
> }
>
> totem {
>   cluster_name: m6kvm
>   config_version: 19
>   interface {
>     bindnetaddr: 10.3.94.89
>     ringnumber: 0
>   }
>   ip_version: ipv4
>   secauth: on
>   transport: knet
>   version: 2
> }
>
> ----- Original message -----
> From: "dietmar" <diet...@proxmox.com>
> To: "aderumier" <aderum...@odiso.com>, "Proxmox VE development discussion"
> <pve-devel@lists.proxmox.com>
> Cc: "pve-devel" <pve-de...@pve.proxmox.com>
> Sent: Sunday 6 September 2020 14:14:06
> Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean
> shutdown
>
> > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds)
> > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds)
>
> Indeed, this should not happen. Do you use a separate network for corosync? Or
> was there high traffic on the network? What kind of maintenance was the reason
> for the shutdown?

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
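
To make the timing question in this thread concrete, here is a minimal sketch of the arithmetic only. It assumes the two values quoted in the thread (the 120 s `$retry_timeout` lock lifetime and a 60 s watchdog timeout) and simply lists which of them would elapse while a daemon loop is stalled. The function name and labels are hypothetical, and this does not model pmxcfs, corosync, or watchdog-mux in any way.

```python
# Illustrative timeline only. The constants are the values quoted in the
# thread; the names below are hypothetical and not part of any Proxmox API.
LOCK_LIFETIME_S = 120     # $retry_timeout in PVE/HA/Env/PVE2.pm
WATCHDOG_TIMEOUT_S = 60   # watchdog timeout mentioned in the thread


def events_during_stall(stall_s):
    """Return, in chronological order, the timeouts that elapse during a stall."""
    timeouts = [
        (WATCHDOG_TIMEOUT_S, "watchdog expires, node is reset"),
        (LOCK_LIFETIME_S, "lock lifetime ends, other nodes may take the lock"),
    ]
    return [label for limit, label in sorted(timeouts) if stall_s >= limit]


if __name__ == "__main__":
    # 87 is the LRM loop duration reported in the logs above.
    for stall in (30, 87, 130):
        print(stall, events_during_stall(stall))
```

Under these assumptions, an 87 s stall crosses the watchdog limit but not the lock lifetime, which would be consistent with the node being reset before the other nodes log anything about the lock. It does not confirm that this is what happened.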