Looking at these logs:

Sep  3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Sep  3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied

In PVE/HA/Env/PVE2.pm:

    my $ctime = time();
    my $last_lock_time = $last->{lock_time} // 0;
    my $last_got_lock = $last->{got_lock};

    my $retry_timeout = 120; # hardcoded lock lifetime limit from pmxcfs

    eval {

        mkdir $lockdir;

        # pve cluster filesystem not online
        die "can't create '$lockdir' (pmxcfs not mounted?)\n" if ! -d $lockdir;

        if (($ctime - $last_lock_time) < $retry_timeout) {
            # try cfs lock update request (utime)
            if (utime(0, $ctime, $filename))  {
                $got_lock = 1;
                return;
            }
            die "cfs lock update failed - $!\n";
        }
        [...]

If retry_timeout is 120, could that explain why I don't see any logs on the 
other nodes, given that the watchdog triggers after 60s?
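
To make the timing question concrete, here is a minimal sketch of the branch in the excerpt above. It is not the actual PVE code: lock_outcome is a hypothetical helper, and the utime() call on the cfs lock file is replaced by a plain boolean so the branch can be exercised without pmxcfs. The 87s value is the LRM loop time from the "loop take too long" log line quoted further down; that the lock was last renewed at the start of that loop is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same constant as in the PVE2.pm excerpt.
my $retry_timeout = 120;

# Hypothetical helper (not part of PVE): models only the first branch of
# the excerpt. $utime_ok stands in for the result of utime() on the lock file.
sub lock_outcome {
    my ($ctime, $last_lock_time, $utime_ok) = @_;

    if (($ctime - $last_lock_time) < $retry_timeout) {
        # Still inside the 120s window: only a lock update is attempted.
        return $utime_ok ? 'lock renewed' : 'cfs lock update failed';
    }

    # Window expired: the real code continues past the excerpt (not modelled).
    return 'window expired';
}

# Assumption: lock last renewed at t=0, loop returns at t=87.
print lock_outcome(87, 0, 0), "\n";     # update attempted, utime fails
print lock_outcome(87, 0, 1), "\n";     # update attempted, utime succeeds
print lock_outcome(130, 0, 0), "\n";    # past the 120s window
```

Under these assumptions, a loop that blocks for 87s is still inside the 120s window, so the node only tries to renew its lock, and a failing utime() gives exactly the "cfs lock update failed" message from the logs. 87s is also already past the 60s watchdog timeout mentioned above.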

I don't know much about how locks work in pmxcfs, but when a corosync 
member leaves or joins and a new cluster membership is formed, 
could some locks be lost or hang?

----- Original Message -----
From: "aderumier" <aderum...@odiso.com>
To: "dietmar" <diet...@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Monday, 7 September 2020 11:32:13
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>No HA involved... 

I already helped this user a few weeks ago 


HA was active at that time. (Maybe the watchdog was still running; I'm not 
sure whether the LRM disables the watchdog when you remove HA from all VMs.) 

----- Original Message ----- 
From: "dietmar" <diet...@proxmox.com> 
To: "aderumier" <aderum...@odiso.com> 
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Sent: Monday, 7 September 2020 10:18:42 
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

There is a similar report in the forum: 


No HA involved... 

> On 09/07/2020 9:19 AM Alexandre DERUMIER <aderum...@odiso.com> wrote: 
> >>Indeed, this should not happen. Do you use a separate network for corosync? 
> No, I use a 2x40GB LACP link. 
> >>was there high traffic on the network? 
> I'm far from saturating the links (in pps or throughput); I'm at around 3-4 Gbps. 
> The cluster has 14 nodes, with around 1000 VMs (with HA enabled on all VMs). 
> From my understanding, watchdog-mux was still running, as the watchdog 
> reset only after 1 min and not 10s, 
> so it looks like the LRM was blocked and not sending watchdog timer resets to 
> watchdog-mux. 
> I'll do tests with softdog + soft_noboot=1, so if that happens again, I'll be 
> able to debug. 
> >>What kind of maintenance was the reason for the shutdown? 
> RAM upgrade. (The server was running fine before the shutdown, no hardware problem.) 
> (I had just shut down the server and had not started it again yet when the 
> problem occurred.) 
> >>Do you use the default corosync timeout values, or do you have a special 
> >>setup? 
> No special tuning, default values. (I haven't had any retransmits in the logs 
> for months.) 
> >>Can you please post the full corosync config? 
> (I have verified that the running version was corosync 3.0.3 with libknet 
> 1.15.) 
> Here is the config: 
> "
> logging {
>   debug: off
>   to_syslog: yes
> }
> nodelist {
>   node {
>     name: m6kvm1
>     nodeid: 1
>     quorum_votes: 1
>     ring0_addr: m6kvm1
>   }
>   node {
>     name: m6kvm10
>     nodeid: 10
>     quorum_votes: 1
>     ring0_addr: m6kvm10
>   }
>   node {
>     name: m6kvm11
>     nodeid: 11
>     quorum_votes: 1
>     ring0_addr: m6kvm11
>   }
>   node {
>     name: m6kvm12
>     nodeid: 12
>     quorum_votes: 1
>     ring0_addr: m6kvm12
>   }
>   node {
>     name: m6kvm13
>     nodeid: 13
>     quorum_votes: 1
>     ring0_addr: m6kvm13
>   }
>   node {
>     name: m6kvm14
>     nodeid: 14
>     quorum_votes: 1
>     ring0_addr: m6kvm14
>   }
>   node {
>     name: m6kvm2
>     nodeid: 2
>     quorum_votes: 1
>     ring0_addr: m6kvm2
>   }
>   node {
>     name: m6kvm3
>     nodeid: 3
>     quorum_votes: 1
>     ring0_addr: m6kvm3
>   }
>   node {
>     name: m6kvm4
>     nodeid: 4
>     quorum_votes: 1
>     ring0_addr: m6kvm4
>   }
>   node {
>     name: m6kvm5
>     nodeid: 5
>     quorum_votes: 1
>     ring0_addr: m6kvm5
>   }
>   node {
>     name: m6kvm6
>     nodeid: 6
>     quorum_votes: 1
>     ring0_addr: m6kvm6
>   }
>   node {
>     name: m6kvm7
>     nodeid: 7
>     quorum_votes: 1
>     ring0_addr: m6kvm7
>   }
>   node {
>     name: m6kvm8
>     nodeid: 8
>     quorum_votes: 1
>     ring0_addr: m6kvm8
>   }
>   node {
>     name: m6kvm9
>     nodeid: 9
>     quorum_votes: 1
>     ring0_addr: m6kvm9
>   }
> }
> quorum {
>   provider: corosync_votequorum
> }
> totem {
>   cluster_name: m6kvm
>   config_version: 19
>   interface {
>     bindnetaddr:
>     ringnumber: 0
>   }
>   ip_version: ipv4
>   secauth: on
>   transport: knet
>   version: 2
> }
> "
> ----- Original Message ----- 
> From: "dietmar" <diet...@proxmox.com> 
> To: "aderumier" <aderum...@odiso.com>, "Proxmox VE development discussion" 
> <pve-devel@lists.proxmox.com> 
> Cc: "pve-devel" <pve-de...@pve.proxmox.com> 
> Sent: Sunday, 6 September 2020 14:14:06 
> Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean 
> shutdown 
> > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) 
> > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) 
> Indeed, this should not happen. Do you use a separate network for corosync? 
> Or was there high traffic on the network? What kind of maintenance was the 
> reason for the shutdown? 

pve-devel mailing list 