Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Alexandre DERUMIER Mon, 07 Sep 2020 06:24:11 -0700

Looking at these logs:

Sep  3 10:40:51 m6kvm7 pve-ha-crm[16196]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Sep  3 10:40:51 m6kvm7 pve-ha-lrm[16140]: lost lock 'ha_agent_m6kvm7_lock - cfs lock update failed - Permission denied


in PVE/HA/Env/PVE2.pm
"
    my $ctime = time();
    my $last_lock_time = $last->{lock_time} // 0;
    my $last_got_lock = $last->{got_lock};

    my $retry_timeout = 120; # hardcoded lock lifetime limit from pmxcfs

    eval {

        mkdir $lockdir;

        # pve cluster filesystem not online
        die "can't create '$lockdir' (pmxcfs not mounted?)\n" if ! -d $lockdir;

        if (($ctime - $last_lock_time) < $retry_timeout) {
            # try cfs lock update request (utime)
            if (utime(0, $ctime, $filename))  {
                $got_lock = 1;
                return;
            }
            die "cfs lock update failed - $!\n";
        }
"


If retry_timeout is 120, could that explain why I don't have any logs on the
other nodes, if the watchdog triggers after 60s?
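
To make the question concrete, here is a rough sketch in Python (plain
arithmetic for illustration, not code from PVE). It assumes the lock was
refreshed just before the loop stalled, takes the 120s value from the snippet
above and the 60s watchdog value from my observation, and uses the 87s/92s
loop times reported in the logs quoted further down. The function name
classify_stall is hypothetical.

```python
RETRY_TIMEOUT = 120    # seconds, $retry_timeout in the quoted PVE2.pm snippet
WATCHDOG_TIMEOUT = 60  # seconds, as observed in this thread (assumption)


def classify_stall(stall_seconds,
                   retry_timeout=RETRY_TIMEOUT,
                   watchdog_timeout=WATCHDOG_TIMEOUT):
    """Report which timers a stalled HA loop would have outlived.

    Assumes the lock was last updated right before the stall began,
    so (ctime - last_lock_time) is roughly equal to stall_seconds.
    """
    return {
        # The watchdog fires once nothing has reset it for watchdog_timeout.
        "watchdog_expired": stall_seconds >= watchdog_timeout,
        # Same comparison as the snippet: still inside the window in which
        # a plain lock update (utime) is attempted.
        "within_lock_retry_window": stall_seconds < retry_timeout,
    }


if __name__ == "__main__":
    # Loop durations from the "loop take too long" log lines.
    for daemon, stall in (("pve-ha-lrm", 87), ("pve-ha-crm", 92)):
        print(daemon, stall, classify_stall(stall))
```

If these assumptions hold, an 87s or 92s stall is long enough for the watchdog
to expire while still being inside the 120s window, so a node could be reset
before it ever reaches the code path that would log a lock timeout.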

I don't know much about how locks work in pmxcfs, but when a corosync
member leaves or joins and a new cluster membership is formed,
could some locks be lost or hang?



----- Original Message -----
From: "aderumier" <aderum...@odiso.com>
To: "dietmar" <diet...@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Monday, 7 September 2020 11:32:13
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

>>https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111
>> 
>> 
>>No HA involved... 

I already helped this user a few weeks ago:

https://forum.proxmox.com/threads/proxmox-6-2-4-cluster-die-node-auto-reboot-need-help.74643/#post-333093
 

HA was active at that time. (Maybe the watchdog was still running; I'm not
sure whether the LRM disables the watchdog if you disable HA on all VMs.)


----- Original Message -----
From: "dietmar" <diet...@proxmox.com>
To: "aderumier" <aderum...@odiso.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Monday, 7 September 2020 10:18:42
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

There is a similar report in the forum: 

https://forum.proxmox.com/threads/cluster-die-after-adding-the-39th-node-proxmox-is-not-stable.75506/#post-336111
 

No HA involved... 


> On 09/07/2020 9:19 AM Alexandre DERUMIER <aderum...@odiso.com> wrote: 
> 
> 
> >>Indeed, this should not happen. Do you use a separate network for corosync?
> 
> No, I use a 2x40Gb LACP link.
> 
> >>was there high traffic on the network? 
> 
> but I'm far from saturating them (in pps or throughput); I'm around 3-4 Gbps.
> 
> 
> The cluster has 14 nodes, with around 1000 VMs (with HA enabled on all VMs).
> 
> 
> From my understanding, watchdog-mux was still running, as the watchdog
> reset the node only after 1 min and not 10s,
> so it looks like the LRM was blocked and not sending watchdog timer resets
> to watchdog-mux.
> 
> 
> I'll do tests with softdog + soft_noboot=1, so if that happens again, I'll
> be able to debug.
> 
> 
> 
> >>What kind of maintenance was the reason for the shutdown? 
> 
> RAM upgrade. (The server was running OK before the shutdown, no hardware
> problem.) (I had just shut down the server and had not started it again yet
> when the problem occurred.)
> 
> 
> 
> >>Do you use the default corosync timeout values, or do you have a special 
> >>setup? 
> 
> 
> No special tuning, default values. (I haven't had any retransmits in the
> logs for months.)
> 
> >>Can you please post the full corosync config? 
> 
> (I have verified that the running corosync version was 3.0.3, with libknet
> 1.15.)
> 
> 
> Here is the config:
> 
> " 
> logging { 
> debug: off 
> to_syslog: yes 
> } 
> 
> nodelist { 
> node { 
> name: m6kvm1 
> nodeid: 1 
> quorum_votes: 1 
> ring0_addr: m6kvm1 
> } 
> node { 
> name: m6kvm10 
> nodeid: 10 
> quorum_votes: 1 
> ring0_addr: m6kvm10 
> } 
> node { 
> name: m6kvm11 
> nodeid: 11 
> quorum_votes: 1 
> ring0_addr: m6kvm11 
> } 
> node { 
> name: m6kvm12 
> nodeid: 12 
> quorum_votes: 1 
> ring0_addr: m6kvm12 
> } 
> node { 
> name: m6kvm13 
> nodeid: 13 
> quorum_votes: 1 
> ring0_addr: m6kvm13 
> } 
> node { 
> name: m6kvm14 
> nodeid: 14 
> quorum_votes: 1 
> ring0_addr: m6kvm14 
> } 
> node { 
> name: m6kvm2 
> nodeid: 2 
> quorum_votes: 1 
> ring0_addr: m6kvm2 
> } 
> node { 
> name: m6kvm3 
> nodeid: 3 
> quorum_votes: 1 
> ring0_addr: m6kvm3 
> } 
> node { 
> name: m6kvm4 
> nodeid: 4 
> quorum_votes: 1 
> ring0_addr: m6kvm4 
> } 
> node { 
> name: m6kvm5 
> nodeid: 5 
> quorum_votes: 1 
> ring0_addr: m6kvm5 
> } 
> node { 
> name: m6kvm6 
> nodeid: 6 
> quorum_votes: 1 
> ring0_addr: m6kvm6 
> } 
> node { 
> name: m6kvm7 
> nodeid: 7 
> quorum_votes: 1 
> ring0_addr: m6kvm7 
> } 
> 
> node { 
> name: m6kvm8 
> nodeid: 8 
> quorum_votes: 1 
> ring0_addr: m6kvm8 
> } 
> node { 
> name: m6kvm9 
> nodeid: 9 
> quorum_votes: 1 
> ring0_addr: m6kvm9 
> } 
> } 
> 
> quorum { 
> provider: corosync_votequorum 
> } 
> 
> totem { 
> cluster_name: m6kvm 
> config_version: 19 
> interface { 
> bindnetaddr: 10.3.94.89 
> ringnumber: 0 
> } 
> ip_version: ipv4 
> secauth: on 
> transport: knet 
> version: 2 
> }
> "
> 
> 
> 
> ----- Original Message -----
> From: "dietmar" <diet...@proxmox.com>
> To: "aderumier" <aderum...@odiso.com>, "Proxmox VE development discussion"
> <pve-devel@lists.proxmox.com>
> Cc: "pve-devel" <pve-de...@pve.proxmox.com>
> Sent: Sunday, 6 September 2020 14:14:06
> Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean
> shutdown
> 
> > Sep 3 10:40:51 m6kvm7 pve-ha-lrm[16140]: loop take too long (87 seconds) 
> > Sep 3 10:40:51 m6kvm7 pve-ha-crm[16196]: loop take too long (92 seconds) 
> 
> Indeed, this should not happen. Do you use a separate network for corosync?
> Or was there high traffic on the network? What kind of maintenance was the
> reason for the shutdown?


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 

