Hi,

I have reproduced it again.

Now I can't write to /etc/pve/ from any node.


I have also added some debug logs to pve-ha-lrm, and it was stuck in the
following block (but if /etc/pve is locked, this is expected):

        if ($fence_request) {
            $haenv->log('err', "node need to be fenced - releasing agent_lock\n");
            $self->set_local_status({ state => 'lost_agent_lock'});
        } elsif (!$self->get_protected_ha_agent_lock()) {
            $self->set_local_status({ state => 'lost_agent_lock'});
        } elsif ($self->{mode} eq 'maintenance') {
            $self->set_local_status({ state => 'maintenance'});
        }


Corosync quorum is currently OK.

I'm currently digging through the logs.

----- Original message -----
From: "aderumier" <aderum...@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Cc: "Thomas Lamprecht" <t.lampre...@proxmox.com>
Sent: Tuesday, 15 September 2020 13:04:31
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Also, here are the logs of node14, where the lrm loop did not take too long:

https://gist.github.com/aderumier/a2e2d6afc7e04646c923ae6f37cb6c2d 


----- Original message -----
From: "aderumier" <aderum...@odiso.com>
To: "Thomas Lamprecht" <t.lampre...@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Tuesday, 15 September 2020 12:15:47
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Here are the logs from the previous restart:

node1 -> corosync restart at 10:46:15 
----- 
https://gist.github.com/aderumier/0992051d20f51270ceceb5b3431d18d7 


node2 
----- 
https://gist.github.com/aderumier/eea0c50fefc1d8561868576f417191ba 



node5 
------ 
https://gist.github.com/aderumier/f2ce1bc5a93827045a5691583bbc7a37 

----- Original message -----
From: "Thomas Lamprecht" <t.lampre...@proxmox.com>
To: "aderumier" <aderum...@odiso.com>, "Proxmox VE development discussion"
<pve-devel@lists.proxmox.com>
Cc: "dietmar" <diet...@proxmox.com>
Sent: Tuesday, 15 September 2020 11:46:51
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/15/20 11:35 AM, Alexandre DERUMIER wrote: 
> Hi, 
> 
> I have finally reproduced it!
>
> But this is with a corosync restart from cron every minute on node1.
>
> Then the lrm was stuck for around 60s, and softdog was triggered
> on multiple other nodes.
>
> Here are the logs with full corosync debug at the time of the last corosync restart.
>
> node1 (where corosync is restarted every minute)
> https://gist.github.com/aderumier/c4f192fbce8e96759f91a61906db514e 
> 
> node2 
> https://gist.github.com/aderumier/2d35ea05c1fbff163652e564fc430e67 
> 
> node5 
> https://gist.github.com/aderumier/df1d91cddbb6e15bb0d0193ed8df9273 
> 
> I'll prepare logs from the previous corosync restart, as the lrm seems to
> have been stuck already before that.

Yeah, that would be good, as the lrm indeed seems to get stuck at around 10:46:21:

> Sep 15 10:47:26 m6kvm2 pve-ha-lrm[3736]: loop take too long (65 seconds) 


_______________________________________________ 
pve-devel mailing list 
pve-devel@lists.proxmox.com 
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel 


