Hi,

I have reproduced it again; now I can't write to /etc/pve/ from any node.

I have also added some debug logs to pve-ha-lrm, and it was stuck in the
following branch (but if /etc/pve is locked, this is normal):

    if ($fence_request) {
        $haenv->log('err', "node need to be fenced - releasing agent_lock\n");
        $self->set_local_status({ state => 'lost_agent_lock'});
    } elsif (!$self->get_protected_ha_agent_lock()) {
        $self->set_local_status({ state => 'lost_agent_lock'});
    } elsif ($self->{mode} eq 'maintenance') {
        $self->set_local_status({ state => 'maintenance'});
    }

corosync quorum is currently ok.

I'm currently digging through the logs.

----- Original Message -----
From: "aderumier" <aderum...@odiso.com>
To: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Cc: "Thomas Lamprecht" <t.lampre...@proxmox.com>
Sent: Tuesday, 15 September 2020 13:04:31
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

also the logs of node14, where the lrm loop was not too long:
https://gist.github.com/aderumier/a2e2d6afc7e04646c923ae6f37cb6c2d

----- Original Message -----
From: "aderumier" <aderum...@odiso.com>
To: "Thomas Lamprecht" <t.lampre...@proxmox.com>
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Sent: Tuesday, 15 September 2020 12:15:47
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

here are the logs from the previous restart:

node1 -> corosync restart at 10:46:15
-----
https://gist.github.com/aderumier/0992051d20f51270ceceb5b3431d18d7

node2
-----
https://gist.github.com/aderumier/eea0c50fefc1d8561868576f417191ba

node5
-----
https://gist.github.com/aderumier/f2ce1bc5a93827045a5691583bbc7a37

----- Original Message -----
From: "Thomas Lamprecht" <t.lampre...@proxmox.com>
To: "aderumier" <aderum...@odiso.com>, "Proxmox VE development discussion" <pve-devel@lists.proxmox.com>
Cc: "dietmar" <diet...@proxmox.com>
Sent: Tuesday, 15 September 2020 11:46:51
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

On 9/15/20 11:35 AM, Alexandre DERUMIER wrote:
> Hi,
>
> I have finally reproduced it!
>
> But this is with a corosync restart from cron every 1 minute, on node1.
>
> Then the lrm was stuck for too long (around 60s), and the softdog was
> triggered on multiple other nodes.
>
> here are the logs with full corosync debug at the time of the last
> corosync restart:
>
> node1 (where corosync is restarted each minute)
> https://gist.github.com/aderumier/c4f192fbce8e96759f91a61906db514e
>
> node2
> https://gist.github.com/aderumier/2d35ea05c1fbff163652e564fc430e67
>
> node5
> https://gist.github.com/aderumier/df1d91cddbb6e15bb0d0193ed8df9273
>
> I'll prepare logs from the previous corosync restart, as the lrm seems
> to be already stuck before that.

Yeah, that would be good, as yes, the lrm seems to get stuck at around 10:46:21:

> Sep 15 10:47:26 m6kvm2 pve-ha-lrm[3736]: loop take too long (65 seconds)

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
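P.S. for reference, this is the kind of standalone probe I can run on each
node to check whether a write to /etc/pve actually hangs rather than failing
fast (normally pmxcfs only goes read-only when quorum is lost, so a hang with
quorum ok is the interesting case). Just a sketch using core Perl only; the
probe file name and timeout are arbitrary:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Probe whether /etc/pve (pmxcfs) accepts a write within a timeout.
    # The probe file name below is arbitrary.
    my $testfile = "/etc/pve/.write-probe";
    my $timeout  = 10;    # seconds before we declare the write stuck

    my $ok = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm($timeout);
        open(my $fh, '>', $testfile) or die "open failed: $!\n";
        print $fh "probe\n" or die "write failed: $!\n";
        close($fh) or die "close failed: $!\n";
        unlink($testfile);
        alarm(0);
        1;
    };
    alarm(0);    # make sure no alarm is left pending

    if ($ok) {
        print "write to /etc/pve completed within ${timeout}s\n";
    } else {
        print "write to /etc/pve failed or hung: $@";
    }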