Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Thomas Lamprecht Tue, 15 Sep 2020 06:00:44 -0700

On 9/15/20 2:49 PM, Alexandre DERUMIER wrote:
> Hi,
> 
> I have reproduced it again,
> 
> now I can't write to /etc/pve/ from any node
>


OK, so it really seems to be an issue in pmxcfs or between corosync and pmxcfs,
not in the HA LRM or the watchdog mux itself.

Can you try giving pmxcfs real-time scheduling, e.g., by doing:

# systemctl edit pve-cluster

And then add this snippet:


[Service]
CPUSchedulingPolicy=rr
CPUSchedulingPriority=99


And then restart pve-cluster.
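
To check that the override actually took effect after the restart, something
like the following could be used. This is only a rough sketch: it assumes a
root shell, that chrt from util-linux is installed, and that the daemon
started by pve-cluster runs under the process name pmxcfs. The show_sched
helper is a made-up name for illustration.

```shell
# Hypothetical helper: print the scheduling policy and priority of a PID.
# chrt -p <pid> only reads the current settings, it does not change them.
show_sched() {
    chrt -p "$1"
}

# On the affected node, as root (left commented out so nothing is
# restarted by accident):
#   systemctl restart pve-cluster
#   show_sched "$(pidof pmxcfs)"
```

If the drop-in was picked up, the policy reported for pmxcfs should be
SCHED_RR with priority 99 rather than the default SCHED_OTHER.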

> I have also added some debug logs to pve-ha-lrm, and it was stuck in:
> (but if /etc/pve is locked, this is normal)
> 
>         if ($fence_request) {
>             $haenv->log('err', "node need to be fenced - releasing agent_lock\n");
>             $self->set_local_status({ state => 'lost_agent_lock'});
>         } elsif (!$self->get_protected_ha_agent_lock()) {
>             $self->set_local_status({ state => 'lost_agent_lock'});
>         } elsif ($self->{mode} eq 'maintenance') {
>             $self->set_local_status({ state => 'maintenance'});
>         }
> 
> 
> corosync quorum is currently ok
> 
> I'm currently digging the logs

Is your simplest/most stable reproducer still a periodic restart of corosync on
one node?


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
