About node1: the /etc/pve directory seems to be in a bad state; that's why it can't 
be mounted.

ls -lah /etc/pve:
  ??????????   ? ?    ?         ?            ? pve

I forced a lazy umount: 
umount -l /etc/pve 

and now it's working fine.

(So maybe when pmxcfs was killed, it didn't cleanly unmount /etc/pve?) 
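For reference, the check-and-recover step above can be sketched as a small shell helper (hypothetical, not from the thread): stat() on a dead FUSE mountpoint fails (typically with ENOTCONN, "Transport endpoint is not connected"), which is a usable signal that a lazy umount is needed. The DRY_RUN knob is purely illustrative.

```shell
# Sketch only: stat() fails on a dead FUSE mountpoint (ENOTCONN), but it
# also fails on a missing path (ENOENT), so a real check should inspect
# the actual error. With DRY_RUN set, the helper only prints the command
# it would run instead of executing it.
recover_stale_mount() {
    local dir="$1"
    if ! stat "$dir" >/dev/null 2>&1; then
        if [ -n "${DRY_RUN:-}" ]; then
            echo "umount -l $dir"
        else
            umount -l "$dir"
        fi
    fi
}
```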


----- Original Message ----- 
From: "aderumier" <aderum...@odiso.com> 
To: "Thomas Lamprecht" <t.lampre...@proxmox.com> 
Cc: "Proxmox VE development discussion" <pve-devel@lists.proxmox.com> 
Sent: Tuesday, September 15, 2020 16:09:50 
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

>> 
>>Can you try to give pmxcfs real time scheduling, e.g., by doing: 
>> 
>># systemctl edit pve-cluster 
>> 
>>And then add snippet: 
>> 
>> 
>>[Service] 
>>CPUSchedulingPolicy=rr 
>>CPUSchedulingPriority=99 

yes, sure, I'll do it now 
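(A quick way to verify the override took effect after the restart, sketched with chrt from util-linux; with the drop-in applied it should report SCHED_RR and priority 99 for the pmxcfs pid:)

```shell
# Print the scheduling policy and priority of a given pid. For pmxcfs with
# the CPUSchedulingPolicy=rr / CPUSchedulingPriority=99 drop-in applied,
# this should show SCHED_RR with priority 99.
show_sched() {
    chrt -p "$1"
}
# e.g.: show_sched "$(pidof pmxcfs)"
```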


> I'm currently digging the logs 
>>Is your simplest/most stable reproducer still a periodic restart of corosync 
>>on one node? 

yes, a simple "systemctl restart corosync" on one node every minute 
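For completeness, the reproducer loop looks roughly like this (a sketch; the DRY_RUN and SLEEP knobs are added here for illustration, and this is obviously not something to run on a production cluster):

```shell
# Restart corosync on one node once per minute to trigger the hang.
# DRY_RUN prints the command instead of executing it; SLEEP overrides the
# interval. Both are illustrative knobs, not part of the real reproducer.
reproduce_corosync_restarts() {
    local iterations="${1:-60}"
    local i
    for i in $(seq 1 "$iterations"); do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "systemctl restart corosync"
        else
            systemctl restart corosync
        fi
        sleep "${SLEEP:-60}"
    done
}
```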



After one hour, it's still locked. 

on the other nodes, I still see pmxcfs logs like: 

Sep 15 15:36:31 m6kvm2 pmxcfs[3474]: [status] notice: received log 
Sep 15 15:46:21 m6kvm2 pmxcfs[3474]: [status] notice: received log 
Sep 15 15:46:23 m6kvm2 pmxcfs[3474]: [status] notice: received log 
... 


On node1, I just restarted the pve-cluster service with "systemctl restart 
pve-cluster"; 
the pmxcfs process was killed, but it could not be started again, 
and after that /etc/pve became writable again on the other nodes. 

(I haven't rebooted node1 yet, in case you want to run more tests on pmxcfs.) 



root@m6kvm1:~# systemctl status pve-cluster 
● pve-cluster.service - The Proxmox VE cluster filesystem 
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor 
preset: enabled) 
Active: failed (Result: exit-code) since Tue 2020-09-15 15:52:11 CEST; 3min 29s 
ago 
Process: 12536 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION) 

Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Service 
RestartSec=100ms expired, scheduling restart. 
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Scheduled restart job, 
restart counter is at 5. 
Sep 15 15:52:11 m6kvm1 systemd[1]: Stopped The Proxmox VE cluster filesystem. 
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Start request repeated 
too quickly. 
Sep 15 15:52:11 m6kvm1 systemd[1]: pve-cluster.service: Failed with result 
'exit-code'. 
Sep 15 15:52:11 m6kvm1 systemd[1]: Failed to start The Proxmox VE cluster 
filesystem. 
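The "Start request repeated too quickly" lines above mean systemd hit the unit's start rate limit (StartLimitBurst within StartLimitIntervalSec), so even a fixed pmxcfs would not be started again until the failed state is cleared. A sketch of the recovery step (the DRY_RUN knob is illustrative only):

```shell
# Clear systemd's failed state / start-limit counter for a unit, then try
# starting it again. With DRY_RUN set, only print the commands.
restart_after_start_limit() {
    local unit="$1"
    local cmd
    for cmd in "systemctl reset-failed $unit" "systemctl start $unit"; do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "$cmd"
        else
            $cmd
        fi
    done
}
# e.g.: restart_after_start_limit pve-cluster
```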

Output of a manual "pmxcfs -d" run: 
https://gist.github.com/aderumier/4cd91d17e1f8847b93ea5f621f257c2e 




Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2 
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service 
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2 
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service 
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection 
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2 
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service 
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection 
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2 
Sep 15 14:38:24 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service 
Sep 15 14:38:30 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info 
(cluster name m6kvm, version = 20) 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 
3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 
12/3795, 13/3776, 14/3778 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data 
syncronisation 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 
3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 
12/3795, 13/3776, 14/3778 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: starting data 
syncronisation 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request 
(epoch 1/3491/00000064) 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: received sync request 
(epoch 1/3491/00000063) 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 
3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 
12/3795, 13/3776, 14/3778 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from 
leader 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue 
length 23 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: received all states 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [status] notice: dfsm_deliver_queue: queue 
length 157 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to 
commit (got 4 inode updates) 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date 
Sep 15 14:38:32 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: 
queue length 31 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2 
Sep 15 14:39:25 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service 
Sep 15 14:39:31 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info 
(cluster name m6kvm, version = 20) 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 
3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 
12/3795, 13/3776, 14/3778 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data 
syncronisation 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 
3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 
12/3795, 13/3776, 14/3778 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: starting data 
syncronisation 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request 
(epoch 1/3491/00000065) 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: received sync request 
(epoch 1/3491/00000064) 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 
3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 
12/3795, 13/3776, 14/3778 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from 
leader 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue 
length 20 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: received all states 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to 
commit (got 9 inode updates) 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date 
Sep 15 14:39:33 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: 
queue length 25 
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2 
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2 
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2 
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2 
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2 
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2 
Sep 15 14:40:26 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum 
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2 
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service 
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2 
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service 
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection 
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2 
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service 
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection 
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2 
Sep 15 14:40:27 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service 
Sep 15 14:40:33 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info 
(cluster name m6kvm, version = 20) 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 
3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 
12/3795, 13/3776, 14/3778 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data 
syncronisation 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 
3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 
12/3795, 13/3776, 14/3778 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: starting data 
syncronisation 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request 
(epoch 1/3491/00000066) 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: received sync request 
(epoch 1/3491/00000065) 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: received all states 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: leader is 2/3474 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: synced members: 2/3474, 
3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 
12/3795, 13/3776, 14/3778 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: waiting for updates from 
leader 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_queue: queue 
length 23 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: received all states 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [status] notice: dfsm_deliver_queue: queue 
length 87 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: update complete - trying to 
commit (got 6 inode updates) 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: all data is up to date 
Sep 15 14:40:34 m6kvm1 pmxcfs[3491]: [dcdb] notice: dfsm_deliver_sync_queue: 
queue length 33 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_dispatch failed: 2 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_dispatch failed: 2 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_leave failed: 2 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_dispatch failed: 2 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_leave failed: 2 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_dispatch failed: 2 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] notice: node lost quorum 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: quorum_initialize failed: 2 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [quorum] crit: can't initialize service 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: cmap_initialize failed: 2 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [confdb] crit: can't initialize service 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] notice: start cluster connection 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: cpg_initialize failed: 2 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [dcdb] crit: can't initialize service 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] notice: start cluster connection 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: cpg_initialize failed: 2 
Sep 15 14:41:28 m6kvm1 pmxcfs[3491]: [status] crit: can't initialize service 
Sep 15 14:41:34 m6kvm1 pmxcfs[3491]: [status] notice: update cluster info 
(cluster name m6kvm, version = 20) 
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: node has quorum 
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: members: 1/3491, 2/3474, 
3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 
12/3795, 13/3776, 14/3778 
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: starting data 
syncronisation 
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: members: 1/3491, 2/3474, 
3/3566, 4/3805, 5/3835, 6/3862, 7/3797, 8/3808, 9/9541, 10/3787, 11/3799, 
12/3795, 13/3776, 14/3778 
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: starting data 
syncronisation 
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [dcdb] notice: received sync request 
(epoch 1/3491/00000067) 
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: received sync request 
(epoch 1/3491/00000066) 
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: received all states 
Sep 15 14:41:35 m6kvm1 pmxcfs[3491]: [status] notice: all data is up to date 
Sep 15 14:47:54 m6kvm1 pmxcfs[3491]: [status] notice: received log 
Sep 15 15:02:55 m6kvm1 pmxcfs[3491]: [status] notice: received log 
Sep 15 15:17:56 m6kvm1 pmxcfs[3491]: [status] notice: received log 
Sep 15 15:32:57 m6kvm1 pmxcfs[3491]: [status] notice: received log 
Sep 15 15:47:58 m6kvm1 pmxcfs[3491]: [status] notice: received log 

----> restart 
2352 [ 15/09/2020 15:52:00 ] systemctl restart pve-cluster 


Sep 15 15:52:10 m6kvm1 pmxcfs[12438]: [main] crit: fuse_mount error: Transport 
endpoint is not connected 
Sep 15 15:52:10 m6kvm1 pmxcfs[12438]: [main] notice: exit proxmox configuration 
filesystem (-1) 
Sep 15 15:52:10 m6kvm1 pmxcfs[12529]: [main] crit: fuse_mount error: Transport 
endpoint is not connected 
Sep 15 15:52:10 m6kvm1 pmxcfs[12529]: [main] notice: exit proxmox configuration 
filesystem (-1) 
Sep 15 15:52:10 m6kvm1 pmxcfs[12531]: [main] crit: fuse_mount error: Transport 
endpoint is not connected 
Sep 15 15:52:10 m6kvm1 pmxcfs[12531]: [main] notice: exit proxmox configuration 
filesystem (-1) 
Sep 15 15:52:11 m6kvm1 pmxcfs[12533]: [main] crit: fuse_mount error: Transport 
endpoint is not connected 
Sep 15 15:52:11 m6kvm1 pmxcfs[12533]: [main] notice: exit proxmox configuration 
filesystem (-1) 
Sep 15 15:52:11 m6kvm1 pmxcfs[12536]: [main] crit: fuse_mount error: Transport 
endpoint is not connected 
Sep 15 15:52:11 m6kvm1 pmxcfs[12536]: [main] notice: exit proxmox configuration 
filesystem (-1) 


Some interesting dmesg output about "pvesr": 

[Tue Sep 15 14:45:34 2020] INFO: task pvesr:19038 blocked for more than 120 
seconds. 
[Tue Sep 15 14:45:34 2020] Tainted: P O 5.4.60-1-pve #1 
[Tue Sep 15 14:45:34 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message. 
[Tue Sep 15 14:45:34 2020] pvesr D 0 19038 1 0x00000080 
[Tue Sep 15 14:45:34 2020] Call Trace: 
[Tue Sep 15 14:45:34 2020] __schedule+0x2e6/0x6f0 
[Tue Sep 15 14:45:34 2020] ? filename_parentat.isra.57.part.58+0xf7/0x180 
[Tue Sep 15 14:45:34 2020] schedule+0x33/0xa0 
[Tue Sep 15 14:45:34 2020] rwsem_down_write_slowpath+0x2ed/0x4a0 
[Tue Sep 15 14:45:34 2020] down_write+0x3d/0x40 
[Tue Sep 15 14:45:34 2020] filename_create+0x8e/0x180 
[Tue Sep 15 14:45:34 2020] do_mkdirat+0x59/0x110 
[Tue Sep 15 14:45:34 2020] __x64_sys_mkdir+0x1b/0x20 
[Tue Sep 15 14:45:34 2020] do_syscall_64+0x57/0x190 
[Tue Sep 15 14:45:34 2020] entry_SYSCALL_64_after_hwframe+0x44/0xa9 




----- Original Message ----- 
From: "Thomas Lamprecht" <t.lampre...@proxmox.com> 
To: "aderumier" <aderum...@odiso.com>, "Proxmox VE development discussion" 
<pve-devel@lists.proxmox.com> 
Sent: Tuesday, September 15, 2020 15:00:03 
Subject: Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown 

On 9/15/20 2:49 PM, Alexandre DERUMIER wrote: 
> Hi, 
> 
> I have produce it again, 
> 
> now I can't write to /etc/pve/ from any node 
> 

OK, so it really seems to be an issue in pmxcfs, or between corosync and pmxcfs, 
not in the HA LRM or the watchdog mux itself. 

Can you try to give pmxcfs real time scheduling, e.g., by doing: 

# systemctl edit pve-cluster 

And then add snippet: 


[Service] 
CPUSchedulingPolicy=rr 
CPUSchedulingPriority=99 


And restart pve-cluster 

> I have also added some debug logs to pve-ha-lrm, and it was stuck in: 
> (but if /etc/pve is locked, this is normal) 
> 
> if ($fence_request) { 
>     $haenv->log('err', "node need to be fenced - releasing agent_lock\n"); 
>     $self->set_local_status({ state => 'lost_agent_lock'}); 
> } elsif (!$self->get_protected_ha_agent_lock()) { 
>     $self->set_local_status({ state => 'lost_agent_lock'}); 
> } elsif ($self->{mode} eq 'maintenance') { 
>     $self->set_local_status({ state => 'maintenance'}); 
> } 
> 
> 
> corosync quorum is currently ok 
> 
> I'm currently digging the logs 
Is your simplest/most stable reproducer still a periodic restart of corosync on 
one node? 


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
