Re: [pve-devel] corosync bug: cluster break after 1 node clean shutdown

Thomas Lamprecht Tue, 29 Sep 2020 23:27:58 -0700

Hi,

On 30.09.20 08:09, Alexandre DERUMIER wrote:
> some news, my last test has been running for 14h now, and I haven't had any
> problems :)
>


Great! Thanks for all your testing time. This would have been much harder,
if possible at all, without you providing so much testing effort on a
production(!) cluster - appreciated!

Naturally many thanks to Fabian too, for reading so many logs without going
insane :-)

> So, it seems that it is indeed fixed! Congratulations!
> 

honza confirmed Fabian's suspicion that cpg_mcast_joined comes with no
guarantee of thread safety, which was sadly not documented. So this is surely
a bug; let's hope it is the last of these hard-to-reproduce ones.
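
For illustration only, here is a minimal sketch of the general pattern for
calling a function that gives no thread-safety guarantee: funnel every call
through one lock. This is not the actual pmxcfs fix. FakeCpgHandle and its
mcast_joined method are hypothetical stand-ins for a CPG handle and
cpg_mcast_joined, and Python is used only to keep the sketch short.

```python
import threading
import time


class FakeCpgHandle:
    """Hypothetical stand-in for a handle whose send call must not run concurrently."""

    def __init__(self):
        self._busy = False
        self.sent = []
        self.overlaps = 0  # number of times two callers were inside the send at once

    def mcast_joined(self, payload):
        # Hypothetical stand-in for cpg_mcast_joined: it only records overlapping calls.
        if self._busy:
            self.overlaps += 1
        self._busy = True
        time.sleep(0.001)  # widen the window in which a race could happen
        self.sent.append(payload)
        self._busy = False


class SerializedSender:
    """Wraps the handle so that only one thread at a time can be inside the send call."""

    def __init__(self, handle):
        self._handle = handle
        self._lock = threading.Lock()

    def send(self, payload):
        with self._lock:
            self._handle.mcast_joined(payload)


def run_senders(send, n_threads=4, per_thread=5):
    """Call send() from several threads at once and wait for all of them to finish."""

    def worker(tid):
        for i in range(per_thread):
            send((tid, i))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Passing SerializedSender(handle).send to run_senders keeps handle.overlaps at
zero, because the lock is held for the whole call. Passing handle.mcast_joined
directly gives no such guarantee.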

> 
> 
> I wonder if it could be related to this forum user
> https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/
> 
> His problem is that after corosync lag (he has 1 cluster stretched across 2 DCs
> with 10km distance, so I think he sometimes has some small lag),
> 1 node is flooding the other nodes with a lot of UDP packets (and making things
> worse, as the corosync CPU usage goes to 100% / overloaded, and it then can't see
> the other nodes).

I can imagine this problem showing up as a side effect of a flood where
partition changes happen. Not so sure that this can be the cause of that
directly.

> 
> I had this problem 6 months ago after shutting down a node, that's why I'm
> thinking it could "maybe" be related.
> 
> So, I wonder if it could be the same pmxcfs bug, where something is looping or
> sending packets again and again.
> 
> The forum user seems to have had the problem multiple times within a few weeks,
> so maybe he'll be able to test the new fixed pmxcfs, and tell us if it's fixing
> this bug too.

Testing it once it's available would surely be a good idea for them.



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
