Digging into the wsrep status, I found the following:

On the donor node (30 minutes after the SST completed):
wsrep_local_send_queue_avg      8.77932 (now 8.69799)
wsrep_local_recv_queue_avg      0.192287

On the newly started node:
wsrep_local_send_queue_avg      0.00315457
wsrep_local_recv_queue_avg      61.1511 (now 52.7237)

The large values are decreasing slowly; the "now" figures were taken after I 
finished writing this email.
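
For comparing the two nodes side by side, here is a minimal sketch that parses status lines pasted in the form shown above and lists the queue averages that look backlogged. The helper names and the threshold of 1.0 are assumptions made for illustration, not Galera defaults; as far as I understand the Galera documentation, a wsrep_local_recv_queue_avg well above 0 means the node cannot apply write-sets as fast as it receives them.

```python
# Hypothetical helpers: parse "name value" lines pasted from SHOW GLOBAL STATUS.
def parse_wsrep_status(text):
    values = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only the first number; a trailing "(now ...)" note is ignored.
        if len(parts) >= 2 and parts[0].startswith("wsrep_"):
            values[parts[0]] = float(parts[1])
    return values


def backlogged_queues(values, threshold=1.0):
    # threshold is an arbitrary illustrative cut-off, not a Galera setting.
    return sorted(
        name
        for name, value in values.items()
        if name.endswith("_queue_avg") and value > threshold
    )


donor = parse_wsrep_status("""
wsrep_local_send_queue_avg      8.77932 (now 8.69799)
wsrep_local_recv_queue_avg      0.192287
""")
joiner = parse_wsrep_status("""
wsrep_local_send_queue_avg      0.00315457
wsrep_local_recv_queue_avg      61.1511 (now 52.7237)
""")

print("donor :", backlogged_queues(donor))
print("joiner:", backlogged_queues(joiner))
```

With the figures above, this flags the send queue on the donor and the receive queue on the newly started node.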

From: William Edwards <wedwa...@cyberfusion.nl>
Sent: Wednesday, 27 July 2022 12:45
To: Cédric Counotte <cedric.couno...@1check.com>
Cc: maria-discuss@lists.launchpad.net
Subject: Re: [Maria-discuss] MariaDB server horribly slow on start

Hi,
On 27 Jul 2022, at 12:37, Cédric Counotte <cedric.couno...@1check.com> wrote:

Thanks for your reply!

If the server does an SST, the problem is way more dramatic than when it does 
an IST.

This morning one server crashed and, upon restarting, it did an SST instead of 
an IST, and the impact was severe.
Even before becoming available, it blocked the donor for 15 minutes with 
messages like this one:

2022-07-27 12:02:42 7 [Note] WSREP: Processing event queue:... 20.9% ( 496/2376 
events) complete.

Does the issue occur while these messages are logged?

For a while, the queue was even growing faster than it was being processed.
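
To check whether the queue is growing faster than it is drained, the counts in the "Processing event queue" line can be extracted and compared between two log entries. This is a minimal sketch: the regular expression is written against the single log line quoted above (joined onto one line), and the function name is hypothetical.

```python
import re

# Pattern written against the one quoted log line; other log formats are untested.
PROGRESS = re.compile(
    r"Processing event queue:\.\.\.\s*([\d.]+)% \(\s*(\d+)/(\d+)\s+events\) complete"
)


def parse_progress(line):
    """Return the reported percentage and event counts, or None if no match."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    done, total = int(match.group(2)), int(match.group(3))
    return {
        "reported_pct": float(match.group(1)),
        "done": done,
        "total": total,
        "remaining": total - done,
    }


line = (
    "2022-07-27 12:02:42 7 [Note] WSREP: Processing event queue:... "
    "20.9% ( 496/2376 events) complete."
)
progress = parse_progress(line)
print(progress)
```

If the total in the next matching entry rises by more than the processed count does, the remaining backlog is growing rather than shrinking.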

The same server crashed again, so I started another one and it did an SST. The 
problem was not as dramatic, but processing the event queue lasted 5 minutes 
and blocked the donor completely for that time. On very rare occasions the SST 
does not cause such issues (twice in 6 months, against 2 or 3 dozen 
occurrences of the issue), and I haven't changed any settings since!? Very 
confusing.

When servers do an SST, I usually kill the CHECK TABLE FOR UPGRADE that occurs 
as it appears to slow things down even more.

Notably, this morning I had 3 servers running; one went haywire and caused 
another one to go down! I ended up with a single server, which I had to 
restart because it would complain about not being wsrep ready.


It’s been a very bad day today as those 4 servers are in production and we 
received dozens of calls from our customers.

Again, I’d focus on cause. The effect is clear.


Now I’m back to 2 servers and will wait until tonight to restart the 2 others 
because of that issue.

IMO it’s a bug, as on very rare occasions it starts smoothly. Still, I have 
found Galera to be unreliable, and my company is asking me to install a more 
reliable solution ASAP or we will lose customers! So any help would be much 
appreciated.

Whether something’s a bug is not an opinion.

I’m thinking of using 3 servers with replication instead, keeping load 
balancing based on source IPs, but I’m worried that this might be less 
reliable. We have 2 spare servers in another location, synced with 
replication, but too often after a server crash the replication would no 
longer start and had to be set up again from scratch, which suggests it is 
even less reliable.

Sorry for the long story, but I’m no Galera expert

Then you could indeed wonder if your company should be using Galera …

and I’m having lots of issues I can’t find any info or solution about.

This is another issue I’m facing with replication, although it seems to be 
caused by Galera Cluster: https://jira.mariadb.org/browse/MDEV-29132



From: William Edwards <wedwa...@cyberfusion.nl>
Sent: Wednesday, 27 July 2022 11:58
To: Cédric Counotte <cedric.couno...@1check.com>
Cc: maria-discuss@lists.launchpad.net
Subject: Re: [Maria-discuss] MariaDB server horribly slow on start


On 27 Jul 2022, at 11:46, Cédric Counotte <cedric.couno...@1check.com> wrote:


Hello all. I hope I’m in the right place to ask this question.

I opened a bug here: https://jira.mariadb.org/browse/MDEV-28969, but I was 
told to use this mailing list instead.



We have 4 MariaDB servers in a Galera Cluster, and sometimes a server has to 
be restarted, be it after a crash (for which I still have to open a bug) or 
for maintenance.



When that happens, the restarted server causes a huge slowdown of the whole 
cluster, and it lasts for 10 to 30 minutes at the very least!



And by huge, I mean huge: we end up with 500 to 800 pending queries on all 
servers, as you can see in the attached screenshots.

I’ve attached the configuration of one of the servers for reference, in case 
it is the source of the issue.



Any way to solve this would be greatly appreciated.

You seem to be focusing on effect. What is the cause? SST?




Regards,

3C.
[image001.png]
_______________________________________________
Mailing list: https://launchpad.net/~maria-discuss
Post to     : maria-discuss@lists.launchpad.net
Unsubscribe : https://launchpad.net/~maria-discuss
More help   : https://help.launchpad.net/ListHelp