I found another thread very similar to this one about setting
osd_async_recovery_min_cost=0, but that still didn't help.
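
For reference, this is roughly how I applied it (cluster-wide for all OSDs; it
could also be scoped to a single OSD such as osd.363):

# workaround suggested in that thread: disable the async recovery cost threshold
ceph config set osd osd_async_recovery_min_cost 0
# sanity check that the value is in effect
ceph config get osd osd_async_recovery_min_cost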

This time I have an index pool OSD (363) that generates slow ops from the start
of the recovery until the end of it (the read latency on this OSD spikes
sky-high, to around 150 ms).
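
(The latency figure comes from the per-OSD stats; something like the following
is enough to see osd.363 stand out. The perf counter I grep for is just the
read-latency one, as an example.)

# per-OSD commit/apply latency in ms; osd.363 climbs to ~150 ms during recovery
ceph osd perf | awk '$1==363'
# more detail from the daemon itself (run on the node hosting osd.363)
ceph daemon osd.363 perf dump | grep -A3 '"op_r_latency"'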

What seems weird are the PG acting sets:
PG_STAT  STATE                                               UP             UP_PRIMARY  ACTING     ACTING_PRIMARY
26.509   active+recovery_wait+undersized+degraded+remapped   [363,762,744]  363         [363,744]  363
26.4dd   active+recovery_wait+undersized+degraded+remapped   [763,522,363]  763         [363,522]  363
26.120   active+undersized+degraded+remapped+backfill_wait   [363,109,274]  363         [363,109]  363
26.6c    active+recovering+undersized+degraded+remapped      [363,273,772]  363         [363,772]  363
26.222   active+recovery_wait+undersized+degraded+remapped   [597,363,152]  597         [597,363]  597
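
(That output is just a plain PG dump filtered to the index pool; pool 26 is
hkg.rgw.buckets.index here, so roughly:)

# show up/acting sets for the degraded PGs of pool 26
ceph pg dump pgs_brief 2>/dev/null | awk '$1 ~ /^26\./ && $2 ~ /degraded/'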



It doesn't seem good that the acting sets are completely missing the OSDs that
have just been updated from Octopus to Quincy. But with size 3 and min_size 2,
I think it should still be possible to write to those PGs and they should work
properly.
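
(The replication assumption can be double-checked with something like:)

ceph osd pool get hkg.rgw.buckets.index size      # expect 3
ceph osd pool get hkg.rgw.buckets.index min_size  # expect 2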


BTW: osd_recovery_max_active, osd_recovery_op_priority and osd_max_backfills
are all set to 1.
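
In case the exact commands matter, those throttles are set roughly like this:

# keep recovery/backfill as gentle as possible
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_op_priority 1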

________________________________
From: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>
Sent: Saturday, November 2, 2024 6:45 AM
To: Ceph Users <ceph-users@ceph.io>
Subject: Slow ops during index pool recovery causes cluster performance drop to 
1%

Hi,

I'm updating our cluster from Octopus to Quincy, and whenever index pool
recovery kicks off, cluster performance drops to 1% and slow ops come in
non-stop. The recovery takes 1-2 hours per node.

What I can see is that iowait on the NVMe drives belonging to the index pool is
pretty high, even though throughput is under 500 MB/s and IOPS are under
5000/sec.

The index pool is a 3:2 replica pool with 2048 PGs on 156 OSDs (each NVMe drive
carries 4 OSDs, because we experienced latency issues with 1 or 2 OSDs per
NVMe).
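
(For reference, that layout can be confirmed roughly like this; osd.363 below
is just one example of an index OSD:)

# size/min_size/pg_num of the index pool
ceph osd pool ls detail | grep buckets.index
# which physical device a given index OSD sits on
ceph osd metadata 363 | grep -E 'devices|device_paths'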

If we assume the NVMe drives really are slow even under this fairly small load,
how could we ease the situation and get rid of this cluster-wide performance
drop?
Would increasing the replica count to 4-5 help? Maybe it could tolerate more PG
slowness?
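
(If raising the replica count turned out to be the way to go, I guess it would
just be something like the following, although I'm not sure it actually helps
and it would move a lot of data:)

ceph osd pool set hkg.rgw.buckets.index size 4
# min_size could stay at 2, or be raised afterwards if desired
ceph osd pool set hkg.rgw.buckets.index min_size 2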

FYI, we have a lot of objects in the cluster, more than 4 billion: 4.06G
objects, 616 TiB.

Still, I think the cluster should be able to tolerate recovery without this
kind of penalty.

What I can see in the slow OSD's log with the default debug values is mostly
"get_health_metrics" so far:

2024-11-02T12:38:40.762+0700 7f241bc25640  0 log_channel(cluster) log [WRN] : 6 
slow requests (by type [ 'delayed' : 6 ] most affected pool [ 
'hkg.rgw.buckets.index' : 6 ])
2024-11-02T12:38:41.802+0700 7f241bc25640 -1 osd.110 626281 get_health_metrics 
reporting 7 slow ops, oldest is osd_op(client.3641786447.0:2194661324 26.588 
26:11aa561a:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.503457179.1.10:head 
[call rgw.bucket_list in=47b] snapc 0=[] 
ondisk+read+known_if_redirected+supports_pool_eio e626262)
2024-11-02T12:38:41.802+0700 7f241bc25640  0 log_channel(cluster) log [WRN] : 7 
slow requests (by type [ 'delayed' : 7 ] most affected pool [ 
'hkg.rgw.buckets.index' : 7 ])
2024-11-02T12:38:42.782+0700 7f241bc25640 -1 osd.110 626282 get_health_metrics 
reporting 7 slow ops, oldest is osd_op(client.3641786447.0:2194661324 26.588 
26:11aa561a:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.503457179.1.10:head 
[call rgw.bucket_list in=47b] snapc 0=[] 
ondisk+read+known_if_redirected+supports_pool_eio e626262)
2024-11-02T12:38:42.782+0700 7f241bc25640  0 log_channel(cluster) log [WRN] : 7 
slow requests (by type [ 'delayed' : 7 ] most affected pool [ 
'hkg.rgw.buckets.index' : 7 ])
2024-11-02T12:38:43.802+0700 7f241bc25640 -1 osd.110 626282 get_health_metrics 
reporting 7 slow ops, oldest is osd_op(client.3641786447.0:2194661324 26.588 
26:11aa561a:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.503457179.1.10:head 
[call rgw.bucket_list in=47b] snapc 0=[] 
ondisk+read+known_if_redirected+supports_pool_eio e626262)
2024-11-02T12:38:43.802+0700 7f241bc25640  0 log_channel(cluster) log [WRN] : 7 
slow requests (by type [ 'delayed' : 7 ] most affected pool [ 
'hkg.rgw.buckets.index' : 7 ])
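
Beyond get_health_metrics, the stuck ops themselves can be inspected via the
OSD's admin socket, for example on the node hosting osd.110:

# ops currently blocked / in flight on osd.110
ceph daemon osd.110 dump_ops_in_flight
ceph daemon osd.110 dump_blocked_ops
# recent slow ops with their event timelines
ceph daemon osd.110 dump_historic_slow_ops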

We also tried to make it smoother: after the update and machine reboot,
compaction kicks off, which generates 30-40% iowait on the node, so we set the
"noup" flag to keep these OSDs out of the cluster until compaction has
finished. Once iowait is back to 0 after compaction, I unset noup so recovery
can start, and that causes the issue above. If I didn't set noup at all, it
would cause an even bigger issue.
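
For reference, the per-node sequence we follow is roughly this (the OSD id
below is just an example):

# before upgrading/rebooting the node, keep its OSDs from rejoining
ceph osd set noup
# ... upgrade + reboot; let RocksDB compaction finish ...
# an explicit online compaction can also be triggered per OSD, e.g.:
ceph tell osd.363 compact
# once iowait is back to ~0, let the OSDs join and recovery starts
ceph osd unset noup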

Thank you for your help.
