Hi all,

Just came back from this year's Cephalocon and managed to have a quick chat
with Ronen about this issue. He gave a great presentation[1, 2] on the
upcoming changes to scrubbing in Tentacle, as well as on some changes already
made in the Squid release.
The primary suspect here is the mclock scheduler and the way replica
reservations are made since 19.2.0. A regular scrub begins with the primary
asking every acting-set replica to allow the scrub to proceed; each replica
either grants the request immediately or queues it. As I understand it,
previous releases would, instead of queuing, send a plain deny on the spot
when resources were thin (that happens when the scrub map is requested from
the acting-set members, but I might be wrong). For some reason, with mclock
this can lead to acting sets endlessly queuing these scrub requests and never
actually completing them.
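For anyone who wants to double-check what their cluster is actually running,
something like this should confirm whether mclock is in play at all (osd.0 is
just a placeholder for any OSD in the cluster):

    # which op queue scheduler is configured (mclock_scheduler vs wpq)
    ceph config get osd osd_op_queue
    # what a running OSD actually reports, plus its active mclock profile
    ceph config show osd.0 osd_op_queue
    ceph config show osd.0 osd_mclock_profile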
As far as the configuration goes: in Squid the osd_scrub_cost option is set
to 52428800, for some reason. I'm having a hard time finding the values used
in previous releases, but the Red Hat docs [3] list it as 50 << 20 (which,
notably, evaluates to the same 52428800). Unless the whole logic/calculation
around it has changed, such an enormous cost will simply never allow
resources to be granted under mclock.
Another suspect is osd_scrub_event_cost, which is set to 4096. Once again,
I'm having a hard time finding values from previous versions to compare
against.
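For what it's worth, here's the quick sanity check on the numbers plus how
we're reading the current values off our cluster (nothing Squid-specific,
just the generic config commands):

    # 50 << 20 is 50 * 2^20, i.e. exactly 52428800 (50 MiB)
    echo $((50 << 20))
    # current values as the cluster sees them
    ceph config get osd osd_scrub_cost
    ceph config get osd osd_scrub_event_cost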

One thing we've found is that there is now a config option
osd_scrub_disable_reservation_queuing (default: false): "When set - scrub
replica reservations are responded to immediately, with either success or
failure (the pre-Squid version behaviour). This configuration option is
introduced to support mixed-version clusters and debugging, and will be
removed in the next release." My guess is that setting this to true would
simply revert reservation handling back to how Reef and earlier releases did
it.
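If we do end up needing it, flipping that switch should be a one-liner
(assuming it can be applied at runtime, which I haven't verified yet):

    # respond to replica scrub reservations immediately instead of queuing
    ceph config set osd osd_scrub_disable_reservation_queuing true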

To keep all the work done on the scrubbing changes in place, we will first
try reducing osd_scrub_cost to a much lower value (50 or even less) and check
whether that helps our case. If not, we will reduce osd_scrub_event_cost as
well, since at this point we're not sure which of the two has the direct
impact.
If that doesn't help, we will have to set
osd_scrub_disable_reservation_queuing to true, but that simply leaves us with
the old way scrubs are done (not cool - we want the fancy new way). And if
even that doesn't help, we will have to start thinking about switching to wpq
instead of mclock, which is also not that appealing given where Ceph is
heading.
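For reference, the staged plan boils down to roughly this (the values are
our guesses, not recommendations, and switching osd_op_queue only takes
effect after an OSD restart):

    # step 1: drastically lower the scrub reservation cost
    ceph config set osd osd_scrub_cost 50
    # step 2 (if needed): lower the per-event cost too (512 is a placeholder)
    ceph config set osd osd_scrub_event_cost 512
    # step 3: the osd_scrub_disable_reservation_queuing fallback shown above
    # last resort: fall back to wpq, then restart the OSDs
    ceph config set osd osd_op_queue wpq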
 
I'll keep the mailing list (and tracker) updated with our findings.

Best,
Laimis J.


1 - https://ceph2024.sched.com/event/1ktWh/the-scrub-type-to-limitations-matrix-ronen-friedman-ibm
2 - https://static.sched.com/hosted_files/ceph2024/08/ceph24_main%20%284%29.pdf
3 - https://docs.redhat.com/en/documentation/red_hat_ceph_storage/2/html/configuration_guide/osd_configuration_reference#scrubbing