Hi Michel,

You're probably facing this [1].

Best regards,
Frédéric.

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/FWDE4FSNPLKG4SWKT6IMBPYUXJK6VE63/

----- On 9 June 25, at 11:09, Michel Jouvin michel.jou...@ijclab.in2p3.fr wrote:

> Apologies, I realize that I mentioned new deep scrub parameters in
> Squid where it should have been "new shallow scrub parameters", but it
> doesn't change the reasoning.
> 
> Michel
> 
> On 09/06/2025 at 11:01, Michel Jouvin wrote:
>>
>> Hi,
>>
>> We upgraded one of our production clusters (480 OSDs, 13.5K PGs, most
>> pools EC 9+6) from 18.2.7 to 19.2.2 on May 26. It was healthy when we
>> upgraded it and remained so until last Thursday (June 5) when, running
>> `ceph -s`, I saw that 1 deep scrub and ~1000 scrubs were late:
>>
>>             1 pgs not deep-scrubbed in time
>>             1168 pgs not scrubbed in time
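>>
>> (Side note: `ceph health detail` lists the individual PGs behind
>> these counters, e.g.:
>>
>>     ceph health detail | grep -E 'not (deep-)?scrubbed'
>>
>> which is handy for checking whether the same PGs stay late or the set
>> rotates.)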
>>
>> Looking again on Friday morning (~16 hours later), these numbers had
>> increased a lot:
>>
>>             27 pgs not deep-scrubbed in time
>>             3252 pgs not scrubbed in time
>>
>> Checking our configuration on Friday, we found that osd_max_scrubs
>> was set to 1 instead of using the new default of 3 introduced in Reef
>> (probably a leftover from a config change after a problem 18 months
>> ago). We unset the specific value, which reduced these numbers over
>> the next 24 hours (~2700 scrubs late), but since then (2 days) they
>> were initially stable and are now increasing slowly. This morning the
>> situation is:
>>
>>             294 pgs not deep-scrubbed in time
>>             3013 pgs not scrubbed in time
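>>
>> (For the record, checking and dropping such an override is just:
>>
>>     ceph config get osd osd_max_scrubs        # effective value (was 1)
>>     ceph config rm osd osd_max_scrubs         # remove override, back to the default of 3
>>     ceph tell osd.0 config get osd_max_scrubs # confirm on a running OSD
>>
>> This only applies to values stored in the mon config database; a
>> value set in ceph.conf would have to be removed there instead.)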
>>
>> My guess is that the real issue is the late scrubs, which result in
>> many OSDs reaching the limit of 3 concurrent scrubs, with the
>> consequence that some deep scrubs cannot run either (I have in mind
>> that the limit applies to both shallow scrubs and deep scrubs, am I
>> right?). I checked the main cluster logs and didn't find any error or
>> warning related to OSDs, like slow ops, slow requests... The only
>> thing we have spotted through our monitoring system is a dramatic
>> decrease (~75%) of IOPS on each OSD server right after the 19.2.2
>> upgrade, but it is not necessarily the sign of a problem. I guess it
>> may in particular be a consequence of the new deep scrub parameters,
>> osd_shallow_scrub_chunk_min/max, which are probably intended to
>> reduce the deep scrub IOPS load. The release notes for Squid also
>> mention a change in osd_op_num_shards_hdd and
>> osd_op_num_threads_per_shard_hdd; I don't know if they may also have
>> an impact.
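>>
>> (The effective values of these options on a live OSD can be sampled
>> with something like:
>>
>>     ceph config show-with-defaults osd.0 | \
>>         grep -E 'shallow_scrub_chunk|num_shards_hdd|num_threads_per_shard_hdd'
>>
>> osd.0 being an arbitrary daemon to sample.)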
>>
>> Up to now, no user has reported any issue, so it seems to be a
>> problem only with scrubs. I'm wondering where to start looking for an
>> issue, or whether anything related to 19.2.2 is already known. We
>> increased the deep scrub interval from 10 to 14 days a few days
>> before the upgrade (we saw that there was permanently 1 deep scrub
>> late, a different PG each time) and kept the standard 7-day interval
>> for scrubs. Looking at the number of scrubs and deep scrubs per day,
>> it doesn't look weird (see below).
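>>
>> (The interval change itself was nothing exotic, along the lines of:
>>
>>     # 14 days expressed in seconds: 14 * 24 * 3600
>>     ceph config set osd osd_deep_scrub_interval 1209600
>>
>> with the shallow scrub intervals left at their defaults.)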
>>
>> I guess that if we restart all OSDs we'll clear the problem, but we'd
>> like to understand what happened and be sure it is not something
>> related to Squid before upgrading our other production cluster. Any
>> hint/advice will be highly appreciated. I took snapshots of `ceph pg
>> dump pgs_brief` regularly and I'll try to identify whether there are
>> some stuck scrubs and which OSDs are involved, but with 500 OSDs and
>> 18 OSD servers, it may not be obvious...
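>>
>> (A rough sketch of how the snapshots can be mined, assuming the
>> pgs_brief text format with the PG id in column 1 and ACTING_PRIMARY
>> in the last column; snap_0800.txt and snap_1600.txt stand for two of
>> the saved dumps:
>>
>>     # PGs reporting a scrubbing state in two snapshots taken hours
>>     # apart; still scrubbing in both -> stuck candidates
>>     grep -w scrubbing snap_0800.txt | awk '{print $1}' | sort > s1
>>     grep -w scrubbing snap_1600.txt | awk '{print $1}' | sort > s2
>>     comm -12 s1 s2
>>
>>     # scrubbing PGs per acting primary, to spot overloaded OSDs
>>     grep -w scrubbing snap_1600.txt | awk '{print $NF}' | sort | uniq -c | sort -rn
>>
>> Column positions may differ slightly between releases.)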
>>
>> Best regards,
>>
>> Michel
>>
>> _Distribution of scrubs per day_ (the first number is the number of
>> scrubs during that day)
>>
>> (number increases on June 6 after setting osd_max_scrubs=3)
>>
>>     295  2025-05-25
>>    1578  2025-05-26
>>     300  2025-05-27
>>     392  2025-05-28
>>     578  2025-05-29
>>     707  2025-05-30
>>     611  2025-05-31
>>     819  2025-06-01
>>     679  2025-06-02
>>     724  2025-06-03
>>     698  2025-06-04
>>     726  2025-06-05
>>    1577  2025-06-06
>>    1393  2025-06-07
>>    1962  2025-06-08
>>     645  2025-06-09
>>
>> _Distribution of deep scrubs per day_
>>
>> (number increases on June 6 after setting osd_max_scrubs=3 and starts
>> to decrease again on June 8, when the number of late scrubs increases
>> again, probably because we hit the limit of 3 scrubs per OSD)
>>
>>      22  2025-05-12
>>      63  2025-05-13
>>     101  2025-05-14
>>     127  2025-05-15
>>     173  2025-05-16
>>     238  2025-05-17
>>     305  2025-05-18
>>     387  2025-05-19
>>     450  2025-05-20
>>     564  2025-05-21
>>     675  2025-05-22
>>     716  2025-05-23
>>     871  2025-05-24
>>    1071  2025-05-25
>>     801  2025-05-26
>>     188  2025-05-27
>>     292  2025-05-28
>>     335  2025-05-29
>>     409  2025-05-30
>>     371  2025-05-31
>>     514  2025-06-01
>>     440  2025-06-02
>>     504  2025-06-03
>>     478  2025-06-04
>>     546  2025-06-05
>>    1132  2025-06-06
>>    1022  2025-06-07
>>     662  2025-06-08
>>     227  2025-06-09
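>>
>> (For reference, per-day histograms like these can be pulled from the
>> pg stats JSON, roughly:
>>
>>     # adjust the jq path if your release nests pg_stats differently
>>     ceph pg dump pgs -f json 2>/dev/null \
>>         | jq -r '.pg_stats[].last_scrub_stamp' | cut -c1-10 | sort | uniq -c
>>     ceph pg dump pgs -f json 2>/dev/null \
>>         | jq -r '.pg_stats[].last_deep_scrub_stamp' | cut -c1-10 | sort | uniq -c
>>
>> assuming jq is available on the admin node.)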
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
