Hi Michel,

You're probably facing this [1].
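
If you want to double-check which scrub settings your OSDs are actually
running with, something like this should show it (a quick sketch using the
scrub options documented for Reef/Squid; adjust the names to your release):

  # value the mon config database would hand to the OSDs
  # (falls back to the built-in default when nothing is set)
  ceph config get osd osd_max_scrubs
  ceph config get osd osd_shallow_scrub_chunk_min
  ceph config get osd osd_shallow_scrub_chunk_max
  ceph config get osd osd_scrub_max_interval
  ceph config get osd osd_deep_scrub_interval

  # scrub-related values explicitly overridden in the cluster config database
  ceph config dump | grep -i scrub

  # drop an override so the release default applies again, e.g.
  ceph config rm osd osd_max_scrubs
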
Best regards,
Frédéric.

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/FWDE4FSNPLKG4SWKT6IMBPYUXJK6VE63/

----- On 9 Jun 25, at 11:09, Michel Jouvin michel.jou...@ijclab.in2p3.fr wrote:

> Apologies, I realize that I mentioned new deep scrub parameters in
> Squid where it should have been "new shallow scrub parameters". But it
> doesn't change the reasoning.
>
> Michel
>
> On 09/06/2025 at 11:01, Michel Jouvin wrote:
>>
>> Hi,
>>
>> We upgraded one of our production clusters (480 OSDs, 13.5K PGs, most
>> pools EC 9+6) from 18.2.7 to 19.2.2 on May 26. It was healthy when we
>> upgraded it and remained so until last Thursday (June 5), when, doing
>> a `ceph -s`, I saw that 1 deep scrub and ~1000 scrubs were late:
>>
>> 1 pgs not deep-scrubbed in time
>> 1168 pgs not scrubbed in time
>>
>> Looking again Friday morning (~16 hours later), these numbers had
>> increased a lot:
>>
>> 27 pgs not deep-scrubbed in time
>> 3252 pgs not scrubbed in time
>>
>> Checking our configuration on Friday, we found that osd_max_scrubs
>> was set to 1 instead of using the new default since Reef, which is 3
>> (probably a leftover of a config change after a problem 18 months
>> ago). We unset the specific value and this led to a reduction of
>> these numbers in the next 24 hours (~2700 scrubs late), but since
>> then (2 days) it initially remained stable and is now increasing
>> slowly. This morning the situation is:
>>
>> 294 pgs not deep-scrubbed in time
>> 3013 pgs not scrubbed in time
>>
>> My guess is that the real issue is the late scrubs, which result in
>> many OSDs reaching the limit of 3 concurrent scrubs, with the
>> consequence that some deep scrubs cannot run either (I have in mind
>> that the limit applies both to shallow scrubs and deep scrubs, am I
>> right?). I checked the main cluster logs and didn't find any error or
>> warning related to OSDs, like slow ops, slow requests... The only
>> thing we have spotted through our monitoring system is a dramatic
>> decrease (~75%) of IOPS on each OSD server right after the 19.2.2
>> upgrade, but it is not necessarily the sign of a problem. I guess it
>> may in particular be a consequence of the new deep scrub parameters,
>> osd_shallow_scrub_chunk_min/max, which are probably intended to
>> reduce the deep scrub IOPS load. The release notes for Squid also
>> mention a change in osd_op_num_shards_hdd and
>> osd_op_num_threads_per_shard_hdd; I don't know if they may also have
>> an impact.
>>
>> Up to now, no user has reported any issue, so it seems to be a
>> problem only with scrubs. I'm wondering where to start looking for an
>> issue, or whether anything related to 19.2.2 is already known. We
>> increased the deep scrub interval from 10 to 14 days a few days
>> before the upgrade (we saw that there was permanently 1 deep scrub
>> late, a different PG all the time) and kept the standard 7-day
>> interval for scrubs. Looking at the number of scrubs and deep scrubs
>> per day, it doesn't look weird (see below).
>>
>> I guess that if we restart all OSDs we'll clear the problem, but we'd
>> like to understand what happened and be sure it is not something
>> related to Squid before upgrading our other production cluster. Any
>> hint/advice will be highly appreciated. I took a snapshot of `ceph pg
>> dump pgs_brief` regularly and I'll try to identify if there are some
>> stuck scrubs and which OSDs are involved, but with 500 OSDs and 18
>> OSD servers, it may not be obvious...
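>>
>> For instance, something along these lines against each snapshot
>> could help (a rough sketch; the pgs_brief column positions and state
>> names may differ between releases, so double-check them first):
>>
>>   # PGs currently scrubbing or deep scrubbing: PG id, state, acting primary
>>   ceph pg dump pgs_brief 2>/dev/null | awk '$2 ~ /scrubbing/ {print $1, $2, $6}'
>>
>>   # how many of those scrubs each acting-primary OSD is running right now
>>   ceph pg dump pgs_brief 2>/dev/null | awk '$2 ~ /scrubbing/ {print $6}' | sort -n | uniq -c | sort -rn
>>
>> A PG that shows up in a scrubbing state in snapshot after snapshot,
>> always with the same acting primary, would be a good candidate for a
>> stuck scrub.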
>>
>> Best regards,
>>
>> Michel
>>
>> _Distribution of scrubs_ (first number is the number of scrubs
>> during the day)
>>
>> (number increases on June 6 after setting osd_max_scrubs=3)
>>
>>  295 2025-05-25
>> 1578 2025-05-26
>>  300 2025-05-27
>>  392 2025-05-28
>>  578 2025-05-29
>>  707 2025-05-30
>>  611 2025-05-31
>>  819 2025-06-01
>>  679 2025-06-02
>>  724 2025-06-03
>>  698 2025-06-04
>>  726 2025-06-05
>> 1577 2025-06-06
>> 1393 2025-06-07
>> 1962 2025-06-08
>>  645 2025-06-09
>>
>> _Distribution of deep scrubs per day_
>>
>> (number increases on June 6 after setting osd_max_scrubs=3 and starts
>> to decrease again on June 8, when the number of late scrubs increases
>> again, probably because we hit the limit of 3 scrubs per OSD)
>>
>>   22 2025-05-12
>>   63 2025-05-13
>>  101 2025-05-14
>>  127 2025-05-15
>>  173 2025-05-16
>>  238 2025-05-17
>>  305 2025-05-18
>>  387 2025-05-19
>>  450 2025-05-20
>>  564 2025-05-21
>>  675 2025-05-22
>>  716 2025-05-23
>>  871 2025-05-24
>> 1071 2025-05-25
>>  801 2025-05-26
>>  188 2025-05-27
>>  292 2025-05-28
>>  335 2025-05-29
>>  409 2025-05-30
>>  371 2025-05-31
>>  514 2025-06-01
>>  440 2025-06-02
>>  504 2025-06-03
>>  478 2025-06-04
>>  546 2025-06-05
>> 1132 2025-06-06
>> 1022 2025-06-07
>>  662 2025-06-08
>>  227 2025-06-09
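>>
>> (For reference, per-day counts like the above can be pulled from the
>> PG stats with something like the following; a sketch that assumes the
>> pg_map.pg_stats layout of `ceph pg dump --format json` and timestamps
>> starting with YYYY-MM-DD, both of which may vary between releases:)
>>
>>   # scrubs per day, from the last_scrub_stamp of every PG
>>   ceph pg dump --format json 2>/dev/null \
>>     | jq -r '.pg_map.pg_stats[].last_scrub_stamp' | cut -c1-10 | sort | uniq -c
>>
>>   # deep scrubs per day, from last_deep_scrub_stamp
>>   ceph pg dump --format json 2>/dev/null \
>>     | jq -r '.pg_map.pg_stats[].last_deep_scrub_stamp' | cut -c1-10 | sort | uniq -c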