On Wed, Jun 25, 2025, 3:54 PM Eugen Block <ebl...@nde.ag> wrote:

> Actually, this is not the result of an upgrade but of two disk
> failures and the resulting backfill. The scrub performance is alright.
> :-)
>
> Quoting Lukasz Borek <luk...@borek.org.pl>:
>
> > Looks like I'm not alone in seeing a drop-off in scrub performance
> > after the last update? :)
> >
> > Łukasz Borek
> > luk...@borek.org.pl
> >
> > On Wed, 25 Jun 2025 at 11:58, Eugen Block <ebl...@nde.ag> wrote:
> >
> >> Thanks Frédéric.
> >> The customer found the sticky flag, too. I must admit, I haven't used
> >> the mute command too often yet, usually I try to get to the bottom of
> >> a warning and rather fix the underlying issue. :-D
> >> So the mute clears if the number increases:
> >>
> >> >> if (q->second.count > p->second.count)
> >>
> >> That makes sense, and I agree that an admin might want to know about
> >> that. Then this is resolved for me, thanks for the quick response!
> >>
> >> Eugen
> >>
> >> Quoting Frédéric Nass <frederic.n...@univ-lorraine.fr>:
> >>
> >> > Hi Eugen,
> >> >
> >> > Reading the code, the muted alert was cleared because it was
> >> > non-sticky and the number of affected PGs increased (which was
> >> > decided to be a good reason to alert the admin).
> >> >
> >> > Have you tried to use the --sticky argument on the 'ceph health
> >> > mute' command?
> >> >
> >> > Cheers,
> >> > Frédéric.
> >> >
> >> > ----- On 25 June 2025, at 9:21, Eugen Block ebl...@nde.ag wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I'm trying to understand the "ceph health mute" behavior. In this
> >> >> case, I'm referring to the warning PG_NOT_DEEP_SCRUBBED. If you mute
> >> >> it for a week and the cluster continues deep-scrubbing, the "mute"
> >> >> will clear at some point although there are still "PGs not
> >> >> deep-scrubbed in time" warnings. I could verify this in a tiny lab
> >> >> with 19.2.2: after setting osd_deep_scrub_interval to 10 minutes,
> >> >> the warning pops up. Then I mute that warning, issue deep-scrubs
> >> >> for several pools, and at some point I see this in the mon log:
> >> >>
> >> >> Jun 25 08:53:28 host1 ceph-mon[823315]: log_channel(cluster) log [WRN] : Health check update: 61 pgs not deep-scrubbed in time (PG_NOT_DEEP_SCRUBBED)
> >> >> Jun 25 08:53:28 host1 ceph-mon[823315]: Health check update: 61 pgs not deep-scrubbed in time (PG_NOT_DEEP_SCRUBBED)
> >> >> Jun 25 08:53:29 host1 ceph-mon[823315]: pgmap v164176: 389 pgs: 389 active+clean; 428 MiB data, 57 GiB used, 279 GiB / 336 GiB avail
> >> >> ...
> >> >> Jun 25 08:53:31 host1 ceph-mon[823315]: log_channel(cluster) log [INF] : Health alert mute PG_NOT_DEEP_SCRUBBED cleared (count increased from 60 to 61)
> >> >> Jun 25 08:53:31 host1 ceph-mon[823315]: Health alert mute PG_NOT_DEEP_SCRUBBED cleared (count increased from 60 to 61)
> >> >>
> >> >> I don't really understand what the code does [0] (I'm not a dev):
> >> >>
> >> >> ---snip---
> >> >>     if (!p->second.sticky) {
> >> >>       auto q = all.checks.find(p->first);
> >> >>       if (q == all.checks.end()) {
> >> >>         mon.clog->info() << "Health alert mute " << p->first
> >> >>                          << " cleared (health alert cleared)";
> >> >>         p = pending_mutes.erase(p);
> >> >>         changed = true;
> >> >>         continue;
> >> >>       }
> >> >>       if (p->second.count) {
> >> >>         // count-based mute
> >> >>         if (q->second.count > p->second.count) {
> >> >>           mon.clog->info() << "Health alert mute " << p->first
> >> >>                            << " cleared (count increased from "
> >> >>                            << p->second.count
> >> >>                            << " to " << q->second.count << ")";
> >> >>           p = pending_mutes.erase(p);
> >> >>           changed = true;
> >> >>           continue;
> >> >> ---snip---
> >> >>
> >> >> Could anyone shed some light on what I'm not understanding? Why
> >> >> would the mute clear although there are still PGs not deep-scrubbed?
> >> >>
> >> >> Thanks!
> >> >> Eugen
> >> >>
> >> >> [0]
> >> >> https://github.com/ceph/ceph/blob/d78ffd1247d6cef5cbd829e77204185dc0d3a8ba/src/mon/HealthMonitor.cc#L431
> >> >>
> >> >> _______________________________________________
> >> >> ceph-users mailing list -- ceph-users@ceph.io
> >> >> To unsubscribe send an email to ceph-users-le...@ceph.io

If you don't pass the --sticky flag to 'ceph health mute', the mute is
dropped as soon as the condition clears (e.g. once all PGs have been
deep-scrubbed within the configured interval) or, as the code quoted
above shows, as soon as the number of affected PGs rises above the
count recorded when the mute was set. If you pass --sticky, the alert
stays muted regardless of whether the condition clears or gets worse,
until you explicitly unmute it.

Cheers,
Tyler
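
P.S. In case it helps to see the behaviour in isolation, here is a minimal Python sketch of the check quoted above. It only models the quoted non-sticky branch and is not Ceph's actual implementation: the names Mute and prune_mutes are made up for illustration, and TTL expiry and anything outside the quoted snippet are left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Mute:
    """Hypothetical stand-in for a health mute entry (not Ceph's real type)."""
    sticky: bool = False
    count: int = 0  # affected-item count when the mute was set; 0 = not count-based


def prune_mutes(mutes: Dict[str, Mute], checks: Dict[str, int]) -> List[str]:
    """Drop non-sticky mutes as the quoted snippet does; return the log messages.

    `checks` maps each currently active health code to its affected-item count.
    """
    messages = []
    for code in list(mutes):  # copy the keys so entries can be deleted in the loop
        mute = mutes[code]
        if mute.sticky:
            continue  # the quoted branch only runs for non-sticky mutes
        if code not in checks:
            # The alert itself is gone, so the mute goes with it.
            messages.append(f"Health alert mute {code} cleared (health alert cleared)")
            del mutes[code]
            continue
        if mute.count and checks[code] > mute.count:
            # Count-based mute: the situation got worse than when it was muted.
            messages.append(
                f"Health alert mute {code} cleared "
                f"(count increased from {mute.count} to {checks[code]})"
            )
            del mutes[code]
    return messages
```

With a non-sticky mute recorded at 60 PGs, a later count of 61 removes the mute, which matches the log lines in the original report; the same call with sticky=True leaves the mute in place. On a real cluster that would correspond to something like 'ceph health mute PG_NOT_DEEP_SCRUBBED 7d --sticky', where the 7d duration is just an example value.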