On Wed, Jun 25, 2025, 3:54 PM Eugen Block <ebl...@nde.ag> wrote:

> Actually, this is not the result of an upgrade but of two disk
> failures and the resulting backfill. The scrub performance is alright.
> :-)
>
>
> Quoting Lukasz Borek <luk...@borek.org.pl>:
>
> > Looks like I'm not alone in seeing a drop-off in scrub performance
> > after the last update? :)
> >
> >
> > Łukasz Borek
> > luk...@borek.org.pl
> >
> >
> > On Wed, 25 Jun 2025 at 11:58, Eugen Block <ebl...@nde.ag> wrote:
> >
> >> Thanks Frédéric.
> >> The customer found the sticky flag, too. I must admit, I haven't used
> >> the mute command too often yet, usually I try to get to the bottom of
> >> a warning and rather fix the underlying issue. :-D
> >> So the mute clears if the number increases:
> >>
> >> >>      if (q->second.count > p->second.count)
> >>
> >> That makes sense, and I agree that an admin might want to know about
> >> that. Then this is resolved for me, thanks for the quick response!
> >>
> >> Eugen
> >>
> >> Quoting Frédéric Nass <frederic.n...@univ-lorraine.fr>:
> >>
> >> > Hi Eugen,
> >> >
> >> > Reading the code, the muted alert was cleared because it was
> >> > non-sticky and the number of affected PGs increased (which was
> >> > decided to be a good reason to alert the admin).
> >> >
> >> > Have you tried to use the --sticky argument on the 'ceph health
> >> > mute' command?
> >> >
> >> > Cheers,
> >> > Frédéric.
> >> >
> >> > ----- On 25 Jun 25, at 9:21, Eugen Block ebl...@nde.ag wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I'm trying to understand the "ceph health mute" behavior. In this
> >> >> case, I'm referring to the warning PG_NOT_DEEP_SCRUBBED. If you mute
> >> >> it for a week and the cluster continues deep-scrubbing, the "mute"
> >> >> will clear at some point although there are still PGs not
> >> >> deep-scrubbed in time warnings. I could verify this in a tiny lab with
> >> >> 19.2.2, setting osd_deep_scrub_interval to 10 minutes, the warning
> >> >> pops up. Then I mute that warning, issue deep-scrubs for several
> >> >> pools, and at some point I see this in the mon log:
> >> >>
> >> >> Jun 25 08:53:28 host1 ceph-mon[823315]: log_channel(cluster) log [WRN]
> >> >> : Health check update: 61 pgs not deep-scrubbed in time
> >> >> (PG_NOT_DEEP_SCRUBBED)
> >> >> Jun 25 08:53:28 host1 ceph-mon[823315]: Health check update: 61 pgs
> >> >> not deep-scrubbed in time (PG_NOT_DEEP_SCRUBBED)
> >> >> Jun 25 08:53:29 host1 ceph-mon[823315]: pgmap v164176: 389 pgs: 389
> >> >> active+clean; 428 MiB data, 57 GiB used, 279 GiB / 336 GiB avail
> >> >> ...
> >> >> Jun 25 08:53:31 host1 ceph-mon[823315]: log_channel(cluster) log [INF]
> >> >> : Health alert mute PG_NOT_DEEP_SCRUBBED cleared (count increased from
> >> >> 60 to 61)
> >> >> Jun 25 08:53:31 host1 ceph-mon[823315]: Health alert mute
> >> >> PG_NOT_DEEP_SCRUBBED cleared (count increased from 60 to 61)
> >> >>
> >> >>
> >> >> I don't really understand what the code does [0] (I'm not a dev):
> >> >>
> >> >> ---snip---
> >> >>     if (!p->second.sticky) {
> >> >>       auto q = all.checks.find(p->first);
> >> >>       if (q == all.checks.end()) {
> >> >>         mon.clog->info() << "Health alert mute " << p->first
> >> >>                          << " cleared (health alert cleared)";
> >> >>         p = pending_mutes.erase(p);
> >> >>         changed = true;
> >> >>         continue;
> >> >>       }
> >> >>       if (p->second.count) {
> >> >>         // count-based mute
> >> >>         if (q->second.count > p->second.count) {
> >> >>           mon.clog->info() << "Health alert mute " << p->first
> >> >>                            << " cleared (count increased from " << p->second.count
> >> >>                            << " to " << q->second.count << ")";
> >> >>           p = pending_mutes.erase(p);
> >> >>           changed = true;
> >> >>           continue;
> >> >> ---snip---
> >> >>
> >> >> Could anyone shed some light on what I'm not understanding? Why
> >> >> would the mute clear although there are still PGs not deep-scrubbed?
> >> >>
> >> >> Thanks!
> >> >> Eugen
> >> >>
> >> >> [0]
> >> >> https://github.com/ceph/ceph/blob/d78ffd1247d6cef5cbd829e77204185dc0d3a8ba/src/mon/HealthMonitor.cc#L431
> >> >>
> >> >> _______________________________________________
> >> >> ceph-users mailing list -- ceph-users@ceph.io
> >> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >>
> >>
>
>


If you don't pass the --sticky flag to 'ceph health mute', the mute is
cleared as soon as the condition changes: when the alert itself clears
(e.g. once all PGs have been deep-scrubbed in time), or, as in your log,
when the count of affected PGs increases. If you pass --sticky, the alert
stays muted until the mute's TTL expires or you explicitly unmute it,
regardless of whether the condition clears or worsens in the meantime.
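For anyone who, like Eugen, doesn't want to read the C++: here's a minimal
Python sketch of the count-based clearing logic in the HealthMonitor.cc
snippet quoted upthread. The function name and dict layout are illustrative
only, not Ceph's actual data structures.

```python
# Sketch (not Ceph code) of how non-sticky, count-based mutes are cleared:
# a mute records the count of affected PGs at mute time and is dropped
# once the current count exceeds it, re-alerting the admin.

def surviving_mutes(mutes, current_checks):
    """Return the mutes that stay in effect given the current health checks.

    mutes: dict of check name -> {"sticky": bool, "count": int}
    current_checks: dict of check name -> current count of affected PGs
    """
    kept = {}
    for name, mute in mutes.items():
        if not mute["sticky"]:
            if name not in current_checks:
                continue  # health alert cleared -> mute cleared
            if mute["count"] and current_checks[name] > mute["count"]:
                continue  # count increased -> mute cleared
        kept[name] = mute
    return kept
```

With this model, a non-sticky mute taken at 60 PGs survives while the count
stays at 60 but is dropped at 61, which matches the mon log lines above.
A sticky mute (e.g. 'ceph health mute PG_NOT_DEEP_SCRUBBED 1w --sticky')
survives both the count increase and the alert clearing.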

Cheers,
Tyler
