[ceph-users] Re: ceph health mute behavior

Frédéric Nass Wed, 25 Jun 2025 02:23:06 -0700

Hi Eugen,

From the code, the mute was cleared because it was non-sticky and the 
number of affected PGs increased (from 60 to 61 in your log). An increase 
was deliberately treated as a good enough reason to alert the admin again.


Have you tried passing the --sticky argument to the 'ceph health mute' command?
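
To make the logic easier to follow without reading the monitor source, here 
is a minimal standalone sketch of how a mute is evaluated. It is not the 
actual Ceph code: the Mute and Check structs, evaluate_mute() and 
replay_scenario() are hypothetical names, and the starting figure of 100 PGs 
is made up. The step that lowers the remembered count is my reading of the 
rest of that function and is not in the snippet you quoted, so treat it as 
an assumption.

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified, hypothetical stand-ins for the monitor's mute and check records.
struct Mute {
  bool sticky;  // true if the mute was created with --sticky
  long count;   // affected-item count remembered by the mute (0 = not count-based)
};
struct Check {
  long count;   // current number of affected items, e.g. PGs
};

// Evaluates one mute against the current health checks.
// Returns the reason the mute is dropped, or "" if it stays in place.
std::string evaluate_mute(const std::string& code, Mute& mute,
                          const std::map<std::string, Check>& checks) {
  if (mute.sticky) return "";  // sticky mutes skip every test below
  auto q = checks.find(code);
  if (q == checks.end()) return "health alert cleared";
  if (mute.count) {
    // count-based mute: any increase over the remembered count clears it
    if (q->second.count > mute.count) return "count increased";
    // assumed ratchet-down: remember the lower count as the new baseline
    if (q->second.count < mute.count) mute.count = q->second.count;
  }
  return "";
}

// Replays the sequence from the mon log with a made-up starting count.
void replay_scenario() {
  const std::string code = "PG_NOT_DEEP_SCRUBBED";
  std::map<std::string, Check> checks{{code, Check{100}}};
  Mute mute{false, 100};  // muted while 100 PGs were overdue (hypothetical)

  checks[code].count = 60;  // deep-scrubs bring the number down
  assert(evaluate_mute(code, mute, checks).empty());
  assert(mute.count == 60);  // baseline follows the count down

  checks[code].count = 61;  // one more PG crosses the interval
  assert(evaluate_mute(code, mute, checks) == "count increased");

  Mute sticky{true, 60};  // the same situation with a sticky mute
  assert(evaluate_mute(code, sticky, checks).empty());
}
```

If that reading is right, the mute only has to see the count go up by one 
from its lowest point, which matches "count increased from 60 to 61" even 
though the overall number of overdue PGs had been going down.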

Cheers,
Frédéric.

----- On 25 Jun 2025, at 9:21, Eugen Block ebl...@nde.ag wrote:

> Hi,
> 
> I'm trying to understand the "ceph health mute" behavior. In this
> case, I'm referring to the warning PG_NOT_DEEP_SCRUBBED. If you mute
> it for a week and the cluster continues deep-scrubbing, the mute
> will clear at some point even though some PGs are still not
> deep-scrubbed in time. I could verify this in a tiny lab with
> 19.2.2: after setting osd_deep_scrub_interval to 10 minutes, the
> warning pops up. Then I mute that warning, issue deep-scrubs for
> several pools, and at some point I see this in the mon log:
> 
> Jun 25 08:53:28 host1 ceph-mon[823315]: log_channel(cluster) log [WRN]
> : Health check update: 61 pgs not deep-scrubbed in time
> (PG_NOT_DEEP_SCRUBBED)
> Jun 25 08:53:28 host1 ceph-mon[823315]: Health check update: 61 pgs
> not deep-scrubbed in time (PG_NOT_DEEP_SCRUBBED)
> Jun 25 08:53:29 host1 ceph-mon[823315]: pgmap v164176: 389 pgs: 389
> active+clean; 428 MiB data, 57 GiB used, 279 GiB / 336 GiB avail
> ...
> Jun 25 08:53:31 host1 ceph-mon[823315]: log_channel(cluster) log [INF]
> : Health alert mute PG_NOT_DEEP_SCRUBBED cleared (count increased from
> 60 to 61)
> Jun 25 08:53:31 host1 ceph-mon[823315]: Health alert mute
> PG_NOT_DEEP_SCRUBBED cleared (count increased from 60 to 61)
> 
> 
> I don't really understand what the code does [0] (I'm not a dev):
> 
> ---snip---
>     if (!p->second.sticky) {
>       auto q = all.checks.find(p->first);
>       if (q == all.checks.end()) {
>         mon.clog->info() << "Health alert mute " << p->first
>                          << " cleared (health alert cleared)";
>         p = pending_mutes.erase(p);
>         changed = true;
>         continue;
>       }
>       if (p->second.count) {
>         // count-based mute
>         if (q->second.count > p->second.count) {
>           mon.clog->info() << "Health alert mute " << p->first
>                            << " cleared (count increased from "
>                            << p->second.count
>                            << " to " << q->second.count << ")";
>           p = pending_mutes.erase(p);
>           changed = true;
>           continue;
> ---snip---
> 
> Could anyone shed some light on what I'm not understanding? Why would the
> mute clear although there are still PGs not deep-scrubbed?
> 
> Thanks!
> Eugen
> 
> [0]
> https://github.com/ceph/ceph/blob/d78ffd1247d6cef5cbd829e77204185dc0d3a8ba/src/mon/HealthMonitor.cc#L431
> 
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io