Actually, this is not the result of an upgrade but of two disk
failures and the resulting backfill. The scrub performance is alright.
:-)
Quoting Lukasz Borek <luk...@borek.org.pl>:
Looks like I'm not alone in seeing a drop in scrub performance after the last update? :)
Łukasz Borek
luk...@borek.org.pl
On Wed, 25 Jun 2025 at 11:58, Eugen Block <ebl...@nde.ag> wrote:
Thanks Frédéric.
The customer found the sticky flag, too. I must admit, I haven't used
the mute command very often yet; usually I try to get to the bottom of
a warning and fix the underlying issue instead. :-D
So the mute clears if the number increases:
>> if (q->second.count > p->second.count)
That makes sense, and I agree that an admin might want to know about
that. Then this is resolved for me, thanks for the quick response!
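For anyone following along, the clearing behavior can be sketched like this (a minimal, illustrative Python sketch of the logic in the HealthMonitor.cc excerpt below; the function and dict layout are made up for illustration and are not Ceph's actual data structures):

```python
# Sketch of Ceph's mute-clearing rule for count-based mutes:
# a non-sticky mute clears when its health check disappears, or when
# the number of affected items (here, PGs) grows past the count
# recorded when the mute was set. Sticky mutes survive both.

def should_clear_mute(mute, current_checks):
    """Return a reason string if the mute should clear, else None."""
    if mute["sticky"]:
        return None  # sticky mutes are not cleared by count changes
    check = current_checks.get(mute["code"])
    if check is None:
        return "health alert cleared"
    if mute["count"] and check["count"] > mute["count"]:
        return f"count increased from {mute['count']} to {check['count']}"
    return None

# The scenario from the mon log: mute recorded at 60 PGs, count rises to 61.
mute = {"code": "PG_NOT_DEEP_SCRUBBED", "sticky": False, "count": 60}
checks = {"PG_NOT_DEEP_SCRUBBED": {"count": 61}}
print(should_clear_mute(mute, checks))  # count increased from 60 to 61
```

So with `ceph health mute PG_NOT_DEEP_SCRUBBED <duration> --sticky`, the mute should hold for the full duration regardless of the PG count.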
Eugen
Quoting Frédéric Nass <frederic.n...@univ-lorraine.fr>:
> Hi Eugen,
>
> Reading the code, the muted alert was cleared because it was
> non-sticky and the number of affected PGs increased (which was
> decided to be a good reason to alert the admin).
>
> Have you tried to use the --sticky argument on the 'ceph health
> mute' command?
>
> Cheers,
> Frédéric.
>
> ----- On Jun 25, 2025, at 9:21 AM, Eugen Block ebl...@nde.ag wrote:
>
>> Hi,
>>
>> I'm trying to understand the "ceph health mute" behavior. In this
>> case, I'm referring to the warning PG_NOT_DEEP_SCRUBBED. If you mute
>> it for a week and the cluster continues deep-scrubbing, the mute
>> will clear at some point even though the "PGs not deep-scrubbed in
>> time" warning is still present. I could verify this in a tiny lab
>> with 19.2.2: after setting osd_deep_scrub_interval to 10 minutes,
>> the warning pops up. Then I mute that warning, issue deep-scrubs for
>> several pools, and at some point I see this in the mon log:
>>
>> Jun 25 08:53:28 host1 ceph-mon[823315]: log_channel(cluster) log [WRN]
>> : Health check update: 61 pgs not deep-scrubbed in time
>> (PG_NOT_DEEP_SCRUBBED)
>> Jun 25 08:53:28 host1 ceph-mon[823315]: Health check update: 61 pgs
>> not deep-scrubbed in time (PG_NOT_DEEP_SCRUBBED)
>> Jun 25 08:53:29 host1 ceph-mon[823315]: pgmap v164176: 389 pgs: 389
>> active+clean; 428 MiB data, 57 GiB used, 279 GiB / 336 GiB avail
>> ...
>> Jun 25 08:53:31 host1 ceph-mon[823315]: log_channel(cluster) log [INF]
>> : Health alert mute PG_NOT_DEEP_SCRUBBED cleared (count increased from
>> 60 to 61)
>> Jun 25 08:53:31 host1 ceph-mon[823315]: Health alert mute
>> PG_NOT_DEEP_SCRUBBED cleared (count increased from 60 to 61)
>>
>>
>> I don't really understand what the code does [0] (I'm not a dev):
>>
>> ---snip---
>> if (!p->second.sticky) {
>>   auto q = all.checks.find(p->first);
>>   if (q == all.checks.end()) {
>>     mon.clog->info() << "Health alert mute " << p->first
>>                      << " cleared (health alert cleared)";
>>     p = pending_mutes.erase(p);
>>     changed = true;
>>     continue;
>>   }
>>   if (p->second.count) {
>>     // count-based mute
>>     if (q->second.count > p->second.count) {
>>       mon.clog->info() << "Health alert mute " << p->first
>>                        << " cleared (count increased from " << p->second.count
>>                        << " to " << q->second.count << ")";
>>       p = pending_mutes.erase(p);
>>       changed = true;
>>       continue;
>> ---snip---
>>
>> Could anyone shed some light on what I'm not understanding? Why would
>> the mute clear although there are still PGs not deep-scrubbed in time?
>>
>> Thanks!
>> Eugen
>>
>> [0]
>>
https://github.com/ceph/ceph/blob/d78ffd1247d6cef5cbd829e77204185dc0d3a8ba/src/mon/HealthMonitor.cc#L431
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io