Hi Michel,

I think yesterday I found the culprit in my case.

After inspecting "ceph pg dump", and especially the column
"last_scrub_duration", I found that every PG without proper scrubbing
was located on one of three OSDs (and all three OSDs share the same
SSD for their DB). I marked them "out", and now, after backfill and
remapping, everything seems to be fine.
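
Roughly the commands I used (the PG id and OSD ids below are just
placeholders, not the real ones from my cluster):

    # dump PG stats and check the last_scrub_duration column
    ceph pg dump pgs | less -S

    # map a suspicious PG to the OSDs in its acting set
    ceph pg map <pgid>

    # take the suspect OSDs out so their PGs get backfilled elsewhere
    ceph osd out osd.3 osd.7 osd.11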


Only the log is still flooded with "scrub starts" messages, and I have
no clue why these OSDs are causing the problems.
I will investigate further.
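
Something like this should show which PGs keep restarting their scrubs
(the journalctl unit name is a placeholder, adjust it to however your
OSDs are deployed; the awk field position assumes log lines like the
ones Bernhard posted below):

    # count scrub/deep-scrub starts per PG over the last hour
    journalctl -u ceph-osd@3 --since "1 hour ago" \
        | grep 'scrub starts' \
        | awk '{print $(NF-2)}' | sort | uniq -c | sort -rn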


Best regards,
Gunnar

===================================


 Gunnar Bandelow
 Universitätsrechenzentrum (URZ)
 Universität Greifswald
 Felix-Hausdorff-Straße 18
 17489 Greifswald
 Germany


 Tel.: +49 3834 420 1450

--- Original Message ---
Subject: [ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep
scrubbed for 1 month
From: "Michel Jouvin"
To: ceph-users@ceph.io
Date: 21-03-2024 23:40

Hi,

Today we decided to upgrade from 18.2.0 to 18.2.2. We had no real hope
of a direct impact (nothing in the change log is related to something
similar), but at least all daemons were restarted, so we thought that
maybe this would clear the problem, at least temporarily. Unfortunately,
that has not been the case. The same PGs are still stuck, despite
continuous scrubbing/deep-scrubbing activity in the cluster...

I'm happy to provide more information if somebody tells me what to
look at...
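
For what it's worth, the stuck PGs show up under the PG_NOT_SCRUBBED /
PG_NOT_DEEP_SCRUBBED health warnings, so the easiest things for me to
share are probably along these lines (<pgid> being one of the stuck
PGs):

    ceph health detail | grep -A 25 -E 'PG_NOT_(DEEP_)?SCRUBBED'
    ceph pg <pgid> query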

Cheers,

Michel

On 21/03/2024 at 14:40, Bernhard Krieger wrote:
> Hi,
>
> I have the same issues.
> Deep scrub hasn't finished on some PGs.
>
> Using Ceph 18.2.2.
> The initially installed version was 18.0.0.
>
>
> In the logs I see a lot of scrub/deep-scrub starts:
>
> Mar 21 14:21:09 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.b deep-scrub starts
> Mar 21 14:21:10 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1a deep-scrub starts
> Mar 21 14:21:17 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1c deep-scrub starts
> Mar 21 14:21:19 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 11.1 scrub starts
> Mar 21 14:21:27 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 14.6 scrub starts
> Mar 21 14:21:30 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 10.c deep-scrub starts
> Mar 21 14:21:35 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 12.3 deep-scrub starts
> Mar 21 14:21:41 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 6.0 scrub starts
> Mar 21 14:21:44 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 8.5 deep-scrub starts
> Mar 21 14:21:45 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 5.66 deep-scrub starts
> Mar 21 14:21:49 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 5.30 deep-scrub starts
> Mar 21 14:21:50 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.b deep-scrub starts
> Mar 21 14:21:52 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1a deep-scrub starts
> Mar 21 14:21:54 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1c deep-scrub starts
> Mar 21 14:21:55 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 11.1 scrub starts
> Mar 21 14:21:58 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 14.6 scrub starts
> Mar 21 14:22:01 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 10.c deep-scrub starts
> Mar 21 14:22:04 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 12.3 scrub starts
> Mar 21 14:22:13 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 6.0 scrub starts
> Mar 21 14:22:15 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 8.5 deep-scrub starts
> Mar 21 14:22:20 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 5.66 deep-scrub starts
> Mar 21 14:22:27 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 5.30 scrub starts
> Mar 21 14:22:30 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.b deep-scrub starts
> Mar 21 14:22:32 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1a deep-scrub starts
> Mar 21 14:22:33 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1c deep-scrub starts
> Mar 21 14:22:35 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 11.1 deep-scrub starts
> Mar 21 14:22:37 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 14.6 scrub starts
> Mar 21 14:22:38 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 10.c scrub starts
> Mar 21 14:22:39 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 12.3 scrub starts
> Mar 21 14:22:41 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 6.0 deep-scrub starts
> Mar 21 14:22:43 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 8.5 deep-scrub starts
> Mar 21 14:22:46 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 5.66 deep-scrub starts
> Mar 21 14:22:49 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 5.30 scrub starts
> Mar 21 14:22:55 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.b deep-scrub starts
> Mar 21 14:22:57 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1a deep-scrub starts
> Mar 21 14:22:58 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1c deep-scrub starts
> Mar 21 14:23:03 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 11.1 deep-scrub starts
>
>
>
> The number of scrubbing/deep-scrubbing PGs changes every few seconds:
>
> [root@ceph-node10 ~]# ceph -s | grep active+clean
>    pgs:     214 active+clean
>             50 active+clean+scrubbing+deep
>             25 active+clean+scrubbing
> [root@ceph-node10 ~]# ceph -s | grep active+clean
>    pgs:     208 active+clean
>             53 active+clean+scrubbing+deep
>             28 active+clean+scrubbing
> [root@ceph-node10 ~]# ceph -s | grep active+clean
>    pgs:     208 active+clean
>             53 active+clean+scrubbing+deep
>             28 active+clean+scrubbing
> [root@ceph-node10 ~]# ceph -s | grep active+clean
>    pgs:     207 active+clean
>             54 active+clean+scrubbing+deep
>             28 active+clean+scrubbing
> [root@ceph-node10 ~]# ceph -s | grep active+clean
>    pgs:     202 active+clean
>             56 active+clean+scrubbing+deep
>             31 active+clean+scrubbing
> [root@ceph-node10 ~]# ceph -s | grep active+clean
>    pgs:     213 active+clean
>             45 active+clean+scrubbing+deep
>             31 active+clean+scrubbing
>
> ceph pg dump shows PGs which have not been deep scrubbed since January.
> Some PGs have been deep scrubbing for over 700,000 seconds (more than 8 days).
>
> [ceph: root@ceph-node10 /]# ceph pg dump pgs | grep -e 'scrubbing f'
> 5.6e 221223 0 0 0 0 927795290112 0 0 4073 3000 4073 active+clean+scrubbing+deep 2024-03-20T01:07:21.196293+0000 128383'15766927 128383:20517419 [2,4,18,16,14,21] 2 [2,4,18,16,14,21] 2 125519'12328877 2024-01-23T11:25:35.503811+0000 124844'11873951 2024-01-21T22:24:12.620693+0000 0 5 deep scrubbing for 270790s 53772 0
> 5.6c 221317 0 0 0 0 928173256704 0 0 6332 0 6332 active+clean+scrubbing+deep 2024-03-18T09:29:29.233084+0000 128382'15788196 128383:20727318 [6,9,12,14,1,4] 6 [6,9,12,14,1,4] 6 127180'14709746 2024-03-06T12:47:57.741921+0000 124817'11821502 2024-01-20T20:59:40.566384+0000 0 13452 deep scrubbing for 273519s 122803 0
> 5.6a 221325 0 0 0 0 928184565760 0 0 4649 3000 4649 active+clean+scrubbing+deep 2024-03-13T03:48:54.065125+0000 128382'16031499 128383:21221685 [13,11,1,2,9,8] 13 [13,11,1,2,9,8] 13 127181'14915404 2024-03-06T13:16:58.635982+0000 125967'12517899 2024-01-28T09:13:08.276930+0000 0 10078 deep scrubbing for 726001s 184819 0
> 5.54 221050 0 0 0 0 927036203008 0 0 4864 3000 4864 active+clean+scrubbing+deep 2024-03-18T00:17:48.086231+0000 128383'15584012 128383:20293678 [0,20,18,19,11,12] 0 [0,20,18,19,11,12] 0 127195'14651908 2024-03-07T09:22:31.078448+0000 124816'11813857 2024-01-20T16:43:15.755200+0000 0 9808 deep scrubbing for 306667s 142126 0
> 5.47 220849 0 0 0 0 926233448448 0 0 5592 0 5592 active+clean+scrubbing+deep 2024-03-12T08:10:39.413186+0000 128382'15653864 128383:20403071 [16,15,20,0,13,21] 16 [16,15,20,0,13,21] 16 127183'14600433 2024-03-06T18:21:03.057165+0000 124809'11792397 2024-01-20T05:27:07.617799+0000 0 13066 deep scrubbing for 796697s 209193 0
> dumped pgs
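>
> In case someone wants to reproduce this check, something along these
> lines should sort the PGs by their last deep scrub timestamp (untested
> as pasted here; the exact JSON layout of the dump may differ between
> releases, so adjust the jq path if needed):
>
>   ceph pg dump pgs --format json \
>     | jq -r '.pg_stats[] | [.pgid, .last_deep_scrub_stamp] | @tsv' \
>     | sort -k2 | head -20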
>
>
>
>
> regards
> Bernhard
>
>
>
>
>
>
> On 20/03/2024 21:12, Bandelow, Gunnar wrote:
>> Hi,
>>
>> I just wanted to mention that I am running a cluster with Reef
>> 18.2.1 with the same issue.
>>
>> 4 PGs started to deep scrub but have not finished since mid-February.
>> In the pg dump they are shown as scheduled for deep scrub. They
>> sometimes change their status from active+clean to
>> active+clean+scrubbing+deep and back.
>>
>> Best regards,
>> Gunnar
>>
>> =======================================================
>>
>> Gunnar Bandelow
>> Universitätsrechenzentrum (URZ)
>> Universität Greifswald
>> Felix-Hausdorff-Straße 18
>> 17489 Greifswald
>> Germany
>>
>> Tel.: +49 3834 420 1450
>>
>>
>>
>>
>> --- Original Message ---
>> Subject: [ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep
>> scrubbed for 1 month
>> From: "Michel Jouvin"
>> To: ceph-users@ceph.io
>> Date: 20-03-2024 20:00
>>
>>
>>
>>     Hi Rafael,
>>
>>     Good to know I am not alone!
>>
>>     Additional information ~6h after the OSD restart: of the 20 PGs
>>     impacted, 2 have been processed successfully... I don't have a
>>     clear picture of how Ceph prioritizes the scrub of one PG over
>>     another; I had thought that the oldest/expired scrubs are taken
>>     first, but that may not be the case. Anyway, I have seen a very
>>     significant decrease in scrub activity this afternoon, and the
>>     cluster is not loaded at all (almost no users yet)...
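>>
>>     If it is of any use, the scrub scheduling knobs can be read per
>>     daemon type with "ceph config get"; these are the ones I would
>>     compare first:
>>
>>     ceph config get osd osd_max_scrubs
>>     ceph config get osd osd_scrub_min_interval
>>     ceph config get osd osd_scrub_max_interval
>>     ceph config get osd osd_deep_scrub_interval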
>>
>>     Michel
>>
>>     On 20/03/2024 at 17:55, quag...@bol.com.br wrote:
>>     > Hi,
>>     >      I upgraded a cluster 2 weeks ago here. The situation is
>>     > the same as Michel's.
>>     >      A lot of PGs are not scrubbed/deep-scrubbed.
>>     >
>>     > Rafael.
>>     >
>

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
