HPE rebadges drives from other manufacturers. A quick search suggests that this SKU is fulfilled at least partly by Kioxia, so a missing power-loss protection (PLP) capability is unlikely to be the problem here.
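For anyone who wants to verify this on their own hardware, here is a minimal sketch of how the underlying vendor/product strings can be read from an OSD host (the device path is hypothetical, and PLP itself usually has to be confirmed against the manufacturer's datasheet rather than from SMART output):

    # Map Ceph daemons to physical devices to locate the right host and device path
    ceph device ls

    # Print the identity strings (vendor, product, serial, firmware) behind the HPE label
    smartctl -i /dev/sdX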
> On Apr 17, 2025, at 11:39 AM, Christophe DIARRA <christophe.dia...@idris.fr> wrote:
>
> Hello David,
>
> The SSD model is VO007680JWZJL.
>
> I will delay the 'ceph tell mds.cfs_irods_test:0 damage rm 241447932' for the moment. If no other solution is found I will be obliged to use this command.
>
> I found 'dentry' in the logs when the cephfs cluster started:
>
>> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.cfs_irods_test.mon-02.awuygq Updating MDS map to version 15613 from mon.2
>> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 handle_mds_map i am now mds.0.15612
>> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 handle_mds_map state change up:starting --> up:active
>> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 active_start
>> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.cache.den(0x1 testdir2) loaded already corrupt dentry: [dentry #0x1/testdir2 [2,head] rep@0.0 NULL (dversion lock) pv=0 v=4442 ino=(nil) state=0 0x5617e18c8280]
>> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.cache.den(0x1 testdir1) loaded already corrupt dentry: [dentry #0x1/testdir1 [2,head] rep@0.0 NULL (dversion lock) pv=0 v=4442 ino=(nil) state=0 0x5617e18c8500]
>> Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
>> Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check failed: 1 filesystem is online with fewer MDS than max_mds (MDS_UP_LESS_THAN_MAX)
>> Apr 16 17:29:53 mon-02 ceph-mon[2288]: from='client.? xx.xx.xx.8:0/3820885518' entity='client.admin' cmd='[{"prefix": "fs set", "fs_name": "cfs_irods_test", "var": "down", "val": "false"}]': finished
>> Apr 16 17:29:53 mon-02 ceph-mon[2288]: daemon mds.cfs_irods_test.mon-02.awuygq assigned to filesystem cfs_irods_test as rank 0 (now has 1 ranks)
>> Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
>> Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check cleared: MDS_UP_LESS_THAN_MAX (was: 1 filesystem is online with fewer MDS than max_mds)
>> Apr 16 17:29:53 mon-02 ceph-mon[2288]: daemon mds.cfs_irods_test.mon-02.awuygq is now active in filesystem cfs_irods_test as rank 0
>> Apr 16 17:29:54 mon-02 ceph-mgr[2444]: log_channel(cluster) log [DBG] : pgmap v1721: 4353 pgs: 4346 active+clean, 7 active+clean+scrubbing+deep; 3.9 TiB data, 417 TiB used, 6.4 PiB / 6.8 PiB avail; 1.4 KiB/s rd, 1 op/s
>
> If you need more extracts from the log file, please let me know.
>
> Thanks for your help,
>
> Christophe
>
> On 17/04/2025 13:39, David C. wrote:
>> If I'm not mistaken, this is a fairly rare situation.
>>
>> The fact that it's the result of a power outage makes me think of a bad SSD (like "S... Pro").
>>
>> Does a grep of the dentry id in the MDS logs return anything?
>> Maybe there is some interesting information around this grep.
>>
>> In the heat of the moment, I have no other idea than to delete the dentry:
>>
>> ceph tell mds.cfs_irods_test:0 damage rm 241447932
>>
>> However, in production, this results in the content (of dir /testdir[12]) being abandoned.
>>
>> On Thu, Apr 17, 2025 at 12:44, Christophe DIARRA <christophe.dia...@idris.fr> wrote:
>>
>> Hello David,
>>
>> Thank you for the tip about the scrubbing.
>> I have tried the commands found in the documentation but they seem to have no effect:
>>
>> [root@mon-01 ~]# ceph tell mds.cfs_irods_test:0 scrub start / recursive,repair,force
>> 2025-04-17T12:07:20.958+0200 7fd4157fa640  0 client.86301 ms_handle_reset on v2:130.84.80.10:6800/3218663047
>> 2025-04-17T12:07:20.979+0200 7fd4157fa640  0 client.86307 ms_handle_reset on v2:130.84.80.10:6800/3218663047
>> {
>>     "return_code": 0,
>>     "scrub_tag": "733b1c6d-a418-4c83-bc8e-b28b556e970c",
>>     "mode": "asynchronous"
>> }
>>
>> [root@mon-01 ~]# ceph tell mds.cfs_irods_test:0 scrub status
>> 2025-04-17T12:07:30.734+0200 7f26cdffb640  0 client.86319 ms_handle_reset on v2:130.84.80.10:6800/3218663047
>> 2025-04-17T12:07:30.753+0200 7f26cdffb640  0 client.86325 ms_handle_reset on v2:130.84.80.10:6800/3218663047
>> {
>>     "status": "no active scrubs running",
>>     "scrubs": {}
>> }
>>
>> [root@mon-01 ~]# ceph -s
>>   cluster:
>>     id:     b87276e0-1d92-11ef-a9d6-507c6f66ae2e
>>     health: HEALTH_ERR
>>             1 MDSs report damaged metadata
>>
>>   services:
>>     mon: 3 daemons, quorum mon-01,mon-03,mon-02 (age 19h)
>>     mgr: mon-02.mqaubn(active, since 19h), standbys: mon-03.gvywio, mon-01.xhxqdi
>>     mds: 1/1 daemons up, 2 standby
>>     osd: 368 osds: 368 up (since 18h), 368 in (since 3w)
>>
>>   data:
>>     volumes: 1/1 healthy
>>     pools:   10 pools, 4353 pgs
>>     objects: 1.25M objects, 3.9 TiB
>>     usage:   417 TiB used, 6.4 PiB / 6.8 PiB avail
>>     pgs:     4353 active+clean
>>
>> Did I miss something?
>>
>> The server didn't crash. I don't understand what you mean by "there may be a design flaw in the infrastructure (insecure cache, for example)".
>> How can we know if we have a design problem? What should we check?
>>
>> Best regards,
>>
>> Christophe
>>
>> On 17/04/2025 11:07, David C. wrote:
>>> Hello Christophe,
>>>
>>> Check the file system scrubbing procedure => https://docs.ceph.com/en/latest/cephfs/scrub/
>>> But this doesn't guarantee data recovery.
>>>
>>> Did the cluster crash?
>>> Ceph should be able to handle it; there may be a design flaw in the infrastructure (insecure cache, for example).
>>>
>>> David
>>>
>>> On Thu, Apr 17, 2025 at 10:44, Christophe DIARRA <christophe.dia...@idris.fr> wrote:
>>>
>>> Hello,
>>>
>>> After an electrical maintenance I restarted our ceph cluster but it remains in an unhealthy state: HEALTH_ERR 1 MDSs report damaged metadata.
>>>
>>> How can I repair this damaged metadata?
>>>
>>> To bring down the cephfs cluster I unmounted the fs from the client first and then did: ceph fs set cfs_irods_test down true
>>>
>>> To bring up the cephfs cluster I did: ceph fs set cfs_irods_test down false
>>>
>>> Fortunately the cfs_irods_test fs is almost empty and is a fs for tests. The ceph cluster is not in production yet.
>>>
>>> Following is the current status:
>>>
>>> [root@mon-01 ~]# ceph health detail
>>> HEALTH_ERR 1 MDSs report damaged metadata
>>> [ERR] MDS_DAMAGE: 1 MDSs report damaged metadata
>>>     mds.cfs_irods_test.mon-03.vlmeuz(mds.0): Metadata damage detected
>>>
>>> [root@mon-01 ~]# ceph -s
>>>   cluster:
>>>     id:     b87276e0-1d92-11ef-a9d6-507c6f66ae2e
>>>     health: HEALTH_ERR
>>>             1 MDSs report damaged metadata
>>>
>>>   services:
>>>     mon: 3 daemons, quorum mon-01,mon-03,mon-02 (age 17h)
>>>     mgr: mon-02.mqaubn(active, since 17h), standbys: mon-03.gvywio, mon-01.xhxqdi
>>>     mds: 1/1 daemons up, 2 standby
>>>     osd: 368 osds: 368 up (since 17h), 368 in (since 3w)
>>>
>>>   data:
>>>     volumes: 1/1 healthy
>>>     pools:   10 pools, 4353 pgs
>>>     objects: 1.25M objects, 3.9 TiB
>>>     usage:   417 TiB used, 6.4 PiB / 6.8 PiB avail
>>>     pgs:     4353 active+clean
>>>
>>> [root@mon-01 ~]# ceph fs ls
>>> name: cfs_irods_test, metadata pool: cfs_irods_md_test, data pools: [cfs_irods_def_test cfs_irods_data_test ]
>>>
>>> [root@mon-01 ~]# ceph mds stat
>>> cfs_irods_test:1 {0=cfs_irods_test.mon-03.vlmeuz=up:active} 2 up:standby
>>>
>>> [root@mon-01 ~]# ceph fs status
>>> cfs_irods_test - 0 clients
>>> ==============
>>> RANK  STATE              MDS                 ACTIVITY     DNS    INOS   DIRS   CAPS
>>>  0    active  cfs_irods_test.mon-03.vlmeuz   Reqs:    0 /s    12     15     14      0
>>>         POOL            TYPE     USED  AVAIL
>>>  cfs_irods_md_test    metadata  11.4M  34.4T
>>>  cfs_irods_def_test     data       0   34.4T
>>>  cfs_irods_data_test    data       0   4542T
>>>         STANDBY MDS
>>>  cfs_irods_test.mon-01.hitdem
>>>  cfs_irods_test.mon-02.awuygq
>>> MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
>>>
>>> [root@mon-01 ~]# ceph tell mds.cfs_irods_test:0 damage ls
>>> 2025-04-17T10:23:31.849+0200 7f4b87fff640  0 client.86181 ms_handle_reset on v2:130.84.80.10:6800/3218663047
>>> 2025-04-17T10:23:31.866+0200 7f4b87fff640  0 client.86187 ms_handle_reset on v2:130.84.80.10:6800/3218663047
>>> [
>>>     {
>>>         "damage_type": "dentry",
>>>         "id": 241447932,
>>>         "ino": 1,
>>>         "frag": "*",
>>>         "dname": "testdir2",
>>>         "snap_id": "head",
>>>         "path": "/testdir2"
>>>     },
>>>     {
>>>         "damage_type": "dentry",
>>>         "id": 2273238993,
>>>         "ino": 1,
>>>         "frag": "*",
>>>         "dname": "testdir1",
>>>         "snap_id": "head",
>>>         "path": "/testdir1"
>>>     }
>>> ]
>>> [root@mon-01 ~]#
>>>
>>> Any help will be appreciated,
>>>
>>> Thanks,
>>>
>>> Christophe
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
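For readers who hit the same 'corrupt dentry' damage after a power loss, a minimal sketch of the sequence discussed in this thread, reusing the filesystem name, rank and damage id shown above (the MDS log path depends on how the daemons are deployed, so treat it as an assumption):

    # 1. Look for the damaged dentry around MDS startup in the MDS logs
    #    (log location is an assumption; adjust to your deployment)
    grep -i 'corrupt dentry' /var/log/ceph/*/ceph-mds*.log

    # 2. List the recorded damage entries and note their ids and paths
    ceph tell mds.cfs_irods_test:0 damage ls

    # 3. Try a repairing scrub first and poll its progress
    ceph tell mds.cfs_irods_test:0 scrub start / recursive,repair,force
    ceph tell mds.cfs_irods_test:0 scrub status

    # 4. Only as a last resort, remove the damaged entry; as David notes above,
    #    the contents of the affected directory are then abandoned
    ceph tell mds.cfs_irods_test:0 damage rm 241447932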