Hi Christophe, Hi David,

Could you share the output of the below command after running the scrubbing with recursive,repair,force?

cephfs-journal-tool --rank cfs_irods_test:0 event recover_dentries list

Could be that the MDS recovered these 2 dentries in its journal already but the status of the filesystem was not updated yet. I've seen this happening before. If that's the case, you could try a flush, fail and re-scrub:

ceph tell mds.0 flush journal
ceph mds fail 0
ceph tell mds.cfs_irods_test:0 scrub start / recursive

This might clear the HEALTH_ERR. If not, then it will be easy to fix by rebuilding / fixing the metadata from the data pools since this fs is empty.
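For reference, the rebuild I have in mind is the cephfs-data-scan procedure from the disaster-recovery-experts page. This is only a rough sketch using the pool names from this thread; double-check each step against the documentation for your release before running anything, and keep a journal backup:

# take the fs offline and back up / reset the journal and session table
ceph fs fail cfs_irods_test
cephfs-journal-tool --rank cfs_irods_test:0 journal export /root/mds0-journal-backup.bin
cephfs-journal-tool --rank cfs_irods_test:0 journal reset
cephfs-table-tool cfs_irods_test:all reset session

# rebuild the metadata from the objects in the data pools
cephfs-data-scan init --filesystem cfs_irods_test
cephfs-data-scan scan_extents --filesystem cfs_irods_test cfs_irods_def_test cfs_irods_data_test
cephfs-data-scan scan_inodes --filesystem cfs_irods_test cfs_irods_def_test
cephfs-data-scan scan_links --filesystem cfs_irods_test

# bring the MDS back and scrub again
ceph fs set cfs_irods_test joinable true
ceph tell mds.cfs_irods_test:0 scrub start / recursive,repair

On a filesystem holding only test data this is cheap; on a production fs it would be a last resort.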
Let us know,

Regards,
Frédéric.

----- On Apr 18, 2025, at 9:51, David david.cas...@aevoo.fr wrote:

> I also tend to think that the disk has nothing to do with the problem.
>
> My reading is that the inode associated with the dentry is missing.
> Can anyone correct me?
>
> Christophe informed me that the directories were emptied before the incident.
>
> I don't understand why scrubbing doesn't repair the metadata.
> Perhaps because the directory is empty?
>
> On Thu, Apr 17, 2025 at 19:06, Anthony D'Atri <anthony.da...@gmail.com> wrote:
>
>> HPE rebadges drives from manufacturers. A quick search supports the idea that this SKU is fulfilled at least partly by Kioxia, so not likely a PLP issue.
>>
>> > On Apr 17, 2025, at 11:39 AM, Christophe DIARRA <christophe.dia...@idris.fr> wrote:
>> >
>> > Hello David,
>> >
>> > The SSD model is VO007680JWZJL.
>> >
>> > I will delay the 'ceph tell mds.cfs_irods_test:0 damage rm 241447932' for the moment. If any other solution is found I will be obliged to use this command.
>> >
>> > I found 'dentry' in the logs when the cephfs cluster started:
>> >
>> >> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.cfs_irods_test.mon-02.awuygq Updating MDS map to version 15613 from mon.2
>> >> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 handle_mds_map i am now mds.0.15612
>> >> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 handle_mds_map state change up:starting --> up:active
>> >> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 active_start
>> >> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.cache.den(0x1 testdir2) loaded already *corrupt dentry*: [dentry #0x1/testdir2 [2,head] rep@0.0 NULL (dversion lock) pv=0 v=4442 ino=(nil) state=0 0x5617e18c8280]
>> >> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.cache.den(0x1 testdir1) loaded already *corrupt dentry*: [dentry #0x1/testdir1 [2,head] rep@0.0 NULL (dversion lock) pv=0 v=4442 ino=(nil) state=0 0x5617e18c8500]
>> >> Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
>> >> Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check failed: 1 filesystem is online with fewer MDS than max_mds (MDS_UP_LESS_THAN_MAX)
>> >> Apr 16 17:29:53 mon-02 ceph-mon[2288]: from='client.? xx.xx.xx.8:0/3820885518' entity='client.admin' cmd='[{"prefix": "fs set", "fs_name": "cfs_irods_test", "var": "down", "val": "false"}]': finished
>> >> Apr 16 17:29:53 mon-02 ceph-mon[2288]: daemon mds.cfs_irods_test.mon-02.awuygq assigned to filesystem cfs_irods_test as rank 0 (now has 1 ranks)
>> >> Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
>> >> Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check cleared: MDS_UP_LESS_THAN_MAX (was: 1 filesystem is online with fewer MDS than max_mds)
>> >> Apr 16 17:29:53 mon-02 ceph-mon[2288]: daemon mds.cfs_irods_test.mon-02.awuygq is now active in filesystem cfs_irods_test as rank 0
>> >> Apr 16 17:29:54 mon-02 ceph-mgr[2444]: log_channel(cluster) log [DBG] : pgmap v1721: 4353 pgs: 4346 active+clean, 7 active+clean+scrubbing+deep; 3.9 TiB data, 417 TiB used, 6.4 PiB / 6.8 PiB avail; 1.4 KiB/s rd, 1 op/s
>> >
>> > If you need more extract from the log file please let me know.
>> >
>> > Thanks for your help,
>> >
>> > Christophe
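Those two NULL dentries under inode 0x1 (ino=(nil)) are exactly what the damage table is flagging. If you want to see what is actually stored on disk for them, you could peek at the root directory fragment's omap in the metadata pool; this is read-only, and assumes the usual CephFS naming where the root dirfrag is object 1.00000000 and head dentries are keyed <name>_head:

# list the dentry keys held by the root directory fragment
rados -p cfs_irods_md_test listomapkeys 1.00000000

# dump the raw value of one of the damaged dentries
rados -p cfs_irods_md_test getomapval 1.00000000 testdir1_head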
>> > On 17/04/2025 13:39, David C. wrote:
>> >> If I'm not mistaken, this is a fairly rare situation.
>> >>
>> >> The fact that it's the result of a power outage makes me think of a bad SSD (like "S... Pro").
>> >>
>> >> Does a grep of the dentry id in the MDS logs return anything?
>> >> Maybe some interesting information around this grep.
>> >>
>> >> In the heat of the moment, I have no other idea than to delete the dentry.
>> >>
>> >> ceph tell mds.cfs_irods_test:0 damage rm 241447932
>> >>
>> >> However, in production, this results in the content (of dir /testdir[12]) being abandoned.
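On the grep question: with a cephadm-style deployment like this one the MDS log lines end up in the node's journal, so something along these lines might surface more context around the two dentries (the unit glob, date and log path are guesses to adapt to how the daemons are actually deployed):

# search the MDS daemon's journal output on the node that hosted rank 0
journalctl -u 'ceph-*@mds.cfs_irods_test.*' --since '2025-04-16' | grep -Ei 'testdir[12]|corrupt dentry|damage'

# or, if file logging is enabled, grep for the damage table ids directly
grep -E '241447932|2273238993' /var/log/ceph/*/ceph-mds.cfs_irods_test.*.log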
>> >> On Thu, Apr 17, 2025 at 12:44, Christophe DIARRA <christophe.dia...@idris.fr> wrote:
>> >>
>> >> Hello David,
>> >>
>> >> Thank you for the tip about the scrubbing. I have tried the commands found in the documentation but it seems to have no effect:
>> >>
>> >> [root@mon-01 ~]# *ceph tell mds.cfs_irods_test:0 scrub start / recursive,repair,force*
>> >> 2025-04-17T12:07:20.958+0200 7fd4157fa640  0 client.86301 ms_handle_reset on v2:130.84.80.10:6800/3218663047
>> >> 2025-04-17T12:07:20.979+0200 7fd4157fa640  0 client.86307 ms_handle_reset on v2:130.84.80.10:6800/3218663047
>> >> {
>> >>     "return_code": 0,
>> >>     "scrub_tag": "733b1c6d-a418-4c83-bc8e-b28b556e970c",
>> >>     "mode": "asynchronous"
>> >> }
>> >>
>> >> [root@mon-01 ~]# *ceph tell mds.cfs_irods_test:0 scrub status*
>> >> 2025-04-17T12:07:30.734+0200 7f26cdffb640  0 client.86319 ms_handle_reset on v2:130.84.80.10:6800/3218663047
>> >> 2025-04-17T12:07:30.753+0200 7f26cdffb640  0 client.86325 ms_handle_reset on v2:130.84.80.10:6800/3218663047
>> >> {
>> >>     "status": "no active scrubs running",
>> >>     "scrubs": {}
>> >> }
>> >>
>> >> [root@mon-01 ~]# ceph -s
>> >>   cluster:
>> >>     id:     b87276e0-1d92-11ef-a9d6-507c6f66ae2e
>> >>     *health: HEALTH_ERR
>> >>             1 MDSs report damaged metadata*
>> >>
>> >>   services:
>> >>     mon: 3 daemons, quorum mon-01,mon-03,mon-02 (age 19h)
>> >>     mgr: mon-02.mqaubn(active, since 19h), standbys: mon-03.gvywio, mon-01.xhxqdi
>> >>     mds: 1/1 daemons up, 2 standby
>> >>     osd: 368 osds: 368 up (since 18h), 368 in (since 3w)
>> >>
>> >>   data:
>> >>     volumes: 1/1 healthy
>> >>     pools:   10 pools, 4353 pgs
>> >>     objects: 1.25M objects, 3.9 TiB
>> >>     usage:   417 TiB used, 6.4 PiB / 6.8 PiB avail
>> >>     pgs:     4353 active+clean
>> >>
>> >> Did I miss something?
>> >>
>> >> The server didn't crash. I don't understand what you are meaning by "there may be a design flaw in the infrastructure (insecure cache, for example)".
>> >> How to know if we have a design problem? What should we check?
>> >>
>> >> Best regards,
>> >>
>> >> Christophe
>> >>
>> >> On 17/04/2025 11:07, David C. wrote:
>> >>> Hello Christophe,
>> >>>
>> >>> Check the file system scrubbing procedure => https://docs.ceph.com/en/latest/cephfs/scrub/
>> >>> But this doesn't guarantee data recovery.
>> >>>
>> >>> Was the cluster crashed?
>> >>> Ceph should be able to handle it; there may be a design flaw in the infrastructure (insecure cache, for example).
>> >>>
>> >>> David
>> >>>
>> >>> On Thu, Apr 17, 2025 at 10:44, Christophe DIARRA <christophe.dia...@idris.fr> wrote:
>> >>>
>> >>> Hello,
>> >>>
>> >>> After an electrical maintenance I restarted our ceph cluster but it remains in an unhealthy state: HEALTH_ERR 1 MDSs report damaged metadata.
>> >>>
>> >>> How to repair this damaged metadata?
>> >>>
>> >>> To bring down the cephfs cluster I unmounted the fs from the client first and then did: ceph fs set cfs_irods_test down true
>> >>>
>> >>> To bring up the cephfs cluster I did: ceph fs set cfs_irods_test down false
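One note on that shutdown sequence, in the same spirit as the flush suggested above: before the next planned maintenance it may be worth flushing the MDS journal explicitly and only then taking the fs down, then checking the state on the way back up. A sketch using the names from this thread:

# flush the MDS journal to the metadata pool, then take the fs down cleanly
ceph tell mds.cfs_irods_test:0 flush journal
ceph fs set cfs_irods_test down true

# after the maintenance, bring it back and check the MDS state
ceph fs set cfs_irods_test down false
ceph fs status cfs_irods_test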
>> >>> Fortunately the cfs_irods_test fs is almost empty and is a fs for tests. The ceph cluster is not in production yet.
>> >>>
>> >>> Following is the current status:
>> >>>
>> >>> [root@mon-01 ~]# ceph health detail
>> >>> HEALTH_ERR 1 MDSs report damaged metadata
>> >>> *[ERR] MDS_DAMAGE: 1 MDSs report damaged metadata
>> >>>     mds.cfs_irods_test.mon-03.vlmeuz(mds.0): Metadata damage detected*
>> >>>
>> >>> [root@mon-01 ~]# ceph -s
>> >>>   cluster:
>> >>>     id:     b87276e0-1d92-11ef-a9d6-507c6f66ae2e
>> >>>     health: HEALTH_ERR
>> >>>             1 MDSs report damaged metadata
>> >>>
>> >>>   services:
>> >>>     mon: 3 daemons, quorum mon-01,mon-03,mon-02 (age 17h)
>> >>>     mgr: mon-02.mqaubn(active, since 17h), standbys: mon-03.gvywio, mon-01.xhxqdi
>> >>>     mds: 1/1 daemons up, 2 standby
>> >>>     osd: 368 osds: 368 up (since 17h), 368 in (since 3w)
>> >>>
>> >>>   data:
>> >>>     volumes: 1/1 healthy
>> >>>     pools:   10 pools, 4353 pgs
>> >>>     objects: 1.25M objects, 3.9 TiB
>> >>>     usage:   417 TiB used, 6.4 PiB / 6.8 PiB avail
>> >>>     pgs:     4353 active+clean
>> >>>
>> >>> [root@mon-01 ~]# ceph fs ls
>> >>> name: cfs_irods_test, metadata pool: cfs_irods_md_test, data pools: [cfs_irods_def_test cfs_irods_data_test ]
>> >>>
>> >>> [root@mon-01 ~]# ceph mds stat
>> >>> cfs_irods_test:1 {0=cfs_irods_test.mon-03.vlmeuz=up:active} 2 up:standby
>> >>>
>> >>> [root@mon-01 ~]# ceph fs status
>> >>> cfs_irods_test - 0 clients
>> >>> ==============
>> >>> RANK  STATE             MDS                   ACTIVITY     DNS    INOS   DIRS   CAPS
>> >>>  0    active  cfs_irods_test.mon-03.vlmeuz   Reqs:    0 /s    12     15     14      0
>> >>>         POOL           TYPE     USED  AVAIL
>> >>>  cfs_irods_md_test   metadata  11.4M  34.4T
>> >>>  cfs_irods_def_test    data       0   34.4T
>> >>>  cfs_irods_data_test   data       0   4542T
>> >>>         STANDBY MDS
>> >>>  cfs_irods_test.mon-01.hitdem
>> >>>  cfs_irods_test.mon-02.awuygq
>> >>> MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
>> >>> [root@mon-01 ~]#
>> >>>
>> >>> [root@mon-01 ~]# ceph tell mds.cfs_irods_test:0 damage ls
>> >>> 2025-04-17T10:23:31.849+0200 7f4b87fff640  0 client.86181 ms_handle_reset on v2:130.84.80.10:6800/3218663047
>> >>> 2025-04-17T10:23:31.866+0200 7f4b87fff640  0 client.86187 ms_handle_reset on v2:130.84.80.10:6800/3218663047
>> >>> [
>> >>>     {
>> >>>         *"damage_type": "dentry",*
>> >>>         "id": 241447932,
>> >>>         "ino": 1,
>> >>>         "frag": "*",
>> >>>         "dname": "testdir2",
>> >>>         "snap_id": "head",
>> >>>         "path": "/testdir2"
>> >>>     },
>> >>>     {
>> >>>         *"damage_type": "dentry"*,
>> >>>         "id": 2273238993,
>> >>>         "ino": 1,
>> >>>         "frag": "*",
>> >>>         "dname": "testdir1",
>> >>>         "snap_id": "head",
>> >>>         "path": "/testdir1"
>> >>>     }
>> >>> ]
>> >>> [root@mon-01 ~]#
>> >>>
>> >>> Any help will be appreciated,
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Christophe
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io