Hello David,
The SSD model is VO007680JWZJL.
I will hold off on 'ceph tell mds.cfs_irods_test:0 damage rm 241447932'
for the moment. If no other solution is found, I will have no choice but
to use this command.
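Before resorting to it, I want to double-check which path each damage id maps to. A minimal, jq-free sketch that pulls the id/path pairs out of the 'damage ls' JSON (demonstrated here on the damage entries quoted later in this thread; the piped form against the real command is only a suggestion):

```shell
# Sketch: extract the "id" and "path" fields from the `damage ls` JSON
# using only grep. Against the live cluster this would be something like:
#   ceph tell mds.cfs_irods_test:0 damage ls | grep -oE '"(id|path)": [^,]*'
# Demonstrated on the damage entries quoted in this thread:
cat <<'EOF' | grep -oE '"(id|path)": [^,]*'
[
  {
    "damage_type": "dentry",
    "id": 241447932,
    "path": "/testdir2"
  },
  {
    "damage_type": "dentry",
    "id": 2273238993,
    "path": "/testdir1"
  }
]
EOF
```

This prints each id followed by its path, which makes it easy to confirm that id 241447932 really is /testdir2 before removing anything.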
I found the following 'dentry' entries in the logs from when the cephfs cluster started:
Apr 16 17:29:53 mon-02 ceph-mds[2367]:
mds.cfs_irods_test.mon-02.awuygq Updating MDS map to version 15613
from mon.2
Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 handle_mds_map i am
now mds.0.15612
Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 handle_mds_map
state change up:starting --> up:active
Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 active_start
Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.cache.den(0x1 testdir2) loaded already *corrupt dentry*: [dentry #0x1/testdir2 [2,head] rep@0.0 NULL (dversion lock) pv=0 v=4442 ino=(nil) state=0 0x5617e18c8280]
Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.cache.den(0x1 testdir1) loaded already *corrupt dentry*: [dentry #0x1/testdir1 [2,head] rep@0.0 NULL (dversion lock) pv=0 v=4442 ino=(nil) state=0 0x5617e18c8500]
Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check failed: 1
filesystem is offline (MDS_ALL_DOWN)
Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check failed: 1
filesystem is online with fewer MDS than max_mds (MDS_UP_LESS_THAN_MAX)
Apr 16 17:29:53 mon-02 ceph-mon[2288]: from='client.?
xx.xx.xx.8:0/3820885518' entity='client.admin' cmd='[{"prefix": "fs
set", "fs_name": "cfs_irods_test", "var": "down", "val":
"false"}]': finished
Apr 16 17:29:53 mon-02 ceph-mon[2288]: daemon
mds.cfs_irods_test.mon-02.awuygq assigned to filesystem cfs_irods_test
as rank 0 (now has 1 ranks)
Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check cleared:
MDS_ALL_DOWN (was: 1 filesystem is offline)
Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check cleared:
MDS_UP_LESS_THAN_MAX (was: 1 filesystem is online with fewer MDS than
max_mds)
Apr 16 17:29:53 mon-02 ceph-mon[2288]: daemon
mds.cfs_irods_test.mon-02.awuygq is now active in filesystem
cfs_irods_test as rank 0
Apr 16 17:29:54 mon-02 ceph-mgr[2444]: log_channel(cluster) log [DBG] : pgmap v1721: 4353 pgs: 4346 active+clean, 7 active+clean+scrubbing+deep; 3.9 TiB data, 417 TiB used, 6.4 PiB / 6.8 PiB avail; 1.4 KiB/s rd, 1 op/s
If you need more extracts from the log file, please let me know.
Thanks for your help,
Christophe
On 17/04/2025 13:39, David C. wrote:
If I'm not mistaken, this is a fairly rare situation.
The fact that it's the result of a power outage makes me think of a
bad SSD (like "S... Pro").
Does a grep of the dentry id in the MDS logs return anything?
There may be some interesting information around the matching lines.
Off the top of my head, I have no idea other than deleting the dentry:
ceph tell mds.cfs_irods_test:0 damage rm 241447932
However, in production, this results in the content (of dir
/testdir[12]) being abandoned.
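The grep could look something like this; the log path below is only an assumption and must be adapted to your deployment, and the sample line is one of the 'corrupt dentry' messages from this thread:

```shell
# Hypothetical log location; adjust to where your MDS logs actually live:
#   grep -n -C 5 -E '241447932|corrupt dentry' /var/log/ceph/ceph-mds.cfs_irods_test.*.log
# What a hit looks like, using a sample line from this thread:
printf '%s\n' 'mds.0.cache.den(0x1 testdir2) loaded already corrupt dentry' \
  | grep -cE '241447932|corrupt dentry'
```

The `-C 5` in the real invocation prints five lines of context around each match, which is where the interesting information would be.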
On Thu 17 Apr 2025 at 12:44, Christophe DIARRA
<christophe.dia...@idris.fr> wrote:
Hello David,
Thank you for the tip about scrubbing. I tried the commands from
the documentation, but they seem to have no effect:
[root@mon-01 ~]# *ceph tell mds.cfs_irods_test:0 scrub start / recursive,repair,force*
2025-04-17T12:07:20.958+0200 7fd4157fa640 0 client.86301 ms_handle_reset on v2:130.84.80.10:6800/3218663047
2025-04-17T12:07:20.979+0200 7fd4157fa640 0 client.86307 ms_handle_reset on v2:130.84.80.10:6800/3218663047
{
"return_code": 0,
"scrub_tag": "733b1c6d-a418-4c83-bc8e-b28b556e970c",
"mode": "asynchronous"
}
[root@mon-01 ~]# *ceph tell mds.cfs_irods_test:0 scrub status*
2025-04-17T12:07:30.734+0200 7f26cdffb640 0 client.86319 ms_handle_reset on v2:130.84.80.10:6800/3218663047
2025-04-17T12:07:30.753+0200 7f26cdffb640 0 client.86325 ms_handle_reset on v2:130.84.80.10:6800/3218663047
{
"status": "no active scrubs running",
"scrubs": {}
}
[root@mon-01 ~]# ceph -s
cluster:
id: b87276e0-1d92-11ef-a9d6-507c6f66ae2e
*health: HEALTH_ERR 1 MDSs report damaged metadata*
services:
mon: 3 daemons, quorum mon-01,mon-03,mon-02 (age 19h)
mgr: mon-02.mqaubn(active, since 19h), standbys: mon-03.gvywio,
mon-01.xhxqdi
mds: 1/1 daemons up, 2 standby
osd: 368 osds: 368 up (since 18h), 368 in (since 3w)
data:
volumes: 1/1 healthy
pools: 10 pools, 4353 pgs
objects: 1.25M objects, 3.9 TiB
usage: 417 TiB used, 6.4 PiB / 6.8 PiB avail
pgs: 4353 active+clean
Did I miss something?
The server didn't crash. I don't understand what you mean by
"there may be a design flaw in the infrastructure (insecure
cache, for example)".
How can we tell whether we have a design problem? What should we check?
Best regards,
Christophe
On 17/04/2025 11:07, David C. wrote:
Hello Christophe,
Check the file system scrubbing procedure =>
https://docs.ceph.com/en/latest/cephfs/scrub/. But this doesn't
guarantee data recovery.
Did the cluster crash?
Ceph should be able to handle a power outage; if it can't, there
may be a design flaw in the infrastructure (an unsafe write
cache, for example).
David
On Thu 17 Apr 2025 at 10:44, Christophe DIARRA
<christophe.dia...@idris.fr> wrote:
Hello,
After an electrical maintenance I restarted our ceph cluster,
but it remains in an unhealthy state: HEALTH_ERR 1 MDSs report
damaged metadata.
How can I repair this damaged metadata?
To bring the cephfs cluster down, I first unmounted the fs from
the client and then ran: ceph fs set cfs_irods_test down true
To bring the cephfs cluster back up, I ran: ceph fs set
cfs_irods_test down false
Fortunately the cfs_irods_test fs is almost empty and is only
used for tests. The ceph cluster is not in production yet.
Following is the current status:
[root@mon-01 ~]# ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata
*[ERR] MDS_DAMAGE: 1 MDSs report damaged metadata
mds.cfs_irods_test.mon-03.vlmeuz(mds.0): Metadata damage
detected*
[root@mon-01 ~]# ceph -s
cluster:
id: b87276e0-1d92-11ef-a9d6-507c6f66ae2e
health: HEALTH_ERR
1 MDSs report damaged metadata
services:
mon: 3 daemons, quorum mon-01,mon-03,mon-02 (age 17h)
mgr: mon-02.mqaubn(active, since 17h), standbys:
mon-03.gvywio,
mon-01.xhxqdi
mds: 1/1 daemons up, 2 standby
osd: 368 osds: 368 up (since 17h), 368 in (since 3w)
data:
volumes: 1/1 healthy
pools: 10 pools, 4353 pgs
objects: 1.25M objects, 3.9 TiB
usage: 417 TiB used, 6.4 PiB / 6.8 PiB avail
pgs: 4353 active+clean
[root@mon-01 ~]# ceph fs ls
name: cfs_irods_test, metadata pool: cfs_irods_md_test, data
pools:
[cfs_irods_def_test cfs_irods_data_test ]
[root@mon-01 ~]# ceph mds stat
cfs_irods_test:1 {0=cfs_irods_test.mon-03.vlmeuz=up:active} 2
up:standby
[root@mon-01 ~]# ceph fs status
cfs_irods_test - 0 clients
==============
RANK  STATE   MDS                           ACTIVITY    DNS  INOS  DIRS  CAPS
0     active  cfs_irods_test.mon-03.vlmeuz  Reqs: 0 /s  12   15    14    0
POOL TYPE USED AVAIL
cfs_irods_md_test metadata 11.4M 34.4T
cfs_irods_def_test data 0 34.4T
cfs_irods_data_test data 0 4542T
STANDBY MDS
cfs_irods_test.mon-01.hitdem
cfs_irods_test.mon-02.awuygq
MDS version: ceph version 18.2.2
(531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
[root@mon-01 ~]#
[root@mon-01 ~]# ceph tell mds.cfs_irods_test:0 damage ls
2025-04-17T10:23:31.849+0200 7f4b87fff640 0 client.86181 ms_handle_reset on v2:130.84.80.10:6800/3218663047
2025-04-17T10:23:31.866+0200 7f4b87fff640 0 client.86187 ms_handle_reset on v2:130.84.80.10:6800/3218663047
[
{
*"damage_type": "dentry",*
"id": 241447932,
"ino": 1,
"frag": "*",
"dname": "testdir2",
"snap_id": "head",
"path": "/testdir2"
},
{
*"damage_type": "dentry"*,
"id": 2273238993,
"ino": 1,
"frag": "*",
"dname": "testdir1",
"snap_id": "head",
"path": "/testdir1"
}
]
[root@mon-01 ~]#
Any help will be appreciated,
Thanks,
Christophe
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io