Hi Christophe,
Do you have a 'rados df' or 'ceph df' output (of all pools, not just
the one from the test filesystem) from the time when the 'rados ls'
command was failing?
I'm trying to determine whether the pools were incorrectly reported as
empty due to stuck PGs or OSDs. This is important because we based the
decision to delete the filesystem on this (possibly wrong) information.
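If you still have the figures somewhere (shell history, monitoring),
the output of the commands below would be enough; both are standard
and read-only:

ceph df detail
rados df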
Regards,
Frédéric.
----- On 23 Apr 25, at 18:19, Christophe DIARRA
<christophe.dia...@idris.fr> wrote:
Hello Frédéric,
I have a new working fs now, after deleting the fs + pools and
recreating them. I will mount the fs on the test client, create
some files and run some tests:
1. shut down and restart the cluster to see what happens to the
metadata
2. redo test 1, but leaving the rack without power for several hours
after the cluster is shut down
I will let you know when the tests are finished.
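For reference, the kind of clean shutdown/startup sequence I have in
mind is roughly the following (just a sketch; it assumes the usual
cluster flags are acceptable here and that the daemons are then
stopped/started node by node):

# before powering off: quiesce the fs and freeze recovery
ceph osd set noout
ceph osd set norebalance
ceph fs set cfs_irods_test down true
# ... power off OSD nodes, then MON/MGR nodes ...
# after power-on, once all OSDs are back up:
ceph fs set cfs_irods_test down false
ceph osd unset norebalance
ceph osd unset noout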
Following is the current status:
[mon-01 ~]# ceph fs status
cfs_irods_test - 0 clients
==============
RANK  STATE   MDS                           ACTIVITY     DNS  INOS  DIRS  CAPS
 0    active  cfs_irods_test.mon-01.hitdem  Reqs: 0 /s    10    13    12     0
        POOL           TYPE      USED   AVAIL
 cfs_irods_md_test    metadata   96.0k  34.4T
 cfs_irods_def_test     data        0   2018T
 cfs_irods_data_test    data        0   4542T
STANDBY MDS
cfs_irods_test.mon-03.vlmeuz
cfs_irods_test.mon-02.awuygq
MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
Many thanks to you Frédéric and also to David, Anthony and Michel
for the advice and remarks.
Best regards,
Christophe
On 23/04/2025 14:30, Christophe DIARRA wrote:
Hello Frédéric, Michel,
Rebooting the OSDs one by one solved the 'rados ls' problem.
It now works fine for all the pools.
The next step is to recreate the cephfs fs I deleted yesterday
because of the damaged metadata problem.
I will let you know.
Thanks,
Christophe
On 23/04/2025 12:47, Christophe DIARRA wrote:
Hello Frédéric,
Thank you for the answer.
osd_mclock_max_capacity_iops_hdd is not defined; only
osd_mclock_max_capacity_iops_ssd is defined. I suppose
these are default values.
I didn't know anything about them until now.
[mon-01 ~]# ceph config dump | grep osd_mclock_max_capacity_iops_hdd
[mon-01 ~]#
[mon-01 ~]# ceph config dump | grep osd_mclock_max_capacity_iops
osd.352  basic  osd_mclock_max_capacity_iops_ssd  47136.994042
osd.353  basic  osd_mclock_max_capacity_iops_ssd  45567.566829
osd.354  basic  osd_mclock_max_capacity_iops_ssd  44979.777767
osd.355  basic  osd_mclock_max_capacity_iops_ssd  44494.118337
osd.356  basic  osd_mclock_max_capacity_iops_ssd  48002.559112
osd.357  basic  osd_mclock_max_capacity_iops_ssd  54686.144097
osd.358  basic  osd_mclock_max_capacity_iops_ssd  42349.183758
osd.359  basic  osd_mclock_max_capacity_iops_ssd  58134.190143
osd.360  basic  osd_mclock_max_capacity_iops_ssd  46867.824097
osd.361  basic  osd_mclock_max_capacity_iops_ssd  54869.366372
osd.362  basic  osd_mclock_max_capacity_iops_ssd  55875.432057
osd.363  basic  osd_mclock_max_capacity_iops_ssd  58346.849381
osd.364  basic  osd_mclock_max_capacity_iops_ssd  52520.181799
osd.365  basic  osd_mclock_max_capacity_iops_ssd  46632.056458
osd.366  basic  osd_mclock_max_capacity_iops_ssd  45746.055260
osd.367  basic  osd_mclock_max_capacity_iops_ssd  47884.575954
I will restart the OSD nodes one by one and let you know
if 'rados ls' works again.
Thanks,
Christophe
On 23/04/2025 12:23, Frédéric Nass wrote:
Hi Christophe,
Response inline
----- On 23 Apr 25, at 11:42, Christophe DIARRA
<christophe.dia...@idris.fr> wrote:
Hello Frédéric,
I removed the fs but haven't recreated it yet
because I have a doubt about the
health of the cluster even though it seems healthy:
[mon-01 ~]# ceph -s
cluster:
id: b87276e0-1d92-11ef-a9d6-507c6f66ae2e
health: HEALTH_OK
services:
mon: 3 daemons, quorum mon-01,mon-03,mon-02 (age 6d)
mgr: mon-02.mqaubn(active, since 6d), standbys:
mon-03.gvywio, mon-01.xhxqdi
osd: 368 osds: 368 up (since 16h), 368 in (since 3w)
data:
pools: 10 pools, 4353 pgs
objects: 1.25M objects, 3.9 TiB
usage: 417 TiB used, 6.4 PiB / 6.8 PiB avail
pgs: 4353 active+clean
I observed that listing the objects in any HDD pool hangs: right at
the start for an empty HDD pool, or after displaying the list of
objects otherwise. I have to press Ctrl-C to interrupt the hung
'rados ls' command. I don't have this problem with the pools on SSD.
[mon-01 ~]# rados lspools
.mgr
pool_rbd_rep3_hdd <------ hdd pool
pool_rbd_rep3_ssd
rbd_ec_k6m2_hdd <------ hdd pool
rbd_ec_k6m2_ssd
metadata_4hddrbd_rep3_ssd
metadata_4ssdrbd_rep3_ssd
cfs_irods_md_test
cfs_irods_def_test
cfs_irods_data_test <------ hdd pool
[mon-01 ~]#
1) Testing 'rados ls' on HDD pools:
[mon-01 ~]# rados -p cfs_irods_data_test ls
(hangs forever) ==> Ctrl-C
[mon-01 ~]# rados -p pool_rbd_rep3_hdd ls|head -2
rbd_data.565ed6699dd8.0000000000097ff6
rbd_data.565ed6699dd8.00000000001041fb
(then hangs forever here) ==> Ctrl-C
[mon-01 ~]# rados -p pool_rbd_rep3_hdd ls
rbd_data.565ed6699dd8.0000000000097ff6
rbd_data.565ed6699dd8.00000000001041fb
rbd_data.565ed6699dd8.000000000004f1a3
...
(list truncated by me)
...
rbd_data.565ed6699dd8.000000000016809e
rbd_data.565ed6699dd8.000000000007bc05
(then hangs forever here) ==> Ctrl-C
2) With the pools on SSD everything works well (the 'rados ls'
command doesn't hang):
[mon-01 ~]# for i in $(rados lspools | egrep 'ssd|md|def'); do echo -n "Pool $i :"; rados -p $i ls | wc -l; done
Pool pool_rbd_rep3_ssd :197298
Pool rbd_ec_k6m2_ssd :101552
Pool metadata_4hddrbd_rep3_ssd :5
Pool metadata_4ssdrbd_rep3_ssd :5
Pool cfs_irods_md_test :0
Pool cfs_irods_def_test :0
Below is the configuration of the cluster:
- 3 MONs (HPE DL360) + 8 OSD servers (HPE Apollo 4510 Gen10)
- each OSD server has 44x20TB HDD + 10x7.6TB SSD
This is dense. :-/
- On each OSD server, 8 SSDs are partitioned and used for the WAL/DB
of the HDD OSDs
- On each OSD server, 2 SSDs are used for the CephFS metadata and
default data pools.
Do you see any configuration problem here which could lead to our
metadata problem?
Do you know what could cause the hang of the 'rados ls' command on
the HDD pools? I would like to understand this problem before
recreating a new CephFS fs.
Inaccessible PGs, misbehaving OSDs, or the mClock scheduler in use
with osd_mclock_max_capacity_iops_hdd (auto)set way too low (check
'ceph config dump | grep osd_mclock_max_capacity_iops_hdd').
Since this followed an electrical maintenance (power outage?), if
osd_mclock_max_capacity_iops_hdd is not the issue, I would restart
all HDD OSDs one by one or node by node to have all PGs re-peered.
Then try the 'rados ls' command again.
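For example (read-only checks first; the restart command assumes a
cephadm/orchestrator deployment, and 'osd.<id>' is just a placeholder):

# check whether the mClock capacity was (auto)set too low on the HDD OSDs
ceph config dump | grep osd_mclock_max_capacity_iops_hdd
ceph config show osd.<id> osd_op_queue     # confirm the active scheduler on one OSD
# then restart the HDD OSDs one at a time so their PGs re-peer
ceph orch daemon restart osd.<id>
ceph -s                                    # wait for active+clean before the next one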
Regards,
Frédéric.
The cluster is still in a testing state, so we can do any tests you
recommend.
Thanks,
Christophe
On 22/04/2025 16:46, Christophe DIARRA wrote:
Hello Frédéric,
15 of the 16 parallel scanning workers terminated almost immediately,
but one worker has now been running for more than an hour:
[mon-01 log]# ps -ef | grep scan
root 1977927 1925004 0 15:18 pts/0 00:00:00 cephfs-data-scan scan_extents --filesystem cfs_irods_test --worker_n 11 --worker_m 16
[mon-01 log]# date; lsof -p 1977927 | grep osd
Tue Apr 22 04:37:05 PM CEST 2025
cephfs-da 1977927 root 15u IPv4 7105122 0t0 TCP mon-01:34736->osd-06:6912 (ESTABLISHED)
cephfs-da 1977927 root 18u IPv4 7110774 0t0 TCP mon-01:45122->osd-03:ethoscan (ESTABLISHED)
cephfs-da 1977927 root 19u IPv4 7105123 0t0 TCP mon-01:58556->osd-07:spg (ESTABLISHED)
cephfs-da 1977927 root 20u IPv4 7049672 0t0 TCP mon-01:55064->osd-01:7112 (ESTABLISHED)
cephfs-da 1977927 root 21u IPv4 7082598 0t0 TCP mon-01:42120->osd-03-data:6896 (SYN_SENT)
[mon-01 log]#
The filesystem is empty. So I will follow your
advice and remove it. After that
I will recreate it.
I will redo some proper shutdown and restart
of the cluster to check if the
problem reappears with the newly recreated fs.
I will let you know.
Thank you for your help,
Christophe
On 22/04/2025 15:56, Frédéric Nass wrote:
That is weird, for two reasons.
The first reason is that the
cephfs-data-scan should not run for a
couple of
hours on empty data pools. I just tried to
run it on an empty pool and it
doesn't run for more than maybe 10 seconds.
The second reason is that the data pool
cfs_irods_def_test should not be empty,
even if the filesystem tree is. It
should at least have a few rados
objects named after
{100,200,400,60x}.00000000 and the root
inode 1.00000000 /
1.00000000.inode unless you removed the
filesystem by running the 'ceph fs rm
<filesystem_name> --yes-i-really-mean-it'
command which does remove rados
objects in the associated pools.
If it's clear for you that this filesystem
should be empty, I'd advise you to
remove it (using the 'ceph fs rm'
command), delete any rados objects in the
metadata and data pools, and then recreate
the filesystem.
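A rough sketch of that sequence, reusing the existing pool names
(destructive, so double-check everything; deleting pools also assumes
mon_allow_pool_delete=true, and the pool creation options such as the
EC profile and the hdd/ssd crush rules are omitted here):

ceph fs fail cfs_irods_test
ceph fs rm cfs_irods_test --yes-i-really-mean-it
# simplest way to drop any leftover rados objects: delete and recreate the pools
ceph osd pool rm cfs_irods_md_test cfs_irods_md_test --yes-i-really-really-mean-it
ceph osd pool rm cfs_irods_def_test cfs_irods_def_test --yes-i-really-really-mean-it
ceph osd pool rm cfs_irods_data_test cfs_irods_data_test --yes-i-really-really-mean-it
ceph osd pool create cfs_irods_md_test
ceph osd pool create cfs_irods_def_test
ceph osd pool create cfs_irods_data_test
ceph fs new cfs_irods_test cfs_irods_md_test cfs_irods_def_test
ceph fs add_data_pool cfs_irods_test cfs_irods_data_test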
Regards,
Frédéric.
----- On 22 Apr 25, at 15:13, Christophe DIARRA
<christophe.dia...@idris.fr> wrote:
Hello Frédéric,
I have:
[mon-01 ~]# rados df | grep -E 'OBJECTS|cfs_irods_def_test|cfs_irods_data_test'
POOL_NAME            USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD   WR_OPS  WR       USED COMPR  UNDER COMPR
cfs_irods_data_test  0 B   0        0       0       0                   0        0         0       0 B  0       0 B      0 B         0 B
cfs_irods_def_test   0 B   0        0       0       0                   0        0         1       0 B  80200   157 GiB  0 B         0 B
[mon-01 ~]#
I will interrupt the current scanning
process and rerun it with
more workers.
Thanks,
Christophe
On 22/04/2025 15:05, Frédéric Nass wrote:
Hum... Obviously this 'empty' filesystem
has way more rados
objects in the 2 data pools than expected.
You should see that
many objects with:
rados df | grep -E 'OBJECTS|cfs_irods_def_test|cfs_irods_data_test'
If waiting is not an option, you can break
the scan_extents
command, re-run it with multiple workers,
and then proceed
with the next scan (scan_links). Just make
sure you run the
next scan with multiple workers as well.
Regards,
Frédéric.
----- On 22 Apr 25, at 14:54, Christophe DIARRA
<christophe.dia...@idris.fr> wrote:
Hello Frédéric,
I ran the commands (see below), but the command 'cephfs-data-scan
scan_extents --filesystem cfs_irods_test' is not finished yet. It has
been running for 2+ hours. I didn't run it in parallel because the
filesystem only contains empty directories. According to [1]:
"scan_extents and scan_inodes commands may take a very long time if
the data pool contains many files or very large files." Now I think I
should have run the command in parallel. I don't know if it is safe
to interrupt it and then rerun it with 16 workers.
On 22/04/2025 12:13, Frédéric Nass wrote:
Hi Christophe,
You could but it won't be of any help
since the
journal is empty. What you can do to fix
the fs
metadata is to run the below commands from
the
disaster-recovery-experts documentation
[1] in this
particular order:
#Prevent access to the fs and set it down.
ceph fs set cfs_irods_test
refuse_client_session true
ceph fs set cfs_irods_test joinable false
ceph fs set cfs_irods_test down true
[mon-01 ~]# ceph fs set cfs_irods_test refuse_client_session true
client(s) blocked from establishing new session(s)
[mon-01 ~]# ceph fs set cfs_irods_test joinable false
cfs_irods_test marked not joinable; MDS cannot join as newly active.
[mon-01 ~]# ceph fs set cfs_irods_test down true
cfs_irods_test marked down.
# Reset maps and journal
cephfs-table-tool cfs_irods_test:0 reset session
cephfs-table-tool cfs_irods_test:0 reset snap
cephfs-table-tool cfs_irods_test:0 reset inode
[mon-01 ~]# cephfs-table-tool cfs_irods_test:0 reset session
{
    "0": {
        "data": {},
        "result": 0
    }
}
[mon-01 ~]# cephfs-table-tool cfs_irods_test:0 reset snap
Error ((2) No such file or directory)
2025-04-22T12:29:09.550+0200 7f1d4c03e100 -1 main: Bad rank selection: cfs_irods_test:0'
[mon-01 ~]# cephfs-table-tool cfs_irods_test:0 reset inode
Error ((2) No such file or directory)
2025-04-22T12:29:43.880+0200 7f0878a3a100 -1 main: Bad rank selection: cfs_irods_test:0'
cephfs-journal-tool --rank cfs_irods_test:0 journal reset --force
cephfs-data-scan init --force-init --filesystem cfs_irods_test
[mon-01 ~]# cephfs-journal-tool --rank cfs_irods_test:0 journal reset --force
Error ((2) No such file or directory)
2025-04-22T12:34:42.474+0200 7fe8b3a36100 -1 main: Couldn't determine MDS rank.
[mon-01 ~]# cephfs-data-scan init --force-init --filesystem cfs_irods_test
[mon-01 ~]#
# Rescan data and fix metadata (leaving the below commands commented
# for information on how to parallelize these scan tasks)
#for i in {0..15} ; do cephfs-data-scan scan_frags --filesystem cfs_irods_test --force-corrupt --worker_n $i --worker_m 16 & done
#for i in {0..15} ; do cephfs-data-scan scan_extents --filesystem cfs_irods_test --worker_n $i --worker_m 16 & done
#for i in {0..15} ; do cephfs-data-scan scan_inodes --filesystem cfs_irods_test --force-corrupt --worker_n $i --worker_m 16 & done
#for i in {0..15} ; do cephfs-data-scan scan_links --filesystem cfs_irods_test --worker_n $i --worker_m 16 & done
cephfs-data-scan scan_frags --filesystem cfs_irods_test --force-corrupt
cephfs-data-scan scan_extents --filesystem cfs_irods_test
[mon-01 ~]# cephfs-data-scan scan_frags --filesystem cfs_irods_test --force-corrupt
[mon-01 ~]# cephfs-data-scan scan_extents --filesystem cfs_irods_test  *------> still running*
I don't know how long it will take. Once it is completed I will run
the remaining commands.
Thanks,
Christophe
cephfs-data-scan scan_inodes --filesystem cfs_irods_test --force-corrupt
cephfs-data-scan scan_links --filesystem cfs_irods_test
cephfs-data-scan cleanup --filesystem cfs_irods_test
#ceph mds repaired 0 <---- should not be
necessary
# Set the fs back online and accessible
ceph fs set cfs_irods_test down false
ceph fs set cfs_irods_test joinable true
ceph fs set cfs_irods_test refuse_client_session false
An MDS should now start; if not, use 'ceph orch daemon restart
mds.xxxxx' to start one. After remounting the fs you should be able
to access /testdir1 and /testdir2 in the fs root.
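For the remount, a kernel-client mount along these lines should do
(the mon address and mount point are placeholders, and it assumes the
admin keyring is available to mount.ceph):

mkdir -p /mnt/cfs_irods_test
mount -t ceph <mon_host>:/ /mnt/cfs_irods_test -o name=admin,fs=cfs_irods_test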
# Scrub the fs again to check that everything is OK.
ceph tell mds.cfs_irods_test:0 scrub start / recursive,repair,force
Regards,
Frédéric.
[1] https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/
----- On 22 Apr 25, at 10:21, Christophe DIARRA
<christophe.dia...@idris.fr> wrote:
Hello Frédéric,
Thank you for your help.
Following is the output you asked for:
[mon-01 ~]# date
Tue Apr 22 10:09:10 AM CEST 2025
[root@fidrcmon-01 ~]# ceph tell mds.cfs_irods_test:0 scrub start / recursive,repair,force
2025-04-22T10:09:12.796+0200 7f43f6ffd640 0 client.86553 ms_handle_reset on v2:130.84.80.10:6800/3218663047
2025-04-22T10:09:12.818+0200 7f43f6ffd640 0 client.86559 ms_handle_reset on v2:130.84.80.10:6800/3218663047
{
    "return_code": 0,
    "scrub_tag": "12e537bb-bb39-4f3b-ae09-e0a1ae6ce906",
    "mode": "asynchronous"
}
[root@fidrcmon-01 ~]# ceph tell mds.cfs_irods_test:0 scrub status
2025-04-22T10:09:31.760+0200 7f3f0f7fe640 0 client.86571 ms_handle_reset on v2:130.84.80.10:6800/3218663047
2025-04-22T10:09:31.781+0200 7f3f0f7fe640 0 client.86577 ms_handle_reset on v2:130.84.80.10:6800/3218663047
{
    "status": "no active scrubs running",
    "scrubs": {}
}
[root@fidrcmon-01 ~]# cephfs-journal-tool --rank cfs_irods_test:0 event recover_dentries list
2025-04-16T18:24:56.802960+0200 0x7c334a SUBTREEMAP: ()
[root@fidrcmon-01 ~]#
Based on this output, can I run the other three commands provided in
your message?

ceph tell mds.0 flush journal
ceph mds fail 0
ceph tell mds.cfs_irods_test:0 scrub start / recursive
Thanks,
Christophe
On 19/04/2025 12:55, Frédéric Nass wrote:
Hi Christophe, Hi David,
Could you share the output of the below command after running the
scrubbing with recursive,repair,force?

cephfs-journal-tool --rank cfs_irods_test:0 event recover_dentries list

Could be that the MDS already recovered these 2 dentries in its
journal but the status of the filesystem was not updated yet. I've
seen this happen before.
If that's the case, you could try a flush, fail and re-scrub:
ceph tell mds.0 flush journal
ceph mds fail 0
ceph tell mds.cfs_irods_test:0 scrub start
/ recursive
This might clear the HEALTH_ERR. If not,
then it will be easy to fix by
rebuilding / fixing the metadata from the
data pools since this fs is empty.
Let us know,
Regards,
Frédéric.
----- On 18 Apr 25, at 9:51, david <david.cas...@aevoo.fr> wrote:
I also tend to think that the disk has
nothing to do with the problem.
My reading is that the inode associated
with the dentry is missing.
Can anyone correct me?
Christophe informed me that the
directories were emptied before the
incident.
I don't understand why scrubbing doesn't repair the metadata.
Perhaps because the directory is empty?
On Thu 17 Apr 2025 at 19:06, Anthony D'Atri
<anthony.da...@gmail.com> wrote:
HPE rebadges drives from manufacturers. A
quick search supports the idea
that this SKU is fulfilled at least partly
by Kioxia, so not likely a PLP
issue.
On Apr 17, 2025, at 11:39 AM, Christophe DIARRA
<christophe.dia...@idris.fr> wrote:
Hello David,
The SSD model is VO007680JWZJL.
I will delay running 'ceph tell mds.cfs_irods_test:0 damage rm
241447932' for the moment. If no other solution is found, I will be
obliged to use this command.
I found 'dentry' in the logs when the
cephfs cluster started:
Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.cfs_irods_test.mon-02.awuygq Updating MDS map to version 15613 from mon.2
Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 handle_mds_map i am now mds.0.15612
Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 handle_mds_map state change up:starting --> up:active
Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 active_start
Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.cache.den(0x1 testdir2) loaded already *corrupt dentry*: [dentry #0x1/testdir2 [2,head] rep@0.0 NULL (dversion lock) pv=0 v=4442 ino=(nil) state=0 0x5617e18c8280]
Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.cache.den(0x1 testdir1) loaded already *corrupt dentry*: [dentry #0x1/testdir1 [2,head] rep@0.0 NULL (dversion lock) pv=0 v=4442 ino=(nil) state=0 0x5617e18c8500]
Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check failed: 1 filesystem is online with fewer MDS than max_mds (MDS_UP_LESS_THAN_MAX)
Apr 16 17:29:53 mon-02 ceph-mon[2288]: from='client.? xx.xx.xx.8:0/3820885518' entity='client.admin' cmd='[{"prefix": "fs set", "fs_name": "cfs_irods_test", "var": "down", "val": "false"}]': finished
Apr 16 17:29:53 mon-02 ceph-mon[2288]: daemon mds.cfs_irods_test.mon-02.awuygq assigned to filesystem cfs_irods_test as rank 0 (now has 1 ranks)
Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check cleared: MDS_UP_LESS_THAN_MAX (was: 1 filesystem is online with fewer MDS than max_mds)
Apr 16 17:29:53 mon-02 ceph-mon[2288]: daemon mds.cfs_irods_test.mon-02.awuygq is now active in filesystem cfs_irods_test as rank 0
Apr 16 17:29:54 mon-02 ceph-mgr[2444]: log_channel(cluster) log [DBG] : pgmap v1721: 4353 pgs: 4346 active+clean, 7 active+clean+scrubbing+deep; 3.9 TiB data, 417 TiB used, 6.4 PiB / 6.8 PiB avail; 1.4 KiB/s rd, 1 op/s
If you need a longer extract from the log file, please let me know.
Thanks for your help,
Christophe
On 17/04/2025 13:39, David C. wrote:
If I'm not mistaken, this is a fairly rare
situation.
The fact that it's the result of a power
outage makes me think of a bad
SSD (like "S... Pro").
Does a grep of the dentry id in the MDS
logs return anything?
Maybe there is some interesting information around this grep.
In the heat of the moment, I have no other
idea than to delete the
dentry.
ceph tell mds.cfs_irods_test:0 damage rm
241447932
However, in production, this results in
the content (of dir
/testdir[12]) being abandoned.
On Thu 17 Apr 2025 at 12:44, Christophe DIARRA
<christophe.dia...@idris.fr> wrote:
Hello David,
Thank you for the tip about the scrubbing.
I have tried the
commands found in the documentation but it
seems to have no effect:
[root@mon-01 ~]# *ceph tell mds.cfs_irods_test:0 scrub start / recursive,repair,force*
2025-04-17T12:07:20.958+0200 7fd4157fa640 0 client.86301 ms_handle_reset on v2:130.84.80.10:6800/3218663047
2025-04-17T12:07:20.979+0200 7fd4157fa640 0 client.86307 ms_handle_reset on v2:130.84.80.10:6800/3218663047
{
    "return_code": 0,
    "scrub_tag": "733b1c6d-a418-4c83-bc8e-b28b556e970c",
    "mode": "asynchronous"
}
[root@mon-01 ~]# *ceph tell mds.cfs_irods_test:0 scrub status*
2025-04-17T12:07:30.734+0200 7f26cdffb640 0 client.86319 ms_handle_reset on v2:130.84.80.10:6800/3218663047
2025-04-17T12:07:30.753+0200 7f26cdffb640 0 client.86325 ms_handle_reset on v2:130.84.80.10:6800/3218663047
{
    "status": "no active scrubs running",
    "scrubs": {}
}
[root@mon-01 ~]# ceph -s
cluster:
id: b87276e0-1d92-11ef-a9d6-507c6f66ae2e
*health: HEALTH_ERR
1 MDSs report damaged metadata*
services:
mon: 3 daemons, quorum mon-01,mon-03,mon-02 (age 19h)
mgr: mon-02.mqaubn(active, since 19h), standbys: mon-03.gvywio, mon-01.xhxqdi
mds: 1/1 daemons up, 2 standby
osd: 368 osds: 368 up (since 18h), 368 in (since 3w)
data:
volumes: 1/1 healthy
pools: 10 pools, 4353 pgs
objects: 1.25M objects, 3.9 TiB
usage: 417 TiB used, 6.4 PiB / 6.8 PiB avail
pgs: 4353 active+clean
Did I miss something?
The server didn't crash. I don't understand what you mean by "there
may be a design flaw in the infrastructure (insecure cache, for
example)".
How can we tell whether we have a design problem? What should we check?
Best regards,
Christophe
On 17/04/2025 11:07, David C. wrote:
Hello Christophe,
Check the file system scrubbing procedure =>
https://docs.ceph.com/en/latest/cephfs/scrub/
But this doesn't guarantee data recovery.
Did the cluster crash?
Ceph should be able to handle it; there may be a design flaw in
the infrastructure (insecure cache, for example).
David
On Thu 17 Apr 2025 at 10:44, Christophe DIARRA
<christophe.dia...@idris.fr> wrote:
Hello,
After an electrical maintenance I restarted our Ceph cluster, but it
remains in an unhealthy state: HEALTH_ERR 1 MDSs report damaged
metadata.
How can I repair this damaged metadata?
To bring the CephFS cluster down I first unmounted the fs from the
client and then did: ceph fs set cfs_irods_test down true
To bring the CephFS cluster up I did: ceph fs set cfs_irods_test down false
Fortunately the cfs_irods_test fs is almost empty and is a fs for
tests. The Ceph cluster is not in production yet.
Following is the current status:
[root@mon-01 ~]# ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata
*[ERR] MDS_DAMAGE: 1 MDSs report damaged metadata
mds.cfs_irods_test.mon-03.vlmeuz(mds.0): Metadata damage detected*
[root@mon-01 ~]# ceph -s
cluster:
id: b87276e0-1d92-11ef-a9d6-507c6f66ae2e
health: HEALTH_ERR
1 MDSs report damaged metadata
services:
mon: 3 daemons, quorum
mon-01,mon-03,mon-02 (age 17h)
mgr: mon-02.mqaubn(active, since 17h),
standbys:
mon-03.gvywio,
mon-01.xhxqdi
mds: 1/1 daemons up, 2 standby
osd: 368 osds: 368 up (since 17h), 368 in
(since 3w)
data:
volumes: 1/1 healthy
pools: 10 pools, 4353 pgs
objects: 1.25M objects, 3.9 TiB
usage: 417 TiB used, 6.4 PiB / 6.8 PiB avail
pgs: 4353 active+clean
[root@mon-01 ~]# ceph fs ls
name: cfs_irods_test, metadata pool:
cfs_irods_md_test, data
pools:
[cfs_irods_def_test cfs_irods_data_test ]
[root@mon-01 ~]# ceph mds stat
cfs_irods_test:1
{0=cfs_irods_test.mon-03.vlmeuz=up:active} 2
up:standby
[root@mon-01 ~]# ceph fs status
cfs_irods_test - 0 clients
==============
RANK  STATE   MDS                           ACTIVITY     DNS  INOS  DIRS  CAPS
 0    active  cfs_irods_test.mon-03.vlmeuz  Reqs: 0 /s    12    15    14     0
        POOL           TYPE      USED   AVAIL
 cfs_irods_md_test    metadata   11.4M  34.4T
 cfs_irods_def_test     data        0   34.4T
 cfs_irods_data_test    data        0   4542T
STANDBY MDS
cfs_irods_test.mon-01.hitdem
cfs_irods_test.mon-02.awuygq
MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
[root@mon-01 ~]#
[root@mon-01 ~]# ceph tell mds.cfs_irods_test:0 damage ls
2025-04-17T10:23:31.849+0200 7f4b87fff640 0 client.86181 ms_handle_reset on v2:130.84.80.10:6800/3218663047
2025-04-17T10:23:31.866+0200 7f4b87fff640 0 client.86187 ms_handle_reset on v2:130.84.80.10:6800/3218663047
[
    {
        *"damage_type": "dentry",*
        "id": 241447932,
        "ino": 1,
        "frag": "*",
        "dname": "testdir2",
        "snap_id": "head",
        "path": "/testdir2"
    },
    {
        *"damage_type": "dentry",*
        "id": 2273238993,
        "ino": 1,
        "frag": "*",
        "dname": "testdir1",
        "snap_id": "head",
        "path": "/testdir1"
    }
]
[root@mon-01 ~]#
Any help will be appreciated,
Thanks,
Christophe
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io