I'm pretty sure the cause is the damaged MDS daemon. If you can clear
that up, the filesystem should come back up. I saw something similar a
few months ago. We were able to simply mark the MDS rank as "repaired"
and haven't seen any issues since; however, I would discourage doing
that without further investigation into the source of the damage.
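For reference, the workflow we used looked roughly like the following.
This is a sketch, not a prescription: the filesystem name
storage_cluster and rank 0 are taken from your "ceph fs status" output
below, and the journal inspection step is a suggestion for that
"further investigation" before clearing the flag.

# Inspect the rank's journal offline first (only while no MDS holds
# the rank) to see whether the on-disk metadata looks sane:
cephfs-journal-tool --rank=storage_cluster:0 journal inspect

# If that comes back clean, clear the damaged flag so a standby can
# take over rank 0:
ceph mds repaired storage_cluster:0

# Then watch the standby get assigned and replay:
ceph fs status storage_cluster

These commands talk to a live cluster, so obviously test on your setup
and take a backup of the journal (cephfs-journal-tool journal export)
before changing anything.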
Regards,
Bailey Allison
Service Team Lead
45Drives, Ltd.
866-594-7199 x868
On 2025-06-30 10:28, Robert Sander wrote:
Hi,
we are having an issue at a customer site where a 3PB CephFS is in
failed state.
The cluster itself is unhealthy and is awaiting replacement disks:
# ceph -s
  cluster:
    id:     28ca2bfa-d87e-11ed-83a3-1070fddda30f
    health: HEALTH_ERR
            4 failed cephadm daemon(s)
            There are daemons running an older version of ceph
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            8 nearfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 46 pgs backfill_toofull
            Possible data damage: 4 pgs inconsistent
            Degraded data redundancy: 6427646/15858772167 objects degraded (0.041%), 8 pgs degraded, 8 pgs undersized
            6 pool(s) nearfull
            (muted: OSDMAP_FLAGS OSD_SCRUB_ERRORS(2d) PG_NOT_DEEP_SCRUBBED PG_NOT_SCRUBBED)

  services:
    mon: 3 daemons, quorum sn01,sn03,sn02 (age 3w)
    mgr: sn03.crlpzh(active, since 33h), standbys: sn01.tegfya, sn02.mzvgcr
    mds: 18/19 daemons up, 1 standby
    osd: 181 osds: 174 up (since 4d), 172 in (since 4d); 206 remapped pgs
         flags nodeep-scrub

  data:
    volumes: 2/3 healthy, 1 recovering; 1 damaged
    pools:   12 pools, 3585 pgs
    objects: 1.93G objects, 1.3 PiB
    usage:   2.5 PiB used, 501 TiB / 3.0 PiB avail
    pgs:     6427646/15858772167 objects degraded (0.041%)
             293845758/15858772167 objects misplaced (1.853%)
             2844 active+clean
             532  active+clean+scrubbing
             147  active+remapped+backfill_wait
             28   active+remapped+backfill_toofull
             11   active+remapped+backfill_wait+backfill_toofull
             10   active+remapped+backfilling
             6    active+undersized+degraded+remapped+backfill_toofull
             2    active+clean+inconsistent
             1    active+clean+scrubbing+deep+inconsistent+repair
             1    active+undersized+remapped+backfilling
             1    active+undersized+degraded+remapped+backfilling
             1    active+recovering+degraded+remapped
             1    active+remapped+inconsistent+backfill_toofull

  io:
    recovery: 183 MiB/s, 312 objects/s
The CephFS metadata pool is not affected by the inconsistent PGs.
The MDSs have this line in their logfile:
"Monitors have assigned me to become a standby."
The filesystem is joinable:
# ceph fs lsflags storage_cluster
joinable allow_snaps allow_multimds_snaps refuse_client_session
But no MDS joins:
# ceph fs status
storage_cluster - 0 clients
===============
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata   490G  12.9T
  cephfs_data      data     970T  54.1T
  shared_data      data    1351T  22.5T
STANDBY MDS
storage_cluster.sn04.cbvzzu
MDS version: ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)
Why?
Regards
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io