I'm pretty sure the cause is the damaged MDS rank. If you are able to clear that up, the filesystem should come back up. I saw something like this a few months ago: we were able to simply mark the damaged rank as "repaired" and haven't seen any issues since, but I would discourage doing that without further investigation into what caused the damage in the first place.
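
For what it's worth, the rough sequence would be something like the below (from memory, untested on your cluster; it assumes the damaged rank is rank 0 of storage_cluster, so adjust to whatever the dump actually reports):

Check which rank(s) the MDSMap considers damaged:

# ceph fs dump | grep -i damaged

Look through the cluster log for the event that originally marked the rank damaged, before you clear it:

# ceph log last 1000 | grep -i damage

Only once you are comfortable with the cause, mark the rank repaired so a standby can take it over:

# ceph mds repaired storage_cluster:0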

Regards,

Bailey Allison
Service Team Lead
45Drives, Ltd.
866-594-7199 x868

On 2025-06-30 10:28, Robert Sander wrote:
Hi,

we are having an issue at a customer site where a 3 PB CephFS is in a failed state.

The cluster itself is unhealthy and awaits replacement disks:

# ceph -s
  cluster:
    id:     28ca2bfa-d87e-11ed-83a3-1070fddda30f
    health: HEALTH_ERR
            4 failed cephadm daemon(s)
            There are daemons running an older version of ceph
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            8 nearfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 46 pgs backfill_toofull
            Possible data damage: 4 pgs inconsistent
            Degraded data redundancy: 6427646/15858772167 objects degraded (0.041%), 8 pgs degraded, 8 pgs undersized
            6 pool(s) nearfull
            (muted: OSDMAP_FLAGS OSD_SCRUB_ERRORS(2d) PG_NOT_DEEP_SCRUBBED PG_NOT_SCRUBBED)

  services:
    mon: 3 daemons, quorum sn01,sn03,sn02 (age 3w)
    mgr: sn03.crlpzh(active, since 33h), standbys: sn01.tegfya, sn02.mzvgcr
    mds: 18/19 daemons up, 1 standby
    osd: 181 osds: 174 up (since 4d), 172 in (since 4d); 206 remapped pgs
         flags nodeep-scrub

  data:
    volumes: 2/3 healthy, 1 recovering; 1 damaged
    pools:   12 pools, 3585 pgs
    objects: 1.93G objects, 1.3 PiB
    usage:   2.5 PiB used, 501 TiB / 3.0 PiB avail
    pgs:     6427646/15858772167 objects degraded (0.041%)
             293845758/15858772167 objects misplaced (1.853%)
             2844 active+clean
             532  active+clean+scrubbing
             147  active+remapped+backfill_wait
             28   active+remapped+backfill_toofull
             11   active+remapped+backfill_wait+backfill_toofull
             10   active+remapped+backfilling
             6    active+undersized+degraded+remapped+backfill_toofull
             2    active+clean+inconsistent
             1    active+clean+scrubbing+deep+inconsistent+repair
             1    active+undersized+remapped+backfilling
             1    active+undersized+degraded+remapped+backfilling
             1    active+recovering+degraded+remapped
             1    active+remapped+inconsistent+backfill_toofull

  io:
    recovery: 183 MiB/s, 312 objects/s


The CephFS metadata pool is not affected by the inconsistent PGs.

The MDSs have this line in their logfile:

"Monitors have assigned me to become a standby."

The filesystem is joinable:

# ceph fs lsflags storage_cluster
joinable allow_snaps allow_multimds_snaps refuse_client_session

But no MDS joins:

# ceph fs status
storage_cluster - 0 clients
===============
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
0    failed
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata   490G  12.9T
  cephfs_data      data     970T  54.1T
  shared_data      data    1351T  22.5T
        STANDBY MDS
storage_cluster.sn04.cbvzzu
MDS version: ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)


Why?


Regards
