Hi, after running the cluster for years, this is the first time I have a problem that seems to require some expert knowledge ;-)
I had some network problems, which I suspect led to the damaged MDS service. Just as a side note: I upgraded to 17.2.2, but it had been running for about an hour before the network outage, so I don't think this was related. However, now I wonder how to go from here.

```
# ceph -s
  cluster:
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged

  services:
    mon: 5 daemons, quorum ceph3,ceph4,ceph1,ceph5,ceph2 (age 46m)
    mgr: ceph5.zmvagf(active, since 5h), standbys: ceph2.defhpj
    mds: 0/1 daemons up, 2 standby
    osd: 10 osds: 10 up (since 46m), 10 in (since 2h)

  data:
    volumes: 0/1 healthy, 1 recovering; 1 damaged
    pools:   4 pools, 193 pgs
    objects: 5.42M objects, 13 TiB
    usage:   26 TiB used, 11 TiB / 37 TiB avail
    pgs:     193 active+clean

  io:
    client: 200 KiB/s rd, 84 KiB/s wr, 199 op/s rd, 169 op/s wr
```

```
# ceph health detail
HEALTH_ERR 1 filesystem is degraded; 1 filesystem is offline; 1 mds daemon damaged
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[ERR] MDS_ALL_DOWN: 1 filesystem is offline
    fs cephfs is offline because no MDS is active for it.
[ERR] MDS_DAMAGE: 1 mds daemon damaged
    fs cephfs mds.0 is damaged
```

```
# ceph fs status
cephfs - 0 clients
======
RANK  STATE  MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
        POOL            TYPE     USED  AVAIL
cephfs.cephfs.meta    metadata  1293M  1954G
cephfs.cephfs.data      data    2797G  1954G
      ecpool            data    23.2T  2931G
    STANDBY MDS
cephfs.ceph1.yzqmuo
cephfs.ceph3.vmieie
MDS version: ceph version 19.2.2 (0eceb0defba60152a8182f7bd87d164b639885b8) squid (stable)
```

According to
* https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/

I came up with the following procedure.

Deny all client reconnects:

```sh
ceph config set mds mds_deny_all_reconnect true
```

Deny new clients:
```sh
ceph fs set cephfs refuse_client_session true
```

Back up the current journal:

```sh
mkdir /root/mds-damaged
cephfs-journal-tool --rank=cephfs:0 journal export /root/mds-damaged/backup-rank0.bin
```

Recover file metadata and discard what is damaged:

```sh
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
```

Truncate any journal that is corrupt or that an MDS cannot replay:

```sh
cephfs-journal-tool --rank=cephfs:0 journal reset --yes-i-really-really-mean-it
```

Reset the SessionMap:

```sh
cephfs-table-tool all reset session
```

Does this make sense?

Yours,
bbk
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
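The linked disaster-recovery page also covers the wrap-up that the procedure above stops short of: after the repair tools have run, rank 0 is still flagged damaged and clients are still locked out. A hedged sketch, assuming the repair completed cleanly and reusing this cluster's names (`cephfs`, rank 0); it needs a live cluster, so treat it as an outline rather than something tested here:

```sh
# Clear the damaged flag on rank 0 so a standby MDS may claim it
# (per the disaster-recovery-experts page).
ceph mds repaired cephfs:0

# Watch a standby take rank 0 and go active.
ceph fs status cephfs

# Only once rank 0 is active again, lift the client blocks set earlier.
ceph fs set cephfs refuse_client_session false
ceph config rm mds mds_deny_all_reconnect
```

Before letting clients back in, a recursive scrub (`ceph tell mds.cephfs:0 scrub start / recursive repair`) is probably also worth running, since `recover_dentries` plus a journal reset can leave stray metadata behind.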