Hi,

After running the cluster for years, this is the first time I have a problem that
seems to require some expert knowledge ;-)

I had some network problems, which I suspect led to the damaged MDS
service. Just as a side note: I upgraded to 17.2.2, but it had been running for
about an hour before the network outage, so I don't think the upgrade is related.

However, now I wonder how to proceed from here.

```
# ceph -s
  cluster:
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged

  services:
    mon: 5 daemons, quorum ceph3,ceph4,ceph1,ceph5,ceph2 (age 46m)
    mgr: ceph5.zmvagf(active, since 5h), standbys: ceph2.defhpj
    mds: 0/1 daemons up, 2 standby
    osd: 10 osds: 10 up (since 46m), 10 in (since 2h)

  data:
    volumes: 0/1 healthy, 1 recovering; 1 damaged
    pools:   4 pools, 193 pgs
    objects: 5.42M objects, 13 TiB
    usage:   26 TiB used, 11 TiB / 37 TiB avail
    pgs:     193 active+clean

  io:
    client:   200 KiB/s rd, 84 KiB/s wr, 199 op/s rd, 169 op/s wr
```

```
# ceph health detail
HEALTH_ERR 1 filesystem is degraded; 1 filesystem is offline; 1 mds daemon damaged
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[ERR] MDS_ALL_DOWN: 1 filesystem is offline
    fs cephfs is offline because no MDS is active for it.
[ERR] MDS_DAMAGE: 1 mds daemon damaged
    fs cephfs mds.0 is damaged
```

```
# ceph fs status
cephfs - 0 clients
======
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS  
 0    failed                                        
       POOL           TYPE     USED  AVAIL  
cephfs.cephfs.meta  metadata  1293M  1954G  
cephfs.cephfs.data    data    2797G  1954G  
      ecpool          data    23.2T  2931G  
    STANDBY MDS      
cephfs.ceph1.yzqmuo  
cephfs.ceph3.vmieie  
MDS version: ceph version 19.2.2 (0eceb0defba60152a8182f7bd87d164b639885b8) squid (stable)
```

According to

 * https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/

I came up with the following procedure.

Deny reconnects from existing clients:

```sh
ceph config set mds mds_deny_all_reconnect true
```

Refuse new client sessions:

```sh
ceph fs set cephfs refuse_client_session true
```

Back up the current journal:

```sh
mkdir /root/mds-damaged
cephfs-journal-tool --rank=cephfs:0 journal export /root/mds-damaged/backup-rank0.bin
```
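Before changing anything, it may also be worth checking whether the journal is actually readable; as far as I can tell, `journal inspect` only reports on integrity and does not modify anything:

```sh
# Read-only integrity check of rank 0's journal; reports
# whether the journal is damaged without touching it.
cephfs-journal-tool --rank=cephfs:0 journal inspect
```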

Recover file metadata and discard what is damaged:

```sh
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
```

Truncate any journal that is corrupt or that an MDS cannot replay:

```sh
cephfs-journal-tool --rank=cephfs:0 journal reset --yes-i-really-really-mean-it
```

Reset the SessionMap:

```sh
cephfs-table-tool all reset session
```
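If the recovery succeeds, I assume the remaining steps would be to clear the damaged flag so a standby MDS can claim rank 0, scrub the filesystem once it is back online, and then undo the two client blocks from above. The commands below are taken from the same docs but untested on my side:

```sh
# Mark rank 0 as repaired so a standby MDS can take it over
ceph mds repaired cephfs:0

# Once an MDS is active again, verify/repair the recovered metadata
ceph tell mds.cephfs:0 scrub start / recursive,repair

# Re-allow client sessions and reconnects
ceph fs set cephfs refuse_client_session false
ceph config rm mds mds_deny_all_reconnect
```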

Does this make sense?

Yours,
bbk
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
