Did you unmount the filesystem using the following?

  umount -l

Shinobu

On Wed, Sep 9, 2015 at 4:31 PM, Goncalo Borges <gonc...@physics.usyd.edu.au>
wrote:

> Dear Ceph / CephFS gurus...
>
> Bear with me while I give you a bit of context; the questions appear at
> the end.
>
> 1) I am currently running ceph 9.0.3, which I have installed to test the
> cephfs recovery tools.
>
> 2) I've created a situation where I deliberately lost some data and
> metadata (see annex 1 after the main email).
>
> 3) I've stopped the mds and waited to see how the cluster would react.
> After some time, as expected, the cluster reported an ERROR state, with a
> lot of PGs degraded and stuck:
>
> # ceph -s
>     cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>      health HEALTH_ERR
>             174 pgs degraded
>             48 pgs stale
>             174 pgs stuck degraded
>             41 pgs stuck inactive
>             48 pgs stuck stale
>             238 pgs stuck unclean
>             174 pgs stuck undersized
>             174 pgs undersized
>             recovery 22366/463263 objects degraded (4.828%)
>             recovery 8190/463263 objects misplaced (1.768%)
>             too many PGs per OSD (388 > max 300)
>             mds rank 0 has failed
>             mds cluster is degraded
>      monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
>             election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>      mdsmap e24: 0/1/1 up, 1 failed
>      osdmap e544: 21 osds: 15 up, 15 in; 87 remapped pgs
>       pgmap v25699: 2048 pgs, 2 pools, 602 GB data, 150 kobjects
>             1715 GB used, 40027 GB / 41743 GB avail
>             22366/463263 objects degraded (4.828%)
>             8190/463263 objects misplaced (1.768%)
>                 1799 active+clean
>                  110 active+undersized+degraded
>                   60 active+remapped
>                   37 stale+undersized+degraded+peered
>                   23 active+undersized+degraded+remapped
>                   11 stale+active+clean
>                    4 undersized+degraded+peered
>                    4 active
>
> 4) I've unmounted the cephfs clients ('umount -l' worked for me this time,
> but I have already had situations where 'umount' would simply hang, and
> the only viable solution was to reboot the client).
>
> 5) I've recovered the ceph cluster (details on the recovery operations
> are in annex 2 after the main email) by:
> - declaring the osds lost
> - removing the osds from the crush map
> - letting the cluster stabilize and letting all the recovery I/O finish
> - identifying stuck PGs
> - checking whether they still existed and, if not, recreating them.
>
>
> 6) I've restarted the MDS. Initially the mds cluster was considered
> degraded, but after a short time that message disappeared. The remaining
> WARNING status was just due to "too many PGs per OSD (409 > max 300)":
>
> # ceph -s
>     cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>      health HEALTH_WARN
>             too many PGs per OSD (409 > max 300)
>             mds cluster is degraded
>      monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
>             election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>      mdsmap e27: 1/1/1 up {0=rccephmds=up:reconnect}
>      osdmap e614: 15 osds: 15 up, 15 in
>       pgmap v27304: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
>             1761 GB used, 39981 GB / 41743 GB avail
>                 2048 active+clean
>   client io 4151 kB/s rd, 1 op/s
>
> (wait some time)
>
> # ceph -s
>     cluster 8465c6a6-5eb4-4cdf-8845-0de552d0a738
>      health HEALTH_WARN
>             too many PGs per OSD (409 > max 300)
>      monmap e1: 3 mons at
> {mon1=X.X.X.X:6789/0,mon2=Y.Y.Y.Y:6789/0,mon3=Z.Z.Z.Z:6789/0}
>             election epoch 10, quorum 0,1,2 mon1,mon3,mon2
>      mdsmap e29: 1/1/1 up {0=rccephmds=up:active}
>      osdmap e614: 15 osds: 15 up, 15 in
>       pgmap v30442: 2048 pgs, 2 pools, 586 GB data, 146 kobjects
>             1761 GB used, 39981 GB / 41743 GB avail
>                 2048 active+clean
>
> 7) I was able to mount the cephfs filesystem on a client. When I tried to
> read a file whose objects had been partially lost, I got holes in part of
> the file (compare with the same operation in annex 1):
>
> # od /cephfs/goncalo/5Gbytes_029.txt | head
> 0000000 000000 000000 000000 000000 000000 000000 000000 000000
> *
> 2000000 176665 053717 015710 124465 047254 102011 065275 123534
> 2000020 015727 131070 075673 176566 047511 154343 146334 006111
> 2000040 050506 102172 172362 121464 003532 005427 137554 137111
> 2000060 071444 052477 123364 127652 043562 144163 170405 026422
> 2000100 050316 117337 042573 171037 150704 071144 066344 116653
> 2000120 076041 041546 030235 055204 016253 136063 046012 066200
> 2000140 171626 123573 065351 032357 171326 132673 012213 016046
> 2000160 022034 160053 156107 141471 162551 124615 102247 125502
>
>
> Finally the questions:
>
> a./ In a situation like the one described above, how can we safely
> unmount cephfs on the clients? I have had situations where umount simply
> hangs and there is no real way to unblock it other than rebooting the
> client. If we have hundreds of clients, I would like to avoid that.
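>
> As an illustration of what I end up doing today on a stuck client (just a
> sketch, not necessarily the right approach; the MDS-side "session"
> admin-socket commands below are an assumption on my part, and their exact
> names may differ in this release):
>
> # umount /cephfs || umount -f /cephfs || umount -l /cephfs
>
> and, on the MDS host (rccephmds here), list the client sessions and evict
> the stuck one, using the client id taken from the 'session ls' output:
>
> # ceph daemon mds.rccephmds session ls
> # ceph daemon mds.rccephmds session evict <client_id>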
>
> b./ I was expecting to have lost metadata, since I've wiped the OSDs
> where the metadata for the /cephfs/goncalo/5Gbytes_029.txt file was
> stored. I was a bit surprised that '/cephfs/goncalo/5Gbytes_029.txt' was
> still properly referenced, without me having to run any recovery tool.
> What am I missing?
>
> c./ After recovering the cluster, I thought I was in a cephfs situation
> where I had:
>     c.1 files with holes (because of lost PGs and objects in the data pool)
>     c.2 files without metadata (because of lost PGs and objects in the
> metadata pool)
>     c.3 metadata without associated files (because of lost PGs and objects
> in the data pool)
> I've tried to run the recovery tools, but I have several doubts which I
> did not find answered in the documentation:
>     - Is there a specific order / a specific way to run the tools for the
> c.1, c.2 and c.3 cases I mentioned?
>
> d./ Since I was testing, I simply ran the following sequence, but I am not
> sure what the commands are doing, nor whether the sequence is correct. I
> think an example use case should be documented. In particular,
> cephfs-data-scan did not return any output or information, so I am not
> sure whether anything happened at all.
>
> # cephfs-table-tool 0 reset session
> {
>     "0": {
>         "data": {},
>         "result": 0
>     }
> }
>
> # cephfs-table-tool 0 reset snap
> {
>     "result": 0
> }
>
> # cephfs-table-tool 0 reset inode
> {
>     "0": {
>         "data": {},
>         "result": 0
>     }
> }
>
> # cephfs-journal-tool --rank=0 journal reset
> old journal was 4194304~22381701
> new journal start will be 29360128 (2784123 bytes past old end)
> writing journal head
> writing EResetJournal entry
> done
>
> # cephfs-data-scan init
>
> # cephfs-data-scan scan_extents cephfs_dt
> # cephfs-data-scan scan_inodes cephfs_dt
>
> # cephfs-data-scan scan_extents --force-pool cephfs_mt (doesn't seem to
> work)
>
> e./ After running the cephfs tools, everything seemed to be in exactly
> the same state: no visible changes or errors at the filesystem level. So,
> at this point, I am not sure what to conclude...
>
>
> Thank you in advance for your responses.
> Cheers
> Goncalo
>
>
> # #############################
> # ANNEX 1: GENERATE DATA LOSS #
> # #############################
>
> 1) Check a file
> # ls -l /cephfs/goncalo/5Gbytes_029.txt
> -rw-r--r-- 1 root root 5368709120 Sep  8 03:55
> /cephfs/goncalo/5Gbytes_029.txt
>
> --- * ---
>
> 2) See its contents
> # od /cephfs/goncalo/5Gbytes_029.txt |  head
> 0000000 150343 117016 156040 100553 154377 174521 137643 047440
> 0000020 006310 013157 064422 136662 145623 116101 137007 031237
> 0000040 111570 010104 103540 126335 014632 053445 006114 047003
> 0000060 123201 170045 042771 036561 152363 017716 000405 053556
> 0000100 102524 106517 066114 071112 144366 011405 074170 032621
> 0000120 047761 177217 103414 000774 174320 122332 110323 065706
> 0000140 042467 035356 132363 067446 145351 155277 177533 062050
> 0000160 016303 030741 066567 043517 172655 176016 017304 033342
> 0000200 177440 130510 163707 060513 055027 107702 023012 130435
> 0000220 022342 011762 035372 044033 152230 043424 004062 177461
>
> --- * ---
>
> 3) Get its inode, and convert it to HEX
> # ls -li /cephfs/goncalo/5Gbytes_029.txt
> 1099511627812 -rw-r--r-- 1 root root 5368709120 Sep  8 03:55
> /cephfs/goncalo/5Gbytes_029.txt
>
> (1099511627812)_base10 = (10000000024)_base16
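>
> As a quick sanity check, printf can do the conversion:
>
> # printf '%x\n' 1099511627812
> 10000000024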
>
> --- * ---
>
> 4) Get the osd pool details
> # ceph osd pool ls detail
> pool 1 'cephfs_dt' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 196 flags
> hashpspool crash_replay_interval 45 stripe_width 0
> pool 2 'cephfs_mt' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 182 flags
> hashpspool stripe_width 0
>
> --- * ---
>
> 5) Get the file / PG / OSD mapping
>
> # ceph osd map cephfs_dt 10000000024.00000000
> osdmap e479 pool 'cephfs_dt' (1) object '10000000024.00000000' -> pg
> 1.c18fbb6f (1.36f) -> up ([19,15,6], p19) acting ([19,15,6], p19)
> # ceph osd map cephfs_mt 10000000024.00000000
> osdmap e479 pool 'cephfs_mt' (2) object '10000000024.00000000' -> pg
> 2.c18fbb6f (2.36f) -> up ([27,23,13], p27) acting ([27,23,13], p27)
>
> --- * ---
>
> 6) Stop the relevant osd daemons, unmount the osd partitions and delete
> the partitions:
>
> [root@server1 ~]# for o in 6; do dev=`df /var/lib/ceph/osd/ceph-$o | tail
> -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount
> /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s  ${dev::8}
> rm 2; partprobe; done
> [root@server2 ~]# for o in 13 15; do dev=`df /var/lib/ceph/osd/ceph-$o |
> tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount
> /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s  ${dev::8}
> rm 2; partprobe; done
> [root@server3 ~]# for o in 19 23; do dev=`df /var/lib/ceph/osd/ceph-$o |
> tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount
> /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s  ${dev::8}
> rm 2; partprobe; done
> [root@server4 ~]# for o in 27; do dev=`df /var/lib/ceph/osd/ceph-$o |
> tail -n 1 | awk '{print $1}'`; /etc/init.d/ceph stop osd.$o; umount
> /var/lib/ceph/osd/ceph-$o; parted -s ${dev::8} rm 1; parted -s  ${dev::8}
> rm 2; partprobe; done
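>
> In a more readable form, what each of the loops above does for an osd id
> $o is roughly the following (${dev::8} keeps the first 8 characters of
> the device name, i.e. it assumes devices of the form /dev/sdX1):
>
> # dev=`df /var/lib/ceph/osd/ceph-$o | tail -n 1 | awk '{print $1}'`
> # /etc/init.d/ceph stop osd.$o
> # umount /var/lib/ceph/osd/ceph-$o
> # parted -s ${dev::8} rm 1
> # parted -s ${dev::8} rm 2
> # partprobe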
>
>
> # ###############################
> # ANNEX 2: RECOVER CEPH CLUSTER #
> # ###############################
>
> 1) Declare the OSDs lost
>
> # for o in 6 13 15 19 23 27;do ceph osd lost $o --yes-i-really-mean-it;
> done
> marked osd lost in epoch 480
> marked osd lost in epoch 482
> marked osd lost in epoch 487
> marked osd lost in epoch 483
> marked osd lost in epoch 489
> marked osd lost in epoch 485
>
> --- * ---
>
> 2) Remove OSDs from CRUSH map
>
> # for o in 6 13 15 19 23 27;do ceph osd crush remove osd.$o; ceph osd down
> $o; ceph osd rm $o; ceph auth del osd.$o; done
> removed item id 6 name 'osd.6' from crush map
> osd.6 is already down.
> removed osd.6
> updated
> removed item id 13 name 'osd.13' from crush map
> osd.13 is already down.
> removed osd.13
> updated
> removed item id 15 name 'osd.15' from crush map
> osd.15 is already down.
> removed osd.15
> updated
> removed item id 19 name 'osd.19' from crush map
> osd.19 is already down.
> removed osd.19
> updated
> removed item id 23 name 'osd.23' from crush map
> osd.23 is already down.
> removed osd.23
> updated
> removed item id 27 name 'osd.27' from crush map
> osd.27 is already down.
> removed osd.27
> updated
>
> --- * ---
>
> 3) Give the cluster time to react, and let the recovery I/O finish.
>
> --- * ---
>
> 4) Check which PGs are still stale
>
> # ceph pg dump_stuck stale
> ok
> pg_stat    state    up    up_primary    acting    acting_primary
> 1.23    stale+undersized+degraded+peered    [23]    23    [23]    23
> 2.38b    stale+undersized+degraded+peered    [23]    23    [23]    23
> (...)
>
> --- * ---
>
> 5) Try to query those stale PGs
>
> # for pg in `ceph pg dump_stuck stale | grep ^[12]  | awk '{print $1}'`;
> do ceph pg $pg query; done
> ok
> Error ENOENT: i don't have pgid 1.23
> Error ENOENT: i don't have pgid 2.38b
> (...)
>
> --- * ---
>
> 6) Create the non-existing PGs
>
> # for pg in `ceph pg dump_stuck stale | grep ^[12]  | awk '{print $1}'`;
> do ceph pg force_create_pg $pg; done
> ok
> pg 1.23 now creating, ok
> pg 2.38b now creating, ok
> (...)
>
> --- * ---
>
> 7) At this point, for the PGs to leave the 'creating' status, I had to
> restart all remaining OSDs; otherwise those PGs stayed in the creating
> state forever.
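>
> The restart itself was nothing fancy; on each OSD server it was something
> along the lines of the following (assuming the sysvinit script accepts
> the plain 'osd' target to restart all local osd daemons; otherwise the
> individual osd.$o ids can be given instead):
>
> # /etc/init.d/ceph restart osd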
>
>
>
>
> --
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW  2006
> T: +61 2 93511937
>
>
>
>


-- 
Email:
 - shin...@linux.com
Blog:
 - Life with Distributed Computational System based on OpenSource
<http://i-shinobu.hatenablog.com/>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
