Hi John,

Thanks, I will look into it. Is there a release date for Giant yet?
Jasper
________________________________________
From: john.sp...@inktank.com [john.sp...@inktank.com] on behalf of John Spray [john.sp...@redhat.com]
Sent: Thursday, 16 October 2014 12:23
To: Jasper Siero
CC: Gregory Farnum; ceph-users
Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

Following up: the firefly fix for undump is https://github.com/ceph/ceph/pull/2734

Jasper: if you still need to try undumping on this existing firefly cluster, you can download ceph-mds packages built from the wip-firefly-undump branch from http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/

Cheers,
John

On Wed, Oct 15, 2014 at 8:15 PM, John Spray <john.sp...@redhat.com> wrote:
> Sadly, undump has been broken for quite some time (it was fixed in giant as part of creating cephfs-journal-tool). If there's a one-line fix for this then it's probably worth putting in firefly, since it's a long-term supported branch -- I'll do that now.
>
> John
>
> On Wed, Oct 15, 2014 at 8:23 AM, Jasper Siero <jasper.si...@target-holding.nl> wrote:
>> Hello Greg,
>>
>> The dump and reset of the journal were successful:
>>
>> [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph --dump-journal 0 journaldumptgho-mon001
>> journal is 9483323613~134215459
>> read 134213311 bytes at offset 9483323613
>> wrote 134213311 bytes at offset 9483323613 to journaldumptgho-mon001
>> NOTE: this is a _sparse_ file; you can
>>   $ tar cSzf journaldumptgho-mon001.tgz journaldumptgho-mon001
>> to efficiently compress it while preserving sparseness.
>>
>> [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph --reset-journal 0
>> old journal was 9483323613~134215459
>> new journal start will be 9621733376 (4194304 bytes past old end)
>> writing journal head
>> writing EResetJournal entry
>> done
>>
>> Undumping the journal was not successful, and looking into the error, "client_lock.is_locked()" is shown several times. The mds is not running when I start the undumping, so maybe I have forgotten something?
>>
>> [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --pid-file /var/run/ceph/mds.th1-mon001.pid -c /etc/ceph/ceph.conf --cluster ceph --undump-journal 0 journaldumptgho-mon001
>> undump journaldumptgho-mon001
>> start 9483323613 len 134213311
>> writing header 200.00000000
>> osdc/Objecter.cc: In function 'ceph_tid_t Objecter::op_submit(Objecter::Op*)' thread 7fec3e5ad7a0 time 2014-10-15 09:09:32.020287
>> osdc/Objecter.cc: 1225: FAILED assert(client_lock.is_locked())
>> ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
>> 1: /usr/bin/ceph-mds() [0x80f15e]
>> 2: (Dumper::undump(char const*)+0x65d) [0x56c7ad]
>> 3: (main()+0x1632) [0x569c62]
>> 4: (__libc_start_main()+0xfd) [0x7fec3ca68d5d]
>> 5: /usr/bin/ceph-mds() [0x567d99]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> 2014-10-15 09:09:32.021313 7fec3e5ad7a0 -1 osdc/Objecter.cc: In function 'ceph_tid_t Objecter::op_submit(Objecter::Op*)' thread 7fec3e5ad7a0 time 2014-10-15 09:09:32.020287
>> osdc/Objecter.cc: 1225: FAILED assert(client_lock.is_locked())
>>
>> ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
>> 1: /usr/bin/ceph-mds() [0x80f15e]
>> 2: (Dumper::undump(char const*)+0x65d) [0x56c7ad]
>> 3: (main()+0x1632) [0x569c62]
>> 4: (__libc_start_main()+0xfd) [0x7fec3ca68d5d]
>> 5: /usr/bin/ceph-mds() [0x567d99]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> 0> 2014-10-15 09:09:32.021313 7fec3e5ad7a0 -1 osdc/Objecter.cc: In function 'ceph_tid_t Objecter::op_submit(Objecter::Op*)' thread 7fec3e5ad7a0 time 2014-10-15 09:09:32.020287
>> osdc/Objecter.cc: 1225: FAILED assert(client_lock.is_locked())
>>
>> ceph version 0.80.5 (38b73c67d375a2552d8ed67843c
>> [root@th1-mon001 ~]# /usr/bin/ceph-mds -i th1-mon001 --p8a65c2c0feba6)
>> 1: /usr/bin/ceph-mds() [0x80f15e]
>> 2: (Dumper::undump(char const*)+0x65d) [0x56c7ad]
>> 3: (main()+0x1632) [0x569c62]
>> 4: (__libc_start_main()+0xfd) [0x7fec3ca68d5d]
>> 5: /usr/bin/ceph-mds() [0x567d99]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> terminate called after throwing an instance of 'ceph::FailedAssertion'
>> *** Caught signal (Aborted) **
>> in thread 7fec3e5ad7a0
>> ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
>> 1: /usr/bin/ceph-mds() [0x82ef61]
>> 2: (()+0xf710) [0x7fec3d9a6710]
>> 3: (gsignal()+0x35) [0x7fec3ca7c635]
>> 4: (abort()+0x175) [0x7fec3ca7de15]
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fec3d336a5d]
>> 6: (()+0xbcbe6) [0x7fec3d334be6]
>> 7: (()+0xbcc13) [0x7fec3d334c13]
>> 8: (()+0xbcd0e) [0x7fec3d334d0e]
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f2) [0x94b812]
>> 10: /usr/bin/ceph-mds() [0x80f15e]
>> 11: (Dumper::undump(char const*)+0x65d) [0x56c7ad]
>> 12: (main()+0x1632) [0x569c62]
>> 13: (__libc_start_main()+0xfd) [0x7fec3ca68d5d]
>> 14: /usr/bin/ceph-mds() [0x567d99]
>> 2014-10-15 09:09:32.024248 7fec3e5ad7a0 -1 *** Caught signal (Aborted) **
>> in thread 7fec3e5ad7a0
>>
>> ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
>> 1: /usr/bin/ceph-mds() [0x82ef61]
>> 2: (()+0xf710) [0x7fec3d9a6710]
>> 3: (gsignal()+0x35) [0x7fec3ca7c635]
>> 4: (abort()+0x175) [0x7fec3ca7de15]
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fec3d336a5d]
>> 6: (()+0xbcbe6) [0x7fec3d334be6]
>> 7: (()+0xbcc13) [0x7fec3d334c13]
>> 8: (()+0xbcd0e) [0x7fec3d334d0e]
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f2) [0x94b812]
>> 10: /usr/bin/ceph-mds() [0x80f15e]
>> 11: (Dumper::undump(char const*)+0x65d) [0x56c7ad]
>> 12: (main()+0x1632) [0x569c62]
>> 13: (__libc_start_main()+0xfd) [0x7fec3ca68d5d]
>> 14: /usr/bin/ceph-mds() [0x567d99]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> 0> 2014-10-15 09:09:32.024248 7fec3e5ad7a0 -1 *** Caught signal (Aborted) **
>> in thread 7fec3e5ad7a0
>>
>> ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
>> 1: /usr/bin/ceph-mds() [0x82ef61]
>> 2: (()+0xf710) [0x7fec3d9a6710]
>> 3: (gsignal()+0x35) [0x7fec3ca7c635]
>> 4: (abort()+0x175) [0x7fec3ca7de15]
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fec3d336a5d]
>> 6: (()+0xbcbe6) [0x7fec3d334be6]
>> 7: (()+0xbcc13) [0x7fec3d334c13]
>> 8: (()+0xbcd0e) [0x7fec3d334d0e]
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f2) [0x94b812]
>> 10: /usr/bin/ceph-mds() [0x80f15e]
>> 11: (Dumper::undump(char const*)+0x65d) [0x56c7ad]
>> 12: (main()+0x1632) [0x569c62]
>> 13: (__libc_start_main()+0xfd) [0x7fec3ca68d5d]
>> 14: /usr/bin/ceph-mds() [0x567d99]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> Aborted
>>
>> Jasper
>> ________________________________________
>> From: Gregory Farnum [g...@inktank.com]
>> Sent: Tuesday, 14 October 2014 23:40
>> To: Jasper Siero
>> CC: ceph-users
>> Subject: Re: [ceph-users] mds isn't working anymore after osd's running full
>>
>> ceph-mds --undump-journal <rank> <journal-file>
>> Looks like it accidentally (or on purpose? you can break things with it) got left out of the help text.
>>
>> On Tue, Oct 14, 2014 at 8:19 AM, Jasper Siero <jasper.si...@target-holding.nl> wrote:
>>> Hello Greg,
>>>
>>> I dumped the journal successfully to a file:
>>>
>>> journal is 9483323613~134215459
>>> read 134213311 bytes at offset 9483323613
>>> wrote 134213311 bytes at offset 9483323613 to journaldumptgho
>>> NOTE: this is a _sparse_ file; you can
>>>   $ tar cSzf journaldumptgho.tgz journaldumptgho
>>> to efficiently compress it while preserving sparseness.
>>>
>>> I see the option for resetting the mds journal, but I can't find the option for undumping/importing the journal:
>>>
>>> usage: ceph-mds -i name [flags] [[--journal_check rank]|[--hot-standby][rank]]
>>>   -m monitorip:port
>>>         connect to monitor at given address
>>>   --debug_mds n
>>>         debug MDS level (e.g. 10)
>>>   --dump-journal rank filename
>>>         dump the MDS journal (binary) for rank.
>>>   --dump-journal-entries rank filename
>>>         dump the MDS journal (JSON) for rank.
>>>   --journal-check rank
>>>         replay the journal for rank, then exit
>>>   --hot-standby rank
>>>         start up as a hot standby for rank
>>>   --reset-journal rank
>>>         discard the MDS journal for rank, and replace it with a single event that updates/resets inotable and sessionmap on replay.
>>>
>>> Do you know how to "undump" the journal back into ceph?
>>>
>>> Jasper
>>>
>>> ________________________________________
>>> From: Gregory Farnum [g...@inktank.com]
>>> Sent: Friday, 10 October 2014 23:45
>>> To: Jasper Siero
>>> CC: ceph-users
>>> Subject: Re: [ceph-users] mds isn't working anymore after osd's running full
>>>
>>> Ugh, "debug journaler", not "debug journaled."
>>>
>>> That said, the filer output tells me that you're missing an object out of the MDS log (200.000008f5). I think this issue should be resolved if you "dump" the journal to a file, "reset" it, and then "undump" it. (These are commands you can invoke from ceph-mds.)
>>> I haven't done this myself in a long time, so there may be some hard edges around it. In particular, I'm not sure if the dumped journal file will stop when the data stops, or if it will be a little too long.
>>> If so, we can fix that by truncating the dumped file to the proper length and resetting and undumping again.
>>> (And just to harp on it, this journal manipulation is a lot simpler in Giant... ;) )
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>> On Wed, Oct 8, 2014 at 7:11 AM, Jasper Siero <jasper.si...@target-holding.nl> wrote:
>>>> Hello Greg,
>>>>
>>>> No problem, thanks for looking into the log. I attached the log to this email.
>>>> I'm looking forward to the new release because it would be nice to have more possibilities to diagnose problems.
>>>>
>>>> Kind regards,
>>>>
>>>> Jasper Siero
>>>> ________________________________________
>>>> From: Gregory Farnum [g...@inktank.com]
>>>> Sent: Tuesday, 7 October 2014 19:45
>>>> To: Jasper Siero
>>>> CC: ceph-users
>>>> Subject: Re: [ceph-users] mds isn't working anymore after osd's running full
>>>>
>>>> Sorry; I guess this fell off my radar.
>>>>
>>>> The issue here is not that it's waiting for an osdmap; it got the requested map and went into replay mode almost immediately. In fact the log looks good except that it seems to finish replaying the log and then simply fail to transition into active. Generate a new one, adding in "debug journaled = 20" and "debug filer = 20", and we can probably figure out how to fix it.
>>>> (This diagnosis is much easier in the upcoming Giant!)
>>>> -Greg
>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>
>>>>
>>>> On Tue, Oct 7, 2014 at 7:55 AM, Jasper Siero <jasper.si...@target-holding.nl> wrote:
>>>>> Hello Gregory,
>>>>>
>>>>> We still have the same problems with our test ceph cluster and didn't receive a reply from you after I sent you the requested log files. Do you know if it's possible to get our cephfs filesystem working again, or is it better to give up the files on cephfs and start over again?
>>>>>
>>>>> We restarted the cluster several times but it's still degraded:
>>>>> [root@th1-mon001 ~]# ceph -w
>>>>>     cluster c78209f5-55ea-4c70-8968-2231d2b05560
>>>>>      health HEALTH_WARN mds cluster is degraded
>>>>>      monmap e3: 3 mons at {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0}, election epoch 432, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003
>>>>>      mdsmap e190: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby
>>>>>      osdmap e2248: 12 osds: 12 up, 12 in
>>>>>       pgmap v197548: 492 pgs, 4 pools, 60297 MB data, 470 kobjects
>>>>>             124 GB used, 175 GB / 299 GB avail
>>>>>                  491 active+clean
>>>>>                    1 active+clean+scrubbing+deep
>>>>>
>>>>> One placement group stays in the deep scrubbing phase.
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Jasper Siero
>>>>>
>>>>>
>>>>> ________________________________________
>>>>> From: Jasper Siero
>>>>> Sent: Thursday, 21 August 2014 16:43
>>>>> To: Gregory Farnum
>>>>> Subject: RE: [ceph-users] mds isn't working anymore after osd's running full
>>>>>
>>>>> I did restart it, and you are right that the epoch number has changed, but the situation looks the same:
>>>>> 2014-08-21 16:33:06.032366 7f9b5f3cd700 1 mds.0.27 need osdmap epoch 1994, have 1993
>>>>> 2014-08-21 16:33:06.032368 7f9b5f3cd700 1 mds.0.27 waiting for osdmap 1994 (which blacklists prior instance)
>>>>> I started the mds with the debug options and attached the log.
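A minimal sketch of how the debug settings requested in this thread could be collected in ceph.conf on the MDS host, assuming they go under the [mds] section (the daemon has to be restarted for settings read from the file to take effect):

    [mds]
        # settings Greg asked for when diagnosing the stuck replay
        debug mds = 20
        debug objecter = 20
        debug ms = 1
        # settings for the journal/filer diagnosis ("journaler", not "journaled")
        debug journaler = 20
        debug filer = 20

If the admin socket is available, the same values can usually also be injected at runtime (for example with "ceph daemon mds.th1-mon001 config set debug_mds 20"), but editing ceph.conf and restarting the MDS is the simpler path here.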
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jasper
>>>>> ________________________________________
>>>>> From: Gregory Farnum [g...@inktank.com]
>>>>> Sent: Wednesday, 20 August 2014 18:38
>>>>> To: Jasper Siero
>>>>> CC: ceph-users@lists.ceph.com
>>>>> Subject: Re: [ceph-users] mds isn't working anymore after osd's running full
>>>>>
>>>>> After restarting your MDS, it still says it has epoch 1832 and needs epoch 1833? I think you didn't really restart it.
>>>>> If the epoch numbers have changed, can you restart it with "debug mds = 20", "debug objecter = 20", "debug ms = 1" in the ceph.conf and post the resulting log file somewhere?
>>>>> -Greg
>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>
>>>>>
>>>>> On Wed, Aug 20, 2014 at 12:49 AM, Jasper Siero <jasper.si...@target-holding.nl> wrote:
>>>>>> Unfortunately that doesn't help. I restarted both the active and standby mds, but that doesn't change the state of the mds. Is there a way to force the mds to look at the 1832 epoch (or earlier) instead of 1833 (need osdmap epoch 1833, have 1832)?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jasper
>>>>>> ________________________________________
>>>>>> From: Gregory Farnum [g...@inktank.com]
>>>>>> Sent: Tuesday, 19 August 2014 19:49
>>>>>> To: Jasper Siero
>>>>>> CC: ceph-users@lists.ceph.com
>>>>>> Subject: Re: [ceph-users] mds isn't working anymore after osd's running full
>>>>>>
>>>>>> On Mon, Aug 18, 2014 at 6:56 AM, Jasper Siero <jasper.si...@target-holding.nl> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We have a small ceph cluster running version 0.80.1 with cephfs on five nodes.
>>>>>>> Last week some osd's were full and shut themselves down. To help the osd's start again, I added some extra osd's and moved some placement group directories on the full osd's (which have a copy on another osd) to another place on the node (as mentioned in http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/).
>>>>>>> After clearing some space on the full osd's I started them again. After a lot of deep scrubbing and two pg inconsistencies which needed to be repaired, everything looked fine except the mds, which is still in the replay state and stays that way.
>>>>>>> The log below says that the mds needs osdmap epoch 1833 and has 1832.
>>>>>>>
>>>>>>> 2014-08-18 12:29:22.268248 7fa786182700 1 mds.-1.0 handle_mds_map standby
>>>>>>> 2014-08-18 12:29:22.273995 7fa786182700 1 mds.0.25 handle_mds_map i am now mds.0.25
>>>>>>> 2014-08-18 12:29:22.273998 7fa786182700 1 mds.0.25 handle_mds_map state change up:standby --> up:replay
>>>>>>> 2014-08-18 12:29:22.274000 7fa786182700 1 mds.0.25 replay_start
>>>>>>> 2014-08-18 12:29:22.274014 7fa786182700 1 mds.0.25 recovery set is
>>>>>>> 2014-08-18 12:29:22.274016 7fa786182700 1 mds.0.25 need osdmap epoch 1833, have 1832
>>>>>>> 2014-08-18 12:29:22.274017 7fa786182700 1 mds.0.25 waiting for osdmap 1833 (which blacklists prior instance)
>>>>>>>
>>>>>>> # ceph status
>>>>>>>     cluster c78209f5-55ea-4c70-8968-2231d2b05560
>>>>>>>      health HEALTH_WARN mds cluster is degraded
>>>>>>>      monmap e3: 3 mons at {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0}, election epoch 362, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003
>>>>>>>      mdsmap e154: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby
>>>>>>>      osdmap e1951: 12 osds: 12 up, 12 in
>>>>>>>       pgmap v193685: 492 pgs, 4 pools, 60297 MB data, 470 kobjects
>>>>>>>             124 GB used, 175 GB / 299 GB avail
>>>>>>>                  492 active+clean
>>>>>>>
>>>>>>> # ceph osd tree
>>>>>>> # id    weight   type name              up/down  reweight
>>>>>>> -1      0.2399   root default
>>>>>>> -2      0.05997          host th1-osd001
>>>>>>> 0       0.01999                  osd.0   up       1
>>>>>>> 1       0.01999                  osd.1   up       1
>>>>>>> 2       0.01999                  osd.2   up       1
>>>>>>> -3      0.05997          host th1-osd002
>>>>>>> 3       0.01999                  osd.3   up       1
>>>>>>> 4       0.01999                  osd.4   up       1
>>>>>>> 5       0.01999                  osd.5   up       1
>>>>>>> -4      0.05997          host th1-mon003
>>>>>>> 6       0.01999                  osd.6   up       1
>>>>>>> 7       0.01999                  osd.7   up       1
>>>>>>> 8       0.01999                  osd.8   up       1
>>>>>>> -5      0.05997          host th1-mon002
>>>>>>> 9       0.01999                  osd.9   up       1
>>>>>>> 10      0.01999                  osd.10  up       1
>>>>>>> 11      0.01999                  osd.11  up       1
>>>>>>>
>>>>>>> What is the way to get the mds up and running again?
>>>>>>>
>>>>>>> I still have all the placement group directories which I moved from the full osds (which were down) to create disk space.
>>>>>>
>>>>>> Try just restarting the MDS daemon. This sounds a little familiar, so I think it's a known bug which may be fixed in a later dev or point release of the MDS, but it's a soft-state rather than a disk-state issue.
>>>>>> -Greg
>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
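For readers who hit the same problem, a condensed sketch of the recovery sequence discussed in this thread, assuming a firefly-era cluster managed with sysvinit scripts like the one above; the MDS name, rank 0 and dump filename are the ones from Jasper's cluster, and the undump step needs a ceph-mds build containing the fix from pull request 2734 (the wip-firefly-undump packages or a giant release), since the stock 0.80.5 binary aborts with the client_lock assertion shown earlier:

    # stop the MDS so nothing else touches the journal while it is manipulated
    service ceph stop mds.th1-mon001

    # dump the rank 0 journal to a local file, then reset the on-disk journal
    /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --cluster ceph --dump-journal 0 journaldumptgho-mon001
    /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --cluster ceph --reset-journal 0

    # write the dumped journal back (this is the step that requires the fixed binary)
    /usr/bin/ceph-mds -i th1-mon001 -c /etc/ceph/ceph.conf --cluster ceph --undump-journal 0 journaldumptgho-mon001

    # start the MDS again and watch it leave up:replay
    service ceph start mds.th1-mon001
    ceph -w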