I am attempting an operating system upgrade of a live Ceph cluster. Before I go and screw up my production system, I have been testing on a smaller installation, and I keep running into issues when bringing the Ceph FS metadata server online.
My approach here has been to store all Ceph critical files on non-root partitions, so the OS install can safely proceed without overwriting any of the Ceph configuration or data.

Here is how I proceed:

First, I bring down the Ceph FS via `ceph mds cluster_down`.
Second, to prevent OSDs from trying to repair data, I run `ceph osd set noout`.
Finally, I stop the ceph processes in the following order: ceph-mds, ceph-mon, ceph-osd.

Note my cluster has 1 mds, 1 mon, and 7 osd.

I then install the new OS and bring the cluster back up by walking the steps in reverse:

First, I start the ceph processes in the following order: ceph-osd, ceph-mon, ceph-mds.
Second, I restore OSD functionality with `ceph osd unset noout`.
Finally, I bring up the Ceph FS via `ceph mds cluster_up`.

Everything works smoothly except the Ceph FS bring up. The MDS starts in the active:replay state and eventually crashes with the following backtrace:

starting mds.cuba at :/0
2017-02-27 16:56:08.233680 7f31daa3b7c0 -1 mds.-1.0 log_to_monitors {default=true}
2017-02-27 16:56:08.537714 7f31d30df700 -1 mds.0.sessionmap _load_finish got (2) No such file or directory
mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
mds/SessionMap.cc: 98: FAILED assert(0 == "failed to load sessionmap")
 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x98bb4b]
 2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
 3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
 4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
 5: (()+0x8192) [0x7f31d9c8f192]
 6: (clone()+0x6d) [0x7f31d919c51d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2017-02-27 16:56:08.538493 7f31d30df700 -1 mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
mds/SessionMap.cc: 98: FAILED assert(0 == "failed to load sessionmap")
 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x98bb4b]
 2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
 3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
 4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
 5: (()+0x8192) [0x7f31d9c8f192]
 6: (clone()+0x6d) [0x7f31d919c51d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
 -106> 2017-02-27 16:56:08.233680 7f31daa3b7c0 -1 mds.-1.0 log_to_monitors {default=true}
 -1> 2017-02-27 16:56:08.537714 7f31d30df700 -1 mds.0.sessionmap _load_finish got (2) No such file or directory
 0> 2017-02-27 16:56:08.538493 7f31d30df700 -1 mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f31d30df700 time 2017-02-27 16:56:08.537739
mds/SessionMap.cc: 98: FAILED assert(0 == "failed to load sessionmap")
 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x98bb4b]
 2: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
 3: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
 4: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
 5: (()+0x8192) [0x7f31d9c8f192]
 6: (clone()+0x6d) [0x7f31d919c51d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
 in thread 7f31d30df700
 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: ceph_mds() [0x89984a]
 2: (()+0x10350) [0x7f31d9c97350]
 3: (gsignal()+0x39) [0x7f31d90d8c49]
 4: (abort()+0x148) [0x7f31d90dc058]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
 6: (()+0x5e6f6) [0x7f31d99e16f6]
 7: (()+0x5e723) [0x7f31d99e1723]
 8: (()+0x5e942) [0x7f31d99e1942]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x98bd38]
 10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
 11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
 12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
 13: (()+0x8192) [0x7f31d9c8f192]
 14: (clone()+0x6d) [0x7f31d919c51d]
2017-02-27 16:56:08.540155 7f31d30df700 -1 *** Caught signal (Aborted) **
 in thread 7f31d30df700
 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: ceph_mds() [0x89984a]
 2: (()+0x10350) [0x7f31d9c97350]
 3: (gsignal()+0x39) [0x7f31d90d8c49]
 4: (abort()+0x148) [0x7f31d90dc058]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
 6: (()+0x5e6f6) [0x7f31d99e16f6]
 7: (()+0x5e723) [0x7f31d99e1723]
 8: (()+0x5e942) [0x7f31d99e1942]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x98bd38]
 10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
 11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
 12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
 13: (()+0x8192) [0x7f31d9c8f192]
 14: (clone()+0x6d) [0x7f31d919c51d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
 0> 2017-02-27 16:56:08.540155 7f31d30df700 -1 *** Caught signal (Aborted) **
 in thread 7f31d30df700
 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: ceph_mds() [0x89984a]
 2: (()+0x10350) [0x7f31d9c97350]
 3: (gsignal()+0x39) [0x7f31d90d8c49]
 4: (abort()+0x148) [0x7f31d90dc058]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f31d99e3555]
 6: (()+0x5e6f6) [0x7f31d99e16f6]
 7: (()+0x5e723) [0x7f31d99e1723]
 8: (()+0x5e942) [0x7f31d99e1942]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x98bd38]
 10: (SessionMap::_load_finish(int, ceph::buffer::list&)+0x2b4) [0x7df2a4]
 11: (MDSIOContextBase::complete(int)+0x95) [0x7e34b5]
 12: (Finisher::finisher_thread_entry()+0x190) [0x8bd6d0]
 13: (()+0x8192) [0x7f31d9c8f192]
 14: (clone()+0x6d) [0x7f31d919c51d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

How can I safely stop a Ceph cluster, so that it will cleanly start back up again?

-Chris
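
For reference, the stop/start ordering described in this message can be written down as a small dry-run plan. This is only a sketch of the sequence as stated, not a recommended procedure: the four `ceph ...` commands are the ones quoted in the message, while the `<stop ...>`, `<start ...>` and `<install new OS>` entries are placeholders, because the message does not say which init-system commands were used. The names `SHUTDOWN`, `INVERSE` and `startup_plan` are made up for illustration. Nothing is executed; the steps are only printed.

```python
# Dry-run sketch of the shutdown/startup ordering described in the message.
# Only the four `ceph ...` commands come from the message; the bracketed
# entries are placeholders for daemon stop/start commands that are not given.

SHUTDOWN = [
    "ceph mds cluster_down",  # bring down the Ceph FS
    "ceph osd set noout",     # keep OSDs from trying to repair data
    "<stop ceph-mds>",
    "<stop ceph-mon>",
    "<stop ceph-osd>",
]

# Each shutdown step mapped to the step that undoes it.
INVERSE = {
    "ceph mds cluster_down": "ceph mds cluster_up",
    "ceph osd set noout": "ceph osd unset noout",
    "<stop ceph-mds>": "<start ceph-mds>",
    "<stop ceph-mon>": "<start ceph-mon>",
    "<stop ceph-osd>": "<start ceph-osd>",
}


def startup_plan(shutdown):
    """Return the inverse of every shutdown step, in reverse order."""
    return [INVERSE[step] for step in reversed(shutdown)]


if __name__ == "__main__":
    # Print the whole plan; nothing is run against a cluster.
    for step in SHUTDOWN + ["<install new OS>"] + startup_plan(SHUTDOWN):
        print(step)
```

Deriving the startup list from the shutdown list keeps the two halves in sync, which is the "walking the steps in reverse" approach described in the message.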
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com