Hi,

Did you restart all of the Ceph services just on node 1 so far, or did you restart the mons on each node first, then the managers on each node, and so on? I have seen a similar issue occur during Ceph upgrades when services are restarted out of order (i.e., all Ceph services restarted on a single node at once).
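For reference, the order I mean can be sketched roughly as below. This is a dry-run sketch, not a tested procedure for your cluster: node1..node3 are placeholder hostnames, and the systemd target names assume a package-based (non-cephadm) install. With DRY_RUN=1 the commands are only printed, never executed.

```shell
# Restart one daemon type at a time across ALL nodes,
# rather than every daemon on a single node.
DRY_RUN=1
run() { if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi; }

# mons first, then mgrs, then osds, then rgw -- each type on all nodes
# before moving to the next type.
for type in ceph-mon.target ceph-mgr.target ceph-osd.target ceph-radosgw.target; do
  for host in node1 node2 node3; do
    run ssh "$host" sudo systemctl restart "$type"
  done
done
```

(You would also normally wait for HEALTH_OK between each restart; that check is omitted here for brevity.)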
Regards,
Bailey

>-----Original Message-----
>From: s.smagu...@gmail.com <s.smagu...@gmail.com>
>Sent: July 26, 2023 4:34 AM
>To: ceph-users@ceph.io
>Subject: [ceph-users] OSD stuck on booting state after upgrade (v15.2.17 -> v17.2.6)
>
>We are updating our Ceph cluster from Octopus (v15.2.17) to Quincy (v17.2.6).
>
>We used ceph-deploy to update all Ceph packages on all hosts, and then we restarted the services one by one (mon -> mgr -> osd -> rgw).
>During the restart on the first node, all OSDs hit an issue: they never transitioned to the "up" state and got stuck in the "booting" state.
>
># ceph daemon osd.3 status
>{
>    "cluster_fsid": "f95b201c-4cd6-4c36-a54e-7f2b68608b8f",
>    "osd_fsid": "b0141718-a2ac-4a26-808b-17b6741b789e",
>    "whoami": 3,
>    "state": "booting",
>    "oldest_map": 4437792,
>    "newest_map": 4441114,
>    "num_pgs": 29
>}
>
>When we ran "ceph osd require-osd-release quincy", the monitor service crashed with the following errors.
>
># ceph report | jq '.osdmap.require_osd_release'
>"nautilus"
>
>    -2> 2023-07-25T12:10:20.977+0600 7f245a84f700  5 mon.ceph-ph-mon1-dc3@0(leader).paxos(paxos updating c 81819224..81819937) is_readable = 1 - now=2023-07-25T12:10:20.981801+0600 lease_expire=2023-07-25T12:10:25.959818+0600 has v0 lc 81819937
>    -1> 2023-07-25T12:10:20.997+0600 7f245a84f700 -1 /build/ceph-17.2.6/src/mon/OSDMonitor.cc: In function 'bool OSDMonitor::prepare_command_impl(MonOpRequestRef, const cmdmap_t&)' thread 7f245a84f700 time 2023-07-25T12:10:20.981991+0600
>/build/ceph-17.2.6/src/mon/OSDMonitor.cc: 11631: FAILED ceph_assert(osdmap.require_osd_release >= ceph_release_t::octopus)
>
> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x7f24629d3878]
> 2: /usr/lib/ceph/libceph-common.so.2(+0x27da8a) [0x7f24629d3a8a]
> 3: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&)+0xcb13) [0x5569f209a823]
> 4: (OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x45f) [0x5569f20ab89f]
> 5: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x162) [0x5569f20baa42]
> 6: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x716) [0x5569f201fd86]
> 7: (PaxosService::C_RetryMessage::_finish(int)+0x6c) [0x5569f1f4f93c]
> 8: (C_MonOp::finish(int)+0x4b) [0x5569f1ebbb3b]
> 9: (Context::complete(int)+0xd) [0x5569f1ebaa0d]
> 10: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xb0) [0x5569f1ef11e0]
> 11: (Paxos::finish_round()+0xb1) [0x5569f2015a61]
> 12: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0x11e3) [0x5569f20172a3]
> 13: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x49f) [0x5569f2019f7f]
> 14: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x14f4) [0x5569f1eb7f34]
> 15: (Monitor::_ms_dispatch(Message*)+0xa68) [0x5569f1eb8bd8]
> 16: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5d) [0x5569f1ef2c4d]
> 17: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f2462c71da0]
> 18: (DispatchQueue::entry()+0x58f) [0x7f2462c6f63f]
> 19: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f2462d40b61]
> 20: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f24624f4609]
> 21: clone()
>
>     0> 2023-07-25T12:10:21.009+0600 7f245a84f700 -1 *** Caught signal (Aborted) ** in thread 7f245a84f700 thread_name:ms_dispatch
>
> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
> 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f24625003c0]
> 2: gsignal()
> 3: abort()
> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b7) [0x7f24629d38e0]
> 5: /usr/lib/ceph/libceph-common.so.2(+0x27da8a) [0x7f24629d3a8a]
> 6: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&)+0xcb13) [0x5569f209a823]
> 7: (OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x45f) [0x5569f20ab89f]
> 8: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x162) [0x5569f20baa42]
> 9: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x716) [0x5569f201fd86]
> 10: (PaxosService::C_RetryMessage::_finish(int)+0x6c) [0x5569f1f4f93c]
> 11: (C_MonOp::finish(int)+0x4b) [0x5569f1ebbb3b]
> 12: (Context::complete(int)+0xd) [0x5569f1ebaa0d]
> 13: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xb0) [0x5569f1ef11e0]
> 14: (Paxos::finish_round()+0xb1) [0x5569f2015a61]
> 15: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0x11e3) [0x5569f20172a3]
> 16: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x49f) [0x5569f2019f7f]
> 17: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x14f4) [0x5569f1eb7f34]
> 18: (Monitor::_ms_dispatch(Message*)+0xa68) [0x5569f1eb8bd8]
> 19: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5d) [0x5569f1ef2c4d]
> 20: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x460) [0x7f2462c71da0]
> 21: (DispatchQueue::entry()+0x58f) [0x7f2462c6f63f]
> 22: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f2462d40b61]
> 23: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f24624f4609]
> 24: clone()
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
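
[Editor's note] The assertion itself points at the release gate: the Quincy mon refuses to process the command while require_osd_release (here still "nautilus") is below octopus. The sketch below shows how the gate can be inspected and, hypothetically, raised one release at a time instead of jumping straight to quincy. The command names are real ceph CLI; whether the stepped bump avoids the assert on this cluster is an assumption, not a tested fix. DRY_RUN=1 only prints the commands.

```shell
DRY_RUN=1
run() { if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi; }

# 1) Inspect the current gate (on this cluster it reports "nautilus"):
run ceph report   # then filter with: jq '.osdmap.require_osd_release'

# 2) Hypothetically step the gate one release at a time:
for rel in octopus pacific quincy; do
  run ceph osd require-osd-release "$rel"
done
```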
>
>Now the cluster looks as follows:
>
># ceph -s
>  cluster:
>    id:     f95b201c-4cd6-4c36-a54e-7f2b68608b8f
>    health: HEALTH_WARN
>            noout flag(s) set
>            12 osds down
>            1 host (12 osds) down
>            all OSDs are running octopus or later but require_osd_release < octopus
>            Degraded data redundancy: 5731463/11463184 objects degraded (49.999%), 240 pgs degraded, 315 pgs undersized
>
>  services:
>    mon: 3 daemons, quorum ceph-ph-mon1-dc3,hw-ceph-ph3-dc3,hw-ceph-ph4-dc3 (age 68m)
>    mgr: ceph-ph-mon1-dc3.alahd.kz.test.bash.kz(active, since 45m)
>    osd: 25 osds: 13 up (since 26h), 25 in (since 21h); 6 remapped pgs
>         flags noout
>    rgw: 1 daemon active (1 hosts, 1 zones)
>
>  data:
>    pools:   8 pools, 321 pgs
>    objects: 5.73M objects, 4.4 TiB
>    usage:   8.8 TiB used, 32 TiB / 40 TiB avail
>    pgs:     5731463/11463184 objects degraded (49.999%)
>             129/11463184 objects misplaced (0.001%)
>             240 active+undersized+degraded
>             75  active+undersized
>             6   active+clean+remapped
>
># ceph osd df tree
>ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP      META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
>-1         42.23981         -   22 TiB  4.7 TiB  4.6 TiB   1.2 GiB   22 GiB   17 TiB      0     0    -          root default
>-2         20.19991         -      0 B      0 B      0 B       0 B      0 B      0 B      0     0    -          host hw-ceph-ph3
> 0    hdd   1.81999   1.00000      0 B      0 B      0 B       0 B      0 B      0 B      0     0    0    down  osd.0
> 1    hdd   1.81999   1.00000  1.8 TiB  401 GiB  396 GiB    76 MiB  1.2 GiB  1.4 TiB  21.47  0.98   0    down  osd.1
> 2    hdd   1.81999   1.00000  1.8 TiB  365 GiB  362 GiB    45 KiB  1.1 GiB  1.5 TiB  19.59  0.90   0    down  osd.2
> 3    hdd   1.81999   1.00000  1.8 TiB  584 GiB  580 GiB   825 KiB  1.7 GiB  1.3 TiB  31.29  1.43   0    down  osd.3
> 4    hdd   1.81999   1.00000  1.8 TiB  365 GiB  362 GiB   621 KiB  1.1 GiB  1.5 TiB  19.57  0.90   0    down  osd.4
> 5    hdd   1.81999   1.00000  1.8 TiB  583 GiB  579 GiB    31 KiB  1.7 GiB  1.3 TiB  31.25  1.43   0    down  osd.5
> 6    hdd   1.81999   1.00000  1.8 TiB  365 GiB  362 GiB    43 KiB  1.1 GiB  1.5 TiB  19.55  0.90   0    down  osd.6
> 7    hdd   1.81999   1.00000  1.8 TiB  365 GiB  362 GiB    12 KiB  1.1 GiB  1.5 TiB  19.58  0.90   0    down  osd.7
> 8    hdd   1.81999   1.00000  1.8 TiB  365 GiB  362 GiB    27 KiB  1.1 GiB  1.5 TiB  19.58  0.90   0    down  osd.8
> 9    hdd   1.81999   1.00000  1.8 TiB  547 GiB  543 GiB    47 KiB  1.6 GiB  1.3 TiB  29.32  1.34   0    down  osd.9
>10    hdd   1.81999   1.00000  1.8 TiB  330 GiB  327 GiB    68 KiB  987 MiB  1.5 TiB  17.67  0.81   0    down  osd.10
>12    ssd   0.17999   1.00000  186 GiB  1.2 GiB   67 MiB   1.1 GiB   86 MiB  185 GiB   0.65  0.03   0    down  osd.12
>-3         22.03990         -   22 TiB  4.7 TiB  4.6 TiB   1.2 GiB   22 GiB   17 TiB  21.11  0.97   -          host hw-ceph-ph4
>11    hdd   1.81999   1.00000  1.8 TiB  550 GiB  546 GiB  1012 KiB  2.3 GiB  1.3 TiB  29.48  1.35  34      up  osd.11
>13    hdd   1.81999   1.00000  1.8 TiB  332 GiB  329 GiB   2.8 MiB  1.5 GiB  1.5 TiB  17.78  0.82  22      up  osd.13
>14    hdd   1.81999   1.00000  1.8 TiB  550 GiB  547 GiB   4.4 MiB  2.1 GiB  1.3 TiB  29.50  1.35  25      up  osd.14
>15    hdd   1.81999   1.00000  1.8 TiB  259 GiB  256 GiB   1.6 MiB  1.2 GiB  1.6 TiB  13.86  0.64  20      up  osd.15
>16    hdd   1.81999   1.00000  1.8 TiB  477 GiB  474 GiB   2.7 MiB  1.9 GiB  1.4 TiB  25.58  1.17  23      up  osd.16
>17    hdd   1.81999   1.00000  1.8 TiB  403 GiB  400 GiB   1.7 MiB  1.7 GiB  1.4 TiB  21.63  0.99  32      up  osd.17
>18    hdd   1.81999   1.00000  1.8 TiB  294 GiB  291 GiB    79 MiB  1.4 GiB  1.5 TiB  15.79  0.72  25      up  osd.18
>19    hdd   1.81999   1.00000  1.8 TiB  294 GiB  291 GiB   1.6 MiB  1.4 GiB  1.5 TiB  15.76  0.72  27      up  osd.19
>20    hdd   1.81999   1.00000  1.8 TiB  477 GiB  474 GiB   2.3 MiB  1.9 GiB  1.4 TiB  25.60  1.17  23      up  osd.20
>21    hdd   1.81999   1.00000  1.8 TiB  473 GiB  470 GiB    23 MiB  2.0 GiB  1.4 TiB  25.36  1.16  24      up  osd.21
>22    hdd   1.81999   1.00000  1.8 TiB  404 GiB  401 GiB   3.6 MiB  2.3 GiB  1.4 TiB  21.69  0.99  22      up  osd.22
>23    hdd   1.81999   1.00000  1.8 TiB  258 GiB  255 GiB   2.7 MiB  1.3 GiB  1.6 TiB  13.84  0.63  18      up  osd.23
>24    ssd   0.20000   1.00000  238 GiB  2.2 GiB   67 MiB   1.1 GiB  1.0 GiB  236 GiB   0.91  0.04  32      up  osd.24
>                        TOTAL   40 TiB  8.8 TiB  8.8 TiB   2.3 GiB   34 GiB   32 TiB  21.81
>MIN/MAX VAR: 0/1.43  STDDEV: 8.97
>
>We checked the network connectivity between the hosts, and everything is fine there.
>
>We have not restarted the OSD services on the second node because we suspect that doing so might bring down the entire cluster.
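
[Editor's note] As a sanity check on the status output above, the degraded percentage can be re-derived from the raw object-copy counts (5731463 of 11463184 copies degraded, consistent with roughly half the replicas living on the down host hw-ceph-ph3):

```shell
# Recompute the degraded percentage reported by `ceph -s` from its raw counts.
awk 'BEGIN { printf "%.3f%%\n", 5731463 / 11463184 * 100 }'   # prints 49.999%
```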
>
>Please advise on how to solve this problem.
>_______________________________________________
>ceph-users mailing list -- ceph-users@ceph.io
>To unsubscribe send an email to ceph-users-le...@ceph.io