Do you have more logs that indicate what state machine event the crashing OSDs received? This obviously shouldn't have happened, but it's a plausible failure mode, especially if it's a relatively rare combination of events.
-Greg
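For anyone reading along who hasn't looked at this code: the PG peering logic is a Boost.Statechart machine, and the assert fires from the constructor of the Crashed state, which (if memory serves) is wired up as the catch-all destination for events that a given state has no real reaction for. Below is a minimal standalone sketch of that pattern, with made-up event and state names rather than Ceph's actual RecoveryMachine; the real code, I believe, uses a transition on boost::statechart::event_base as the catch-all, while this sketch uses a named event to keep things simple.

===
// Minimal Boost.Statechart sketch (hypothetical names, NOT Ceph's actual
// RecoveryMachine): an event the current state has no useful reaction for
// is routed to a Crashed state whose constructor asserts -- the pattern
// behind "we got a bad state machine event".
#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/state.hpp>
#include <boost/statechart/transition.hpp>
#include <boost/statechart/event.hpp>
#include <boost/mpl/list.hpp>
#include <cassert>

namespace sc = boost::statechart;

struct EvActMap     : sc::event<EvActMap> {};      // an event the state expects
struct EvUnexpected : sc::event<EvUnexpected> {};  // one it does not

struct Stray;                                      // initial state, defined below
struct Machine : sc::state_machine<Machine, Stray> {};

// Mirrors the idiom in PG::RecoveryState::Crashed: entering this state
// immediately trips the assert seen in the backtrace.
struct Crashed : sc::state<Crashed, Machine> {
  explicit Crashed(my_context ctx) : my_base(ctx) {
    assert(0 == "we got a bad state machine event");
  }
};

struct Stray : sc::simple_state<Stray, Machine> {
  typedef boost::mpl::list<
      sc::transition<EvActMap, Stray>,       // normal, handled event
      sc::transition<EvUnexpected, Crashed>  // anything else -> Crashed
      > reactions;
};

int main() {
  Machine m;
  m.initiate();
  m.process_event(EvActMap());      // handled fine
  m.process_event(EvUnexpected());  // enters Crashed and trips the assert
}
===

So the interesting question is exactly the one above: which peering event reached a Stray PG on the 10.2.10 OSDs that Jewel had no reaction for, presumably something sent by the already-upgraded 12.2.7 daemons.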
On Fri, Aug 17, 2018 at 4:49 PM Kenneth Van Alstyne <kvanalst...@knightpoint.com> wrote:

> Hello all:
> I ran into an issue recently with one of my clusters when upgrading from
> 10.2.10 to 12.2.7. I have previously tested the upgrade in a lab and
> upgraded one of our five production clusters with no issues. On the second
> cluster, however, I ran into an issue where all OSDs that were NOT running
> Luminous yet (which was about 40% of the cluster at the time) all crashed
> with the same backtrace, which I have pasted below:
>
> ===
>     0> 2018-08-13 17:35:13.160849 7f145c9ec700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f145c9ec700 time 2018-08-13 17:35:13.157319
> osd/PG.cc: 5860: FAILED assert(0 == "we got a bad state machine event")
>
> ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0x55b9bf08614f]
> 2: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xc4) [0x55b9bea62db4]
> 3: (()+0x447366) [0x55b9bea9a366]
> 4: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x2f7) [0x55b9beac8b77]
> 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x55b9beaab5bb]
> 6: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x384) [0x55b9bea7db14]
> 7: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x263) [0x55b9be9d1723]
> 8: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x2a) [0x55b9bea1274a]
> 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb0) [0x55b9bf076d40]
> 10: (ThreadPool::WorkThread::entry()+0x10) [0x55b9bf077ef0]
> 11: (()+0x7507) [0x7f14e2c96507]
> 12: (clone()+0x3f) [0x7f14e0ca214f]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> ===
>
> Once I restarted the impacted OSDs, which brought them up to 12.2.7,
> everything recovered just fine and the cluster is healthy. The only rub is
> that losing that many OSDs simultaneously caused a significant I/O
> disruption to the production servers for several minutes while I brought
> up the remaining OSDs. I have been trying to duplicate this issue in a lab
> again before continuing the upgrades on the other three clusters, but am
> coming up short. Has anyone seen anything like this and am I missing
> something obvious?
>
> Given how quickly the issue happened and the fact that I’m having a hard
> time reproducing this issue, I am limited in the amount of logging and
> debug information I have available, unfortunately. If it helps, all
> ceph-mon, ceph-mds, radosgw, and ceph-mgr daemons were running 12.2.7,
> while 30 of the 50 total ceph-osd daemons were also on 12.2.7 when the
> remaining 20 ceph-osd daemons (on 10.2.10) crashed.
>
> Thanks,
>
> --
> Kenneth Van Alstyne
> Systems Architect
> Knight Point Systems, LLC
> Service-Disabled Veteran-Owned Business
> 1775 Wiehle Avenue Suite 101 | Reston, VA 20190
> c: 228-547-8045  f: 571-266-3106
> www.knightpoint.com
> DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
> GSA Schedule 70 SDVOSB: GS-35F-0646S
> GSA MOBIS Schedule: GS-10F-0404Y
> ISO 20000 / ISO 27001 / CMMI Level 3
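One thought on the remaining three clusters, since the lab reproduction is proving elusive (this is just standard upgrade hygiene, not anything specific to this bug): before restarting OSDs it may be worth setting the noout flag ("ceph osd set noout") so that a repeat of this crash doesn't also trigger rebalancing on top of the peering storm, and temporarily raising logging on the not-yet-upgraded OSDs, e.g. "ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 1'". If they do hit the assert again, the higher debug level should show which peering event each PG received just before transitioning to Crashed, which is exactly the detail missing from the current logs.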
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com