Do you have more logs that indicate what state machine event the crashing
OSDs received? This obviously shouldn't have happened, but it's a plausible
failure mode, especially if it's a relatively rare combination of events.
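
For what it's worth, frames 5 through 2 of the backtrace you pasted show
process_event dispatching an event to the Stray state, which has no specific
reaction for it and so falls through to the Crashed state; Crashed's
constructor is what raises the "we got a bad state machine event" assert. As
a rough illustration of that Boost.Statechart pattern only (a minimal,
self-contained sketch with made-up names, not the actual PG.cc code):

===
// Sketch of a catch-all reaction routing any unhandled event into a
// Crashed state whose constructor asserts.  Hypothetical names, not Ceph
// source; Boost.Statechart is header-only, so this builds with just the
// Boost headers on the include path.
#include <boost/mpl/list.hpp>
#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/statechart/state.hpp>
#include <boost/statechart/event.hpp>
#include <boost/statechart/event_base.hpp>
#include <boost/statechart/transition.hpp>
#include <cassert>

namespace sc = boost::statechart;

// An event the current state was never written to handle.
struct UnexpectedEvt : sc::event<UnexpectedEvt> {};

struct Working;
struct Machine : sc::state_machine<Machine, Working> {};

// Entering Crashed aborts the process, as in frame 2 of the pasted trace.
struct Crashed : sc::state<Crashed, Machine> {
  explicit Crashed(my_context ctx) : my_base(ctx) {
    assert(0 == "we got a bad state machine event");
  }
};

struct Working : sc::simple_state<Working, Machine> {
  // Catch-all: any event without a more specific reaction transitions to
  // Crashed.
  typedef boost::mpl::list<
    sc::transition<sc::event_base, Crashed> > reactions;
};

int main() {
  Machine m;
  m.initiate();
  m.process_event(UnexpectedEvt());  // dispatched to Crashed -> assert fires
}
===

So the interesting part is which event arrived; it should show up in the OSD
log shortly before the assert, especially with debug_osd turned up.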
-Greg

On Fri, Aug 17, 2018 at 4:49 PM Kenneth Van Alstyne <
kvanalst...@knightpoint.com> wrote:

> Hello all:
>         I recently ran into an issue with one of my clusters while
> upgrading from 10.2.10 to 12.2.7.  I had previously tested the upgrade in
> a lab and upgraded one of our five production clusters with no issues.  On
> the second cluster, however, every OSD that was NOT yet running Luminous
> (about 40% of the cluster at the time) crashed with the same backtrace,
> which I have pasted below:
>
> ===
>      0> 2018-08-13 17:35:13.160849 7f145c9ec700 -1 osd/PG.cc: In function
> 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed,
> PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f145c9ec700 time
> 2018-08-13 17:35:13.157319
> osd/PG.cc: 5860: FAILED assert(0 == "we got a bad state machine event")
>
>  ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x7f) [0x55b9bf08614f]
>  2:
> (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed,
> PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
> (boost::statechart::history_mode)0>::my_context)+0xc4) [0x55b9bea62db4]
>  3: (()+0x447366) [0x55b9bea9a366]
>  4: (boost::statechart::simple_state<PG::RecoveryState::Stray,
> PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na>,
> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
> const&, void const*)+0x2f7) [0x55b9beac8b77]
>  5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
> PG::RecoveryState::Initial, std::allocator<void>,
> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
> const&)+0x6b) [0x55b9beaab5bb]
>  6: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>,
> PG::RecoveryCtx*)+0x384) [0x55b9bea7db14]
>  7: (OSD::process_peering_events(std::__cxx11::list<PG*,
> std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x263) [0x55b9be9d1723]
>  8: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*,
> ThreadPool::TPHandle&)+0x2a) [0x55b9bea1274a]
>  9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb0) [0x55b9bf076d40]
>  10: (ThreadPool::WorkThread::entry()+0x10) [0x55b9bf077ef0]
>  11: (()+0x7507) [0x7f14e2c96507]
>  12: (clone()+0x3f) [0x7f14e0ca214f]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
> ===
>
> Once I restarted the impacted OSDs, which brought them up to 12.2.7,
> everything recovered just fine and the cluster is healthy.  The only rub is
> that losing that many OSDs simultaneously caused a significant I/O
> disruption to the production servers for several minutes while I brought up
> the remaining OSDs.  I have been trying to reproduce this issue in a lab
> before continuing the upgrades on the other three clusters, but am coming
> up short.  Has anyone seen anything like this, or am I missing something
> obvious?
>
> Given how quickly the issue happened and how hard it has been to
> reproduce, I unfortunately have limited logging and debug information
> available.  If it helps, all ceph-mon, ceph-mds, radosgw, and ceph-mgr
> daemons were running 12.2.7, and 30 of the 50 ceph-osd daemons were also
> on 12.2.7 when the remaining 20 (still on 10.2.10) crashed.
>
> Thanks,
>
> --
> Kenneth Van Alstyne
> Systems Architect
> Knight Point Systems, LLC
> Service-Disabled Veteran-Owned Business
> 1775 Wiehle Avenue Suite 101 | Reston, VA 20190
> c: 228-547-8045 f: 571-266-3106
> www.knightpoint.com
> DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
> GSA Schedule 70 SDVOSB: GS-35F-0646S
> GSA MOBIS Schedule: GS-10F-0404Y
> ISO 20000 / ISO 27001 / CMMI Level 3
>
> Notice: This e-mail message, including any attachments, is for the sole
> use of the intended recipient(s) and may contain confidential and
> privileged information. Any unauthorized review, copy, use, disclosure, or
> distribution is STRICTLY prohibited. If you are not the intended recipient,
> please contact the sender by reply e-mail and destroy all copies of the
> original message.
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
