> On Sept. 9, 2020, 12:11 a.m., Greg Mann wrote: > > src/tests/master_tests.cpp > > Lines 7757-7759 (patched) > > <https://reviews.apache.org/r/72830/diff/1/?file=2238998#file2238998line7757> > > > > Could you include a comment to explain the need for the clock > > operations here and below?
Good point, actually one of these shouldn't be needed! - Benjamin ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/72830/#review221821 ----------------------------------------------------------- On Sept. 8, 2020, 11:49 p.m., Benjamin Mahler wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/72830/ > ----------------------------------------------------------- > > (Updated Sept. 8, 2020, 11:49 p.m.) > > > Review request for mesos and Greg Mann. > > > Bugs: MESOS-9609 > https://issues.apache.org/jira/browse/MESOS-9609 > > > Repository: mesos > > > Description > ------- > > Per MESOS-9609, it's possible for the master to encounter a CHECK > failure during agent removal in the following situation: > > 1. Given a framework with checkpoint == false, with only > executor(s) (no tasks) running on an agent: > 2. When this agent disconects from the master, > Master::removeFramework(Slave*, Framework*) removes the > tasks and executors. However, when there are no tasks, this > function will accidentally insert an entry into > Master::Slave::tasks! (Due to the [] operator usage) > 3. Now if the framework is removed, we have an entry in > Slave::tasks, for which there is no corresponding framework. > 4. When the agent is removed, we have a CHECK failure given > we can't find the framework. > > This test runs through the above scenario, which no longer crashes > given the fix applied. > > > Diffs > ----- > > src/tests/master_tests.cpp 785e5d5d72b5e388067db05579dd457e78321a0c > > > Diff: https://reviews.apache.org/r/72830/diff/1/ > > > Testing > ------- > > Succeeds in repetition after the fix. > > Before the fix: > > ``` > [ RUN ] > MasterTest.NonCheckpointingFrameworkAgentDisconnectionExecutorOnly > F0831 17:53:43.287982 9983 master.cpp:10926] Check failed: framework != > nullptr Framework aa4bd8e2-48b7-4dcf-b45f-6ebacc50f08e-0000 not found while > removing agent aa4bd8e2-48b7-4dcf-b45f-6ebacc50f08e-S0 at > slave(1)@10.0.49.2:33564 (core-dev); agent tasks: { > aa4bd8e2-48b7-4dcf-b45f-6ebacc50f08e-0000: { } } > *** Check failure stack trace: *** > @ 0x7f35e4d64a3d google::LogMessage::Fail() > @ 0x7f35e4d63e23 google::LogMessage::SendToLog() > @ 0x7f35e4d64712 google::LogMessage::Flush() > @ 0x7f35e4d67ec8 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f35e1a1f488 mesos::internal::master::Master::__removeSlave() > @ 0x7f35e1a22302 mesos::internal::master::Master::markGone() > @ 0x7f35e18bc4fe > mesos::internal::master::Master::Http::_markAgentGone()::$_64::operator()() > @ 0x7f35e18bd353 > _ZN5cpp176invokeIZNK5mesos8internal6master6Master4Http14_markAgentGoneERKNS1_7SlaveIDEE4$_64JN7process6FutureIbEEEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSD_DpOSE_ > @ 0x7f35e18bd2f7 > _ZN6lambda8internal7PartialIZNK5mesos8internal6master6Master4Http14_markAgentGoneERKNS2_7SlaveIDEE4$_64JN7process6FutureIbEEEE13invoke_expandISA_St5tupleIJSD_EESG_IJEEJLm0EEEEDTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_EEEEOSJ_OSK_N5cpp1416integer_sequenceImJXspT2_EEEEOSL_ > @ 0x7f35e18bd27d > _ZNO6lambda8internal7PartialIZNK5mesos8internal6master6Master4Http14_markAgentGoneERKNS2_7SlaveIDEE4$_64JN7process6FutureIbEEEEclIJEEEDTcl13invoke_expandclL_ZSt4moveIRSA_EONSt16remove_referenceIT_E4typeEOSJ_EdtdefpT1fEclL_ZSG_IRSt5tupleIJSD_EEESM_SN_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0EEEE_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_EEEEDpOSU_ > @ 0x7f35e18bd21d > _ZN5cpp176invokeIN6lambda8internal7PartialIZNK5mesos8internal6master6Master4Http14_markAgentGoneERKNS4_7SlaveIDEE4$_64JN7process6FutureIbEEEEEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSH_DpOSI_ > @ 0x7f35e18bd1f1 lambda::internal::Invoke<>::operator()<>() > @ 0x7f35e18bd1ae > _ZNO6lambda12CallableOnceIFvvEE10CallableFnINS_8internal7PartialIZNK5mesos8internal6master6Master4Http14_markAgentGoneERKNS6_7SlaveIDEE4$_64JN7process6FutureIbEEEEEEclEv > @ 0x55bf51f2829d _ZNO6lambda12CallableOnceIFvvEEclEv > @ 0x55bf52454065 > _ZZN7process8internal8DispatchIvEclIN6lambda12CallableOnceIFvvEEEEEvRKNS_4UPIDEOT_ENKUlOS7_PNS_11ProcessBaseEE_clESD_SF_ > @ 0x55bf52453ee7 > _ZN5cpp176invokeIZN7process8internal8DispatchIvEclIN6lambda12CallableOnceIFvvEEEEEvRKNS1_4UPIDEOT_EUlOS9_PNS1_11ProcessBaseEE_JS9_SH_EEEDTclclsr3stdE7forwardISD_Efp_Espclsr3stdE7forwardIT0_Efp0_EEESE_DpOSJ_ > @ 0x55bf52453e89 > _ZN6lambda8internal7PartialIZN7process8internal8DispatchIvEclINS_12CallableOnceIFvvEEEEEvRKNS2_4UPIDEOT_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi1EEEE13invoke_expandISI_St5tupleIJS9_SK_EESN_IJOSH_EEJLm0ELm1EEEEDTclsr5cpp17E6invokeclsr3stdE7forwardISD_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_EEEESE_OSR_N5cpp1416integer_sequenceImJXspT2_EEEEOSS_ > @ 0x55bf52453ddb > _ZNO6lambda8internal7PartialIZN7process8internal8DispatchIvEclINS_12CallableOnceIFvvEEEEEvRKNS2_4UPIDEOT_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi1EEEEclIJSH_EEEDTcl13invoke_expandclL_ZSt4moveIRSI_EONSt16remove_referenceISD_E4typeESE_EdtdefpT1fEclL_ZSN_IRSt5tupleIJS9_SK_EEESS_SE_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1EEEE_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_EEEEDpOSZ_ > @ 0x55bf52453d62 > _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8internal8DispatchIvEclINS1_12CallableOnceIFvvEEEEEvRKNS4_4UPIDEOT_EUlOSB_PNS4_11ProcessBaseEE_JSB_St12_PlaceholderILi1EEEEEJSJ_EEEDTclclsr3stdE7forwardISF_Efp_Espclsr3stdE7forwardIT0_Efp0_EEESG_DpOSO_ > @ 0x55bf52453d26 > _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8internal8DispatchIvEclINS_12CallableOnceIFvvEEEEEvRKNS5_4UPIDEOT_EUlOSC_PNS5_11ProcessBaseEE_JSC_St12_PlaceholderILi1EEEEEJSK_EEEvSH_DpOT0_ > @ 0x55bf52453bda > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchIvEclINS0_IFvvEEEEEvRKNS1_4UPIDEOT_EUlOSE_S3_E_JSE_St12_PlaceholderILi1EEEEEEclEOS3_ > @ 0x7f35e4c33d1b > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_ > @ 0x7f35e4bf6bc9 process::ProcessBase::consume() > @ 0x7f35e4c72019 > _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE > @ 0x55bf51f23374 process::ProcessBase::serve() > @ 0x7f35e4bf36ff process::ProcessManager::resume() > @ 0x7f35e4c1c45b > process::ProcessManager::init_threads()::$_16::operator()() > @ 0x7f35e4c1c305 > _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE4$_16vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE > @ 0x7f35e4c1c2d5 std::_Bind_simple<>::operator()() > @ 0x7f35e4c1c1c9 std::thread::_Impl<>::_M_run() > @ 0x7f35e510b6a0 execute_native_thread_routine > @ 0x7f35d8220e65 start_thread > ``` > > > Thanks, > > Benjamin Mahler > >
