> On March 21, 2018, 11:52 a.m., Benjamin Bannier wrote:
> > src/tests/master_allocator_tests.cpp
> > Line 759 (original), 748 (patched)
> > <https://reviews.apache.org/r/66165/diff/1/?file=1983351#file1983351line764>
> >
> > This test seems to get flaky for me with this patch, could you please
> > confirm it works under load (e.g., using `stress` or some actual workload)?
> > I haven't verified all touched tests, please do.
> >
> > [ RUN ] MasterAllocatorTest/0.SlaveLost
> > ../src/tests/master_allocator_tests.cpp:838: Failure
> > Mock function called more times than expected - taking default
> > action specified at:
> > ../src/tests/allocator.hpp:273:
> > Function call: addSlave(@0x7f2414006ab8
> > 6d430237-e4d5-4852-8459-2020f598449f-S2, @0x7f2414006ad8 hostname:
> > "gru1.hw.ca1.mesosphere.com"
> > resources {
> >   name: "cpus"
> >   type: SCALAR
> >   scalar {
> >     value: 3
> >   }
> > }
> > resources {
> >   name: "mem"
> >   type: SCALAR
> >   scalar {
> >     value: 256
> >   }
> > }
> > resources {
> >   name: "disk"
> >   type: SCALAR
> >   scalar {
> >     value: 1024
> >   }
> > }
> > resources {
> >   name: "ports"
> >   type: RANGES
> >   ranges {
> >     range {
> >       begin: 31000
> >       end: 32000
> >     }
> >   }
> > }
> > id {
> >   value: "6d430237-e4d5-4852-8459-2020f598449f-S2"
> > }
> > checkpoint: true
> > port: 39521
> > , @0x7f2423e76c28 { 32-byte object <78-A9 BC-2B 24-7F 00-00 00-00
> > 00-00 00-00 00-00 01-00 00-00 00-00 00-00 01-00 00-00 24-7F 00-00>, 32-byte
> > object <78-A9 BC-2B 24-7F 00-00 00-00 00-00 00-00 00-00 01-00 00-00 00-00
> > 00-00 02-00 00-00 24-7F 00-00>, 32-byte object <78-A9 BC-2B 24-7F 00-00
> > 00-00 00-00 00-00 00-00 01-00 00-00 00-00 00-00 03-00 00-00 00-00 00-00> },
> > @0x7f2423e76f20 48-byte object <01-00 00-00 24-7F 00-00 00-00 00-00 00-00
> > 00-00 BF-83 8E-4D FE-7F 00-00 C0-89 E7-23 24-7F 00-00 00-87 E7-23 24-7F
> > 00-00 8C-52 15-29 24-7F 00-00>, @0x7f2414006e98 { cpus:3, mem
> > :256, disk:1024, ports:[31000-32000] }, @0x7f2414006e30 {})
> > Expected: to be called once
> > Actual: called twice - over-saturated and active
> > *** Aborted at 1521624413 (unix time) try "date -d @1521624413" if
> > you are using GNU date ***
> > PC: @ 0x2cb968b testing::UnitTest::AddTestPartResult()
> > *** SIGSEGV (@0x0) received by PID 14803 (TID 0x7f2423e78700) from
> > PID 0; stack trace: ***
> > @ 0x7f242cba25d0 (unknown)
> > @ 0x2cb968b testing::UnitTest::AddTestPartResult()
> > @ 0x2cb9219
> > testing::internal::AssertHelper::operator=()
> > @ 0x2cfc809
> > testing::internal::GoogleTestFailureReporter::ReportFailure()
> > @ 0xe36438 testing::internal::Expect()
> > @ 0x2cf6ef4
> > testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith()
> > @ 0x135367a
> > _ZN7testing8internal18FunctionMockerBaseIFvRKN5mesos7SlaveIDERKNS2_9SlaveInfoERKSt6vectorINS2_20SlaveInfo_CapabilityESaISA_EERK6OptionINS2_14UnavailabilityEERKNS2_9ResourcesERK7hashmapINS2_11FrameworkIDESK_St4hashISO_ESt8equal_toISO_EEEE10InvokeWithERKSt5tupleIJS5_S8_SE_SJ_SM_SV_EE
> > @ 0x135362b
> > testing::internal::FunctionMocker<>::Invoke()
> > @ 0x12ebc75
> > mesos::internal::tests::TestAllocator<>::addSlave()
> > @ 0x7f2433f04cad mesos::internal::master::Master::addSlave()
> > @ 0x7f2433f030e6
> > mesos::internal::master::Master::__registerSlave()
> > @ 0x7f243402d3b3
> > _ZZN7process8dispatchIN5mesos8internal6master6MasterERKNS_4UPIDEONS2_20RegisterSlaveMessageERKNS_6FutureIbEES7_S8_SD_EEvRKNS_3PIDIT_EEMSF_FvT0_T1_T2_EOT3_OT4_OT5_ENKUlOS5_S9_OSB_PNS_11ProcessBaseEE_clESU_S9_SV_SX_
> > @ 0x7f243402cfa1
> > _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master6MasterERKNS1_4UPIDEONS4_20RegisterSlaveMessageERKNS1_6FutureIbEES9_SA_SF_EEvRKNS1_3PIDIT_EEMSH_FvT0_T1_T2_EOT3_OT4_OT5_EUlOS7_SB_OSD_PNS1_11ProcessBaseEE_JS7_SA_SD_SZ_EEEDTclclsr3stdE7forwardISH_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSH_DpOS11_
> > @ 0x7f243402cf0d
> > _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_4UPIDEONS5_20RegisterSlaveMessageERKNS2_6FutureIbEESA_SB_SG_EEvRKNS2_3PIDIT_EEMSI_FvT0_T1_T2_EOT3_OT4_OT5_EUlOS8_SC_OSE_PNS2_11ProcessBaseEE_JS8_SB_SE_St12_PlaceholderILi1EEEE13invoke_expandIS11_St5tupleIJS8_SB_SE_S13_EES16_IJOS10_EEJLm0ELm1ELm2ELm3EEEEDTclsr5cpp17E6invokeclsr3stdE7forwardISI_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardISM_Efp0_EEclsr3stdE7forwardISN_Efp2_EEEEOSI_OSM_N5cpp1416integer_sequenceImJXspT2_EEEEOSN_
> > @ 0x7f243402cdf2
> > _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_4UPIDEONS5_20RegisterSlaveMessageERKNS2_6FutureIbEESA_SB_SG_EEvRKNS2_3PIDIT_EEMSI_FvT0_T1_T2_EOT3_OT4_OT5_EUlOS8_SC_OSE_PNS2_11ProcessBaseEE_JS8_SB_SE_St12_PlaceholderILi1EEEEclIJS10_EEEDTcl13invoke_expandclL_ZSt4moveIRS11_EONSt16remove_referenceISI_E4typeEOSI_EdtdefpT1fEclL_ZS16_IRSt5tupleIJS8_SB_SE_S13_EEES1B_S1C_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2ELm3EEEE_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_EEEEDpOS1J_
> > @ 0x7f243402cd72
> > _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS4_4UPIDEONS7_20RegisterSlaveMessageERKNS4_6FutureIbEESC_SD_SI_EEvRKNS4_3PIDIT_EEMSK_FvT0_T1_T2_EOT3_OT4_OT5_EUlOSA_SE_OSG_PNS4_11ProcessBaseEE_JSA_SD_SG_St12_PlaceholderILi1EEEEEJS12_EEEDTclclsr3stdE7forwardISK_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOS17_
> > @ 0x7f243402cd36
> > _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS5_4UPIDEONS8_20RegisterSlaveMessageERKNS5_6FutureIbEESD_SE_SJ_EEvRKNS5_3PIDIT_EEMSL_FvT0_T1_T2_EOT3_OT4_OT5_EUlOSB_SF_OSH_PNS5_11ProcessBaseEE_JSB_SE_SH_St12_PlaceholderILi1EEEEEJS13_EEEvOSL_DpOT0_
> > @ 0x7f243402cafa
> > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master6MasterERKNS1_4UPIDEONSB_20RegisterSlaveMessageERKNS1_6FutureIbEESG_SH_SM_EEvRKNS1_3PIDIT_EEMSO_FvT0_T1_T2_EOT3_OT4_OT5_EUlOSE_SI_OSK_S3_E_JSE_SH_SK_St12_PlaceholderILi1EEEEEEclEOS3_
> > @ 0x7f242dfcc55d
> > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
> > @ 0x7f242dfae809 process::ProcessBase::consume()
> > @ 0x7f242e032549
> > _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
> > @ 0xdda4d6 process::ProcessBase::serve()
> > @ 0x7f242dfab2bd process::ProcessManager::resume()
> > @ 0x7f242dfb4d3e
> > process::ProcessManager::init_threads()::$_1::operator()()
> > @ 0x7f242dfb4be5
> > _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_1vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> > @ 0x7f242dfb4bb5 std::_Bind_simple<>::operator()()
> > @ 0x7f242dfb4aa9 std::thread::_State_impl<>::_M_run()
> > @ 0x7f2429a6e90f execute_native_thread_routine
> > @ 0x7f242cb9873a start_thread
> > @ 0x7f24291d6e7f __GI___clone
> > [2] 14803 segmentation fault (core dumped) ./src/mesos-tests
> > --gtest_filter='*MasterAllocatorTest/0*' --gtest_repeat=-1
>
> Till Toenshoff wrote:
>     This RR reverts all changes on tests that use multiple slaves -
>     `SlaveLost` is one of them. The pattern chosen for the simpler tests is
>     allowing for multiple `AddSlave` events, working around the "test teardown
>     vs. slave registration-retry" race. That however can not generally be applied
>     towards tests with multiple slaves - we would end up not knowing if
>     additional `AddSlave` were expected or to be ignored. We need to fix those as
>     well nevertheless.

Dropping this as I cannot reproduce it myself anymore. I suspect now that the above failure was caused by an incorrect build.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66165/#review199649
-----------------------------------------------------------


On March 20, 2018, 9:36 p.m., Till Toenshoff wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66165/
> -----------------------------------------------------------
>
> (Updated March 20, 2018, 9:36 p.m.)
>
>
> Review request for mesos, Alexander Rukletsov and Benjamin Bannier.
>
>
> Bugs: MESOS-8613
>     https://issues.apache.org/jira/browse/MESOS-8613
>
>
> Repository: mesos
>
>
> Description
> -------
>
> When the slave has a very short lifetime, its scheduled registration
> retry might occur when the test is tearing down. These unintuitively
> motivated registrations in turn cause additional invocations of
> `AddSlave` on the allocator.
> Additionally, this also reverts the newly introduced Clock pauses as
> they have shown to be problematic.
>
>
> Diffs
> -----
>
>   src/tests/master_allocator_tests.cpp
>     1ceb8e8a57ab300a957931d5ad3d54904e555597
>
>
> Diff: https://reviews.apache.org/r/66165/diff/1/
>
>
> Testing
> -------
>
> make check
>
> Ran the MasterAllocatorTests 10k times without any hiccups.
>
>
> Thanks,
>
> Till Toenshoff
>
>
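
For readers following the thread, the race described in the review request can be modelled with a small standard-library-only sketch. This is not the patch under review and does not use gmock; `CallExpectation` and `runScenario` are hypothetical names introduced here purely for illustration. It assumes, as the description states, that a registration retry firing during teardown results in a second `addSlave` call, and it contrasts an "exactly once" expectation with one that tolerates additional calls.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical stand-in for a mock expectation's cardinality. This is
// NOT gmock, only a model of "called once" vs. "called at least once".
class CallExpectation {
public:
  static CallExpectation exactly(std::size_t n) {
    return CallExpectation(n, n);
  }

  static CallExpectation atLeast(std::size_t n) {
    return CallExpectation(n, std::numeric_limits<std::size_t>::max());
  }

  // Records one invocation of the mocked function.
  void invoke() { ++calls_; }

  // True when more calls arrived than the expectation allows.
  bool overSaturated() const { return calls_ > max_; }

  // True when the call count lies within [min, max].
  bool satisfied() const { return calls_ >= min_ && calls_ <= max_; }

private:
  CallExpectation(std::size_t min, std::size_t max)
    : min_(min), max_(max) {}

  std::size_t min_;
  std::size_t max_;
  std::size_t calls_ = 0;
};

// Simulates one test run: the agent registers once and, optionally, a
// scheduled registration retry fires while the test is tearing down.
// Returns whether the expectation held at the end of the run.
inline bool runScenario(CallExpectation expectation, bool retryDuringTeardown) {
  expectation.invoke();    // Initial registration -> addSlave.
  if (retryDuringTeardown) {
    expectation.invoke();  // Late retry -> second addSlave.
  }
  return expectation.satisfied();
}
```

In this model, `runScenario(CallExpectation::exactly(1), true)` fails in the same way as the "called twice - over-saturated" report quoted above, while `CallExpectation::atLeast(1)` tolerates the late retry. As noted in the thread, the tolerant form would not carry over cleanly to tests with multiple slaves, because an extra call could no longer be told apart from one that should have been counted.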
