[ https://issues.apache.org/jira/browse/KUDU-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317556#comment-17317556 ]
ASF subversion and git services commented on KUDU-3268: ------------------------------------------------------- Commit 6aeb466c538d8cc566031c4a82200f18f088e18a in kudu's branch refs/heads/master from Andrew Wong [ https://gitbox.apache.org/repos/asf?p=kudu.git;h=6aeb466 ] KUDU-3268: fix race between MM scheduling and unregistration Following commit 9e4664d the following race was possible (S: scheduler thread, T: another thread): S: finds best op is op1 T: unregisters op1 because the tablet is being shut down T: takes running_instances_lock_ - waits for running count to go to 0 - destructs op1 S: takes running_instances_lock_ - increments running count to 1 S: attempts to run op1 but segfaults This resulted in a crash like the following, when shutting down the tserver (though it seems possible this could manifest itself when deleting a TabletReplica): PC: @ 0x7ff97596de0d kudu::MaintenanceManager::LaunchOp() *** SIGSEGV (@0x30) received by PID 21067 (TID 0x7ff96343b700) from PID 48; stack trace: *** @ 0x7ff976846980 (unknown) at ??:0 @ 0x7ff97596de0d kudu::MaintenanceManager::LaunchOp() at ??:0 @ 0x7ff97596b538 _ZZN4kudu18MaintenanceManager18RunSchedulerThreadEvENKUlvE_clEv at ??:0 @ 0x7ff97596f124 _ZNSt17_Function_handlerIFvvEZN4kudu18MaintenanceManager18RunSchedulerThreadEvEUlvE_E9_M_invokeERKSt9_Any_data at ??:0 @ 0x7ff977e2bcf4 std::function<>::operator()() at ??:0 @ 0x7ff975a05e6e kudu::ThreadPool::DispatchThread() at ??:0 @ 0x7ff975a06757 _ZZN4kudu10ThreadPool12CreateThreadEvENKUlvE_clEv at ??:0 @ 0x7ff975a07e7b _ZNSt17_Function_handlerIFvvEZN4kudu10ThreadPool12CreateThreadEvEUlvE_E9_M_invokeERKSt9_Any_data at ??:0 @ 0x7ff977e2bcf4 std::function<>::operator()() at ??:0 @ 0x7ff9759f8913 kudu::Thread::SuperviseThread() at ??:0 @ 0x7ff97683b6db start_thread at ??:0 @ 0x7ff97388e71f clone at ??:0 This patch fixes this by checking whether the op has been cancelled, while holding running_instanes_lock_, before proceeding with the scheduling. I added a new test that failed consistently within 100 runs, that passed 5000 times locally. Change-Id: I29ad4d545de8166832b7c4148880a7e8719353bf Reviewed-on: http://gerrit.cloudera.org:8080/17291 Reviewed-by: Alexey Serbin <aser...@cloudera.com> Tested-by: Andrew Wong <aw...@cloudera.com> > Crash in TabletServerDiskErrorTest.TestRandomOpSequence > ------------------------------------------------------- > > Key: KUDU-3268 > URL: https://issues.apache.org/jira/browse/KUDU-3268 > Project: Kudu > Issue Type: Bug > Components: test, tserver > Reporter: Andrew Wong > Priority: Major > Attachments: tablet_server-test.3.txt.gz > > > A pre-commit failed with the following crash when attempting to launch an op > after stopping a replica: > {code:java} > I0323 18:15:01.078991 23854 maintenance_manager.cc:373] P > c8a93089db0041f5930b9fb1832714ed: Scheduling > CompactRowSetsOp(ffffffffffffffffffffffffffffffff): perf score=1.012452 > I0323 18:15:01.079111 21067 tablet_server-test.cc:852] Tablet server > responded with: timestamp: 6621279441214984192 > I0323 18:15:01.079317 23789 maintenance_manager.cc:594] P > c8a93089db0041f5930b9fb1832714ed: > UndoDeltaBlockGCOp(ffffffffffffffffffffffffffffffff) complete. Timing: real > 0.000s user 0.000s sys 0.000s Metrics: > {"cfile_init":1,"lbm_read_time_us":73,"lbm_reads_lt_1ms":4} > E0323 18:15:01.080865 23788 cfile_reader.cc:591] Encountered corrupted CFile > in filesystem block: 4124746176525068430 > I0323 18:15:01.080960 23788 ts_tablet_manager.cc:1774] T > ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed: failing > tablet > I0323 18:15:01.080950 21067 tablet_server-test.cc:852] Tablet server > responded with: timestamp: 6621279441223315456 > I0323 18:15:01.081243 24138 tablet_replica.cc:324] T > ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed: stopping > tablet replica > I0323 18:15:01.081670 21067 tablet_server-test.cc:852] Tablet server > responded with: error { > code: TABLET_NOT_RUNNING > status { > code: ILLEGAL_STATE > message: "Tablet not RUNNING: STOPPING" > } > } > I0323 18:15:01.081777 21067 tablet_server-test.cc:890] Failure was caught by > an op! > W0323 18:15:01.082907 23788 tablet_mm_ops.cc:176] T > ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed: > Compaction failed on ffffffffffffffffffffffffffffffff: Corruption: Flush to > disk failed: checksum error on CFile block 4124746176525068430 at offset=1006 > size=24: Checksum does not match: 3582029077 vs expected 3582029077 > I0323 18:15:01.082957 23788 maintenance_manager.cc:594] P > c8a93089db0041f5930b9fb1832714ed: > CompactRowSetsOp(ffffffffffffffffffffffffffffffff) complete. Timing: real > 0.004s user 0.003s sys 0.000s Metrics: > {"cfile_cache_miss":3,"cfile_cache_miss_bytes":92,"delta_iterators_relevant":2,"dirs.queue_time_us":630,"dirs.run_cpu_time_us":368,"dirs.run_wall_time_us":2220,"lbm_read_time_us":54,"lbm_reads_lt_1ms":3,"lbm_write_time_us":168,"lbm_writes_lt_1ms":6,"num_input_rowsets":2,"spinlock_wait_cycles":1792,"tablet-open.queue_time_us":135,"thread_start_us":382,"threads_started":5} > I0323 18:15:01.083369 23854 maintenance_manager.cc:373] P > c8a93089db0041f5930b9fb1832714ed: Scheduling > CompactRowSetsOp(ffffffffffffffffffffffffffffffff): perf score=1.012452 > *** Aborted at 1616523301 (unix time) try "date -d @1616523301" if you are > using GNU date *** > I0323 18:15:01.083519 24138 raft_consensus.cc:2226] T > ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed [term 1 > LEADER]: Raft consensus shutting down. > I0323 18:15:01.083653 24138 raft_consensus.cc:2255] T > ffffffffffffffffffffffffffffffff P c8a93089db0041f5930b9fb1832714ed [term 1 > FOLLOWER]: Raft consensus is shut down! > I0323 18:15:01.085090 21067 tablet_server-test.cc:894] Tablet was > successfully failed > I0323 18:15:01.085439 21067 tablet_server.cc:166] TabletServer@127.0.0.1:0 > shutting down... > PC: @ 0x7ff97596de0d kudu::MaintenanceManager::LaunchOp() > *** SIGSEGV (@0x30) received by PID 21067 (TID 0x7ff96343b700) from PID 48; > stack trace: *** > @ 0x7ff976846980 (unknown) at ??:0 > @ 0x7ff97596de0d kudu::MaintenanceManager::LaunchOp() at ??:0 > @ 0x7ff97596b538 > _ZZN4kudu18MaintenanceManager18RunSchedulerThreadEvENKUlvE_clEv at ??:0 > @ 0x7ff97596f124 > _ZNSt17_Function_handlerIFvvEZN4kudu18MaintenanceManager18RunSchedulerThreadEvEUlvE_E9_M_invokeERKSt9_Any_data > at ??:0 > @ 0x7ff977e2bcf4 std::function<>::operator()() at ??:0 > @ 0x7ff975a05e6e kudu::ThreadPool::DispatchThread() at ??:0 > @ 0x7ff975a06757 _ZZN4kudu10ThreadPool12CreateThreadEvENKUlvE_clEv at > ??:0 > @ 0x7ff975a07e7b > _ZNSt17_Function_handlerIFvvEZN4kudu10ThreadPool12CreateThreadEvEUlvE_E9_M_invokeERKSt9_Any_data > at ??:0 > @ 0x7ff977e2bcf4 std::function<>::operator()() at ??:0 > @ 0x7ff9759f8913 kudu::Thread::SuperviseThread() at ??:0 > @ 0x7ff97683b6db start_thread at ??:0 > @ 0x7ff97388e71f clone at ??:0 > {code} > Attached the full logs -- seems there's something unsafe about how we > unregister ops (maybe from the fix for KUDU-3149?) when racing with the > scheduler thread. -- This message was sent by Atlassian Jira (v8.3.4#803005)