[ 
https://issues.apache.org/jira/browse/KUDU-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17923865#comment-17923865
 ] 

ASF subversion and git services commented on KUDU-3633:
-------------------------------------------------------

Commit fc40fcda30a93baabf50299a68af6023a44b369d in kudu's branch 
refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=fc40fcda3 ]

KUDU-3633 shutdown DnsResolver in ServerBase::ShutdownImpl()

The DNS resolver's thread pool should be shut down along with the
messenger in ServerBase so that RPCs which fail as a side effect of the
in-progress shutdown are not retried.  Such RPCs might otherwise be
retried by invoking rpc::Proxy::RefreshDnsAndEnqueueRequest(), etc.
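
As a rough illustration of why the shutdown order matters, here is a minimal
sketch of a worker pool that rejects submissions once its shutdown has
started.  MiniPool and LateRetryIsRejected() are hypothetical names made up
for this example; this is not Kudu's ThreadPool or DnsResolver code, only the
general pattern under the assumption that a retry reaches the pool through a
Submit()-like call.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical single-worker pool; a stand-in for a resolver thread pool.
class MiniPool {
 public:
  MiniPool() : worker_([this] { Run(); }) {}
  ~MiniPool() { Shutdown(); }

  // Returns false once Shutdown() has started, so late retries are dropped
  // instead of spawning new work during teardown.
  bool Submit(std::function<void()> task) {
    std::lock_guard<std::mutex> l(lock_);
    if (closed_) return false;
    tasks_.push(std::move(task));
    cv_.notify_one();
    return true;
  }

  // Safe to call more than once: runs already queued tasks, then joins.
  void Shutdown() {
    {
      std::lock_guard<std::mutex> l(lock_);
      closed_ = true;
    }
    cv_.notify_all();
    if (worker_.joinable()) worker_.join();
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> l(lock_);
    for (;;) {
      cv_.wait(l, [this] { return closed_ || !tasks_.empty(); });
      if (tasks_.empty()) return;  // closed and fully drained
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop();
      l.unlock();
      task();  // run outside the lock so the task may call Submit()
      l.lock();
    }
  }

  std::mutex lock_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool closed_ = false;
  std::thread worker_;  // declared last so it starts after the other members
};

// Returns true if work accepted before shutdown ran exactly once and a
// "retry" submitted after shutdown was rejected.
bool LateRetryIsRejected() {
  MiniPool pool;
  std::atomic<int> ran{0};
  const bool accepted = pool.Submit([&ran] { ++ran; });
  assert(accepted);
  pool.Shutdown();
  const bool late = pool.Submit([&ran] { ++ran; });
  return accepted && !late && ran.load() == 1;
}
```

Once Shutdown() has returned, no worker thread of the pool is left to touch
objects that the owning server is about to destroy.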

On a related note, I also added a guard to protect ThreadPool::tokens_
in the destructor of the ThreadPool class, as is done elsewhere.  I also
snuck in an update to run the loop of DCHECK() calls only when the
DCHECK_IS_ON() macro evaluates to 'true'.
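
The two smaller changes can be sketched roughly as follows.  TokenRegistry,
Token, and RegisterAndReleaseConcurrently() are hypothetical names, and the
standard NDEBUG/assert() pair is used here as a stand-in for glog's
DCHECK_IS_ON()/DCHECK(); the actual ThreadPool code may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <unordered_set>
#include <vector>

struct Token {
  int id = 0;
};

// Hypothetical registry of outstanding tokens, guarded by a single mutex.
class TokenRegistry {
 public:
  ~TokenRegistry() {
    // Take the same lock as every other accessor before reading tokens_.
    std::lock_guard<std::mutex> l(lock_);
    CheckTokensLocked();
  }

  void Register(const Token* t) {
    std::lock_guard<std::mutex> l(lock_);
    tokens_.insert(t);
  }

  void Release(const Token* t) {
    std::lock_guard<std::mutex> l(lock_);
    tokens_.erase(t);
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> l(lock_);
    return tokens_.size();
  }

 private:
  void CheckTokensLocked() const {
#ifndef NDEBUG
    // Debug-only loop: release builds skip the iteration entirely.
    for (const Token* t : tokens_) {
      assert(t != nullptr);
    }
#endif
  }

  mutable std::mutex lock_;
  std::unordered_set<const Token*> tokens_;
};

// Registers and releases one token per thread; returns the number of tokens
// still registered after all threads have been joined.
std::size_t RegisterAndReleaseConcurrently(int num_threads) {
  TokenRegistry registry;
  std::vector<Token> tokens(num_threads);
  std::vector<std::thread> threads;
  for (int i = 0; i < num_threads; ++i) {
    threads.emplace_back([&registry, &tokens, i] {
      registry.Register(&tokens[i]);
      registry.Release(&tokens[i]);
    });
  }
  for (std::thread& t : threads) t.join();
  return registry.size();
}
```

Note that a lock in the destructor only removes the unsynchronized read; it
does not make it safe for another thread to use the object after destruction
has begun, so the owner still has to stop all users of the pool first.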

This addresses flakiness reported in at least one of the RemoteKsckTest
scenarios (e.g., TestFilterOnNotabletTable in [1]).  One of the related
TSAN reports looked like this:

RemoteKsckTest.TestFilterOnNotabletTable: WARNING: ThreadSanitizer: data race
  Read of size 8 at 0x7b54001e5118 by main thread:
    #0 std::__1::__hash_table<kudu::ThreadPoolToken*, ...>::size() const
    #1 std::__1::unordered_set<kudu::ThreadPoolToken*, ...>::size() const
    #2 kudu::ThreadPool::~ThreadPool()
    ...
    #6 kudu::kserver::KuduServer::~KuduServer()
    #7 kudu::tserver::TabletServer::~TabletServer()
    ...

  Previous write of size 8 at 0x7b54001e5118 by thread T262 ...:
    #0 std::__1::__hash_table<kudu::ThreadPoolToken*, ...>::remove(...)
    ...
    #4 kudu::ThreadPool::ReleaseToken(...)
    #5 kudu::ThreadPoolToken::~ThreadPoolToken()
    ...
    #24 kudu::consensus::LeaderElection::~LeaderElection()
    ...
    #35 kudu::rpc::Proxy::RefreshDnsAndEnqueueRequest(...)
    ...
    #41 kudu::DnsResolver::RefreshAddressesAsync()
    ...

  Thread T262 'dns-resolver [w' (tid=29102, running) created by thread T182 at:
    #0 pthread_create
    #1 kudu::Thread::StartThread(...)
    #2 kudu::Thread::Create(...)
    #3 kudu::ThreadPool::CreateThread()
    #4 kudu::ThreadPool::DoSubmit(..., kudu::ThreadPoolToken*)
    #5 kudu::ThreadPool::Submit(...)
    #6 kudu::DnsResolver::RefreshAddressesAsync(..)
    #7 kudu::rpc::Proxy::RefreshDnsAndEnqueueRequest(...)
    #8 kudu::rpc::Proxy::AsyncRequest(...)
    ...
    #15 kudu::rpc::OutboundCall::CallCallback()
    #16 kudu::rpc::OutboundCall::SetFailed()
    #17 kudu::rpc::Connection::Shutdown()
    #18 kudu::rpc::ReactorThread::ShutdownInternal()
    ...
    #25 kudu::rpc::ReactorThread::RunThread()
    ...

[1] http://dist-test.cloudera.org:8080/test_drilldown?test_name=ksck_remote-test

Change-Id: I525f1078a349dbd2926938bb4fcc3e80888dfbb4
Reviewed-on: http://gerrit.cloudera.org:8080/22434
Tested-by: Alexey Serbin <ale...@apache.org>
Reviewed-by: Abhishek Chennaka <achenn...@cloudera.com>


> Threadpool check flakiness in ksck_remote-test during MiniMaster shutdown
> -------------------------------------------------------------------------
>
>                 Key: KUDU-3633
>                 URL: https://issues.apache.org/jira/browse/KUDU-3633
>             Project: Kudu
>          Issue Type: Sub-task
>            Reporter: Bakai Ádám
>            Assignee: Bakai Ádám
>            Priority: Major
>
> {code:java}
> F20241204 12:57:40.147302 16123 threadpool.cc:391] Check failed: 1 == 
> tokens_.size() (1 vs. 3) Threadpool raft destroyed with 3 allocated tokens
> *** Check failure stack trace: ***
>     @     0x7f6b96b2cd64 google::LogMessage::SendToLog() at ??:0
>     @     0x7f6b96b2d910 google::LogMessage::Flush() at ??:0
>     @     0x7f6b96b32a4b google::LogMessageFatal::~LogMessageFatal() at ??:0
>     @     0x7f6b974a777d kudu::ThreadPool::~ThreadPool() at ??:0
> I20241204 12:57:40.556027 23288 raft_consensus.cc:1270] T 
> df574f38d0a746d1929d9494d82da991 P c273df5d41694d4da3bc1b5bc5e81b84 [term 2 
> FOLLOWER]: Refusing update from remote peer 2e54eeefd5f947279415fb606d3fe035: 
> Log matching property violated. Preceding OpId in replica: term: 1 index: 1. 
> Preceding OpId from leader: term: 2 index: 2. (index mismatch)
> I20241204 12:57:40.558073 23666 consensus_queue.cc:1035] T 
> df574f38d0a746d1929d9494d82da991 P 2e54eeefd5f947279415fb606d3fe035 [LEADER]: 
> Connected to new peer: Peer: permanent_uuid: 
> "c273df5d41694d4da3bc1b5bc5e81b84" member_type: VOTER last_known_addr { host: 
> "127.15.190.193" port: 33967 }, Status: LMP_MISMATCH, Last received: 0.0, 
> Next index: 2, Last known committed idx: 1, Time since last communication: 
> 0.000s
>     @     0x7f6b9ff4f6bf std::__1::default_delete<>::operator()() at ??:0
> I20241204 12:57:40.605798 23460 raft_consensus.cc:1270] T 
> df574f38d0a746d1929d9494d82da991 P 87f06d0d674a4791871f81a7af62b7be [term 2 
> FOLLOWER]: Refusing update from remote peer 2e54eeefd5f947279415fb606d3fe035: 
> Log matching property violated. Preceding OpId in replica: term: 1 index: 1. 
> Preceding OpId from leader: term: 2 index: 2. (index mismatch)
> I20241204 12:57:40.611544 23707 consensus_queue.cc:1035] T 
> df574f38d0a746d1929d9494d82da991 P 2e54eeefd5f947279415fb606d3fe035 [LEADER]: 
> Connected to new peer: Peer: permanent_uuid: 
> "87f06d0d674a4791871f81a7af62b7be" member_type: VOTER last_known_addr { host: 
> "127.15.190.195" port: 35365 }, Status: LMP_MISMATCH, Last received: 0.0, 
> Next index: 2, Last known committed idx: 1, Time since last communication: 
> 0.000s
>     @     0x7f6b9ff4f62e std::__1::unique_ptr<>::reset() at ??:0
>     @     0x7f6b9ff0e2cc std::__1::unique_ptr<>::~unique_ptr() at ??:0
>     @     0x7f6b9ffb65b4 kudu::kserver::KuduServer::~KuduServer() at ??:0
>     @     0x7f6b9ffac863 kudu::master::Master::~Master() at ??:0
>     @     0x7f6b9ffacb5a kudu::master::Master::~Master() at ??:0
>     @     0x7f6b9ffea408 std::__1::default_delete<>::operator()() at ??:0
>     @     0x7f6b9ffe34ce std::__1::unique_ptr<>::reset() at ??:0
>     @     0x7f6ba00773c3 kudu::master::MiniMaster::Shutdown() at ??:0
>     @           0x354ea9 
> kudu::tools::RemoteKsckTest_TestClusterWithLocation_Test::TestBody() at 
> /root/tmp/test123/kudu/src/kudu/tools/ksck_remote-test.cc:607
>     @     0x7f6ba045adc0 
> testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
>     @     0x7f6ba04389c2 testing::Test::Run() at ??:0
>     @     0x7f6ba0439cd9 testing::TestInfo::Run() at ??:0
>     @     0x7f6ba043acb5 testing::TestSuite::Run() at ??:0
>     @     0x7f6ba044f7a5 testing::internal::UnitTestImpl::RunAllTests() at 
> ??:0
>     @     0x7f6ba045bc80 
> testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
>     @     0x7f6ba044ed5d testing::UnitTest::Run() at ??:0
>     @           0x3801bc RUN_ALL_TESTS() at 
> /root/tmp/test123/kudu/thirdparty/installed/tsan/include/gtest/gtest.h:?
>     @           0x37f0bd main at 
> /root/tmp/test123/kudu/src/kudu/util/test_main.cc:?
>     @     0x7f6b93f58bf7 __libc_start_main at ??:0
>     @           0x298ada _start at ??:? {code}



