Todd Lipcon created KUDU-2193:
---------------------------------
Summary: Severe lock contention on TSTabletManager lock
Key: KUDU-2193
URL: https://issues.apache.org/jira/browse/KUDU-2193
Project: Kudu
Issue Type: Bug
Components: tserver
Affects Versions: 1.6.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Critical
While doing some stress/failure testing on a cluster with lots of tablets, I ran
into the following mess:
- TSTabletManager::GenerateIncrementalTabletReport is holding the
TSTabletManager lock in 'read' mode
-- it's calling CreateReportedTabletPB on a bunch of tablets which are in the
process of an election storm
-- each such call blocks in RaftConsensus::ConsensusState since that tablet's
consensus is in the process of fsyncing metadata to disk
-- thus it's holding the TSTabletManager lock in 'read' mode for a long time
(many seconds if not tens of seconds)
- meanwhile, some other thread is trying to take TSTabletManager::lock for
write, and is blocked due to the above reader
- rw_spinlock is writer-starvation-free, which means that once a writer is
waiting, no new readers can acquire the lock
What's worse is that rw_spinlock is a true spin lock, so now there are tens of
threads in a 'while (true) sched_yield()' loop, generating over 1.5M context
switches per second.