Chia-Ping Tsai created KAFKA-17142: -------------------------------------- Summary: Fix deadlock caused by LogManagerTest#testLogRecoveryMetrics Key: KAFKA-17142 URL: https://issues.apache.org/jira/browse/KAFKA-17142 Project: Kafka Issue Type: Test Reporter: Chia-Ping Tsai Assignee: Chia-Ping Tsai
`testLogRecoveryMetrics` uses mock scheduler to create UnifiedLog [0], and the mock scheduler does NOT have true thread poll, and that is the root cause! We create recovery threads [1] for each data folder, and they will pass the action: `writeToFileForTruncation` to scheduler [2]. The action requires the read lock [3], so the deadlock is produced when one thread executes the action from another thread. For example: 1. thread_a is handling dir_a, and it holds the writelock_a 2. thread_a pass action_a: `writeToFileForTruncation` to mock scheduler 3. thread_b is handling dir_b, and it holds the writelock_b 4. thread_b pass action_b: `writeToFileForTruncation` to mock scheduler 5. thread_b (holding writelock_b) handle the action_a, so it requires the readlock_a 6. thread_a (holding writelock_a) handle the action_b, so it requires the readlock_b so lucky we have a deadlock :cry This deadlock happens due to the mock scheduler, so that is a issue belonging to test. We can fix it by a simple solution - add some delay when creating next UnifiedLog Or we can do a bit refactor to production code: create snapshot of `epochs` when we are holding write lock! That means `writeToFileForTruncation` does not take lock anymore. For example: {code:java} public void truncateFromStartAsyncFlush(long startOffset) { lock.writeLock().lock(); try { List<EpochEntry> removedEntries = truncateFromStart(epochs, startOffset); if (!removedEntries.isEmpty()) { ... List<EpochEntry> entries = new ArrayList<>(epochs.values()); scheduler.scheduleOnce("leader-epoch-cache-flush-" + topicPartition, () -> checkpoint.writeForTruncation(entries)); ... } } finally { lock.writeLock().unlock(); } } {code} [0] https://github.com/apache/kafka/blob/808498e9391dab292a7ccd8a0bf3713f444f9d2f/core/src/test/scala/unit/kafka/log/LogManagerTest.scala#L968 [1] https://github.com/apache/kafka/blob/808498e9391dab292a7ccd8a0bf3713f444f9d2f/core/src/main/scala/kafka/log/LogManager.scala#L434 [2] https://github.com/apache/kafka/blob/808498e9391dab292a7ccd8a0bf3713f444f9d2f/storage/src/main/java/org/apache/kafka/storage/internals/epoch/LeaderEpochFileCache.java#L351 [3] https://github.com/apache/kafka/blob/808498e9391dab292a7ccd8a0bf3713f444f9d2f/storage/src/main/java/org/apache/kafka/storage/internals/epoch/LeaderEpochFileCache.java#L528 -- This message was sent by Atlassian Jira (v8.20.10#820010)