[ 
https://issues.apache.org/jira/browse/KAFKA-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019596#comment-16019596
 ] 

Raymond Conn commented on KAFKA-5298:
-------------------------------------

Thanks, this definitely fixes part of it since the producer can now recover if 
the topic gets created at the destination and the logs will keep warning until 
the topic gets created. 

I don't think it fixes the deadlock at shutdown however, since if the topic 
never gets created the mirrorMaker thread will never exit the producer (since 
{{max.block.ms}} is max long) and as a result can never countdown the shutdown 
latch 

> MirrorMaker deadlocks with missing topics
> -----------------------------------------
>
>                 Key: KAFKA-5298
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5298
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, tools
>    Affects Versions: 0.10.2.1
>            Reporter: Raymond Conn
>
> When mirrorMaker mirrors a topic to destination brokers that have topic auto 
> create disabled and a topic doesn't exist on the destination brokers, the 
> producer in mirror maker logs the following 
> {code}
> Error while fetching metadata with correlation id 467 : 
> \{mirror-test2=UNKNOWN_TOPIC_OR_PARTITION\}
> Error while fetching metadata with correlation id 468 : 
> {mirror-test2=UNKNOWN_TOPIC_OR_PARTITION}
> {code}
> This log message is fine and expected. The problem is the log message stops 
> ~5 min later. At which point the logs look fine, but mirror maker is not 
> mirroring any of its topics. 
> What's worse is mirrorMaker is basically in an unrecoverable state once this 
> happens (the log statement stops). If you create the topic at the destination 
> mirrorMaker still won't mirror data until a restart. Attempts to restart 
> mirrorMaker (cleanly) fail because the process is more or less deadlocked in 
> its shutdown hook.
> Here is the reasoning:
> * MirrorMaker becomes unrecoverable after 5 minutes because of this loop in 
> the 
> [producer|https://github.com/apache/kafka/blob/e06cd3e55f25a0bb414e0770493906ea8019420a/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L543-L561]
> * The producer will keep waiting for metadata for the missing topic or until 
> the max timeout is reached. (max long in this case) 
> * after 5 minutes the producer stops making a metadata request for the topic 
> because that topic expires 
> [here|https://github.com/apache/kafka/blob/e06cd3e55f25a0bb414e0770493906ea8019420a/clients/src/main/java/org/apache/kafka/clients/Metadata.java#L218]
>  
> * topic is never re-added for metadata requests since the only add is before 
> entering the loop 
> [here|(https://github.com/apache/kafka/blob/e06cd3e55f25a0bb414e0770493906ea8019420a/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L528]
> So basically after 5 minutes all metadata requests moving forward are for no 
> topics since the topic expired. The mirrorMaker thread essentially gets stuck 
> waiting forever since there will never be a metadata request for the topic 
> the thread is waiting on
> All of this basically leads to a deadlock state in the shutdown hook. 
> * shutdown hook sends a shutdown to the mirrorMaker threads 
> * waits for threads to exit their loop by waitind on a 
> [latch|https://github.com/apache/kafka/blob/0.10.2/core/src/main/scala/kafka/tools/MirrorMaker.scala#L396]
> * latch is never counted down in 
> [produce|https://github.com/apache/kafka/blob/0.10.2/core/src/main/scala/kafka/tools/MirrorMaker.scala#L434]
>  
> * thread will never exit the loop to countdown the latch on line 462.
> This can be seen with a thread dump of the shutdown hook thread
> {code}
> Name: MirrorMakerShutdownHook
> State: WAITING on java.util.concurrent.CountDownLatch$Sync@3ffebeac
> Total blocked: 0  Total waited: 1
> Stack trace: 
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
> kafka.tools.MirrorMaker$MirrorMakerThread.awaitShutdown(MirrorMaker.scala:498)
> kafka.tools.MirrorMaker$$anonfun$cleanShutdown$4.apply(MirrorMaker.scala:396)
> kafka.tools.MirrorMaker$$anonfun$cleanShutdown$4.apply(MirrorMaker.scala:396)
> scala.collection.Iterator$class.foreach(Iterator.scala:893)
> {code}
> The root of the issue more or less related to the issue documented here where 
> the producer can block waiting for metadata. 
> https://issues.apache.org/jira/browse/KAFKA-3450



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to