[ https://issues.apache.org/jira/browse/KAFKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256255#comment-14256255 ]
Parth Brahmbhatt commented on KAFKA-1788:
-----------------------------------------

[~nehanarkhede] [~junrao] Can you provide input on what you think needs to be done here? There are 2 problems being discussed:

* No leader is actually available for a long time, which is the original issue in this jira. This is the case where all replicas are in a single DC/AZ and that DC/AZ faces an outage. In this case the record stays in the RecordAccumulator forever: no node is ever ready, so no retries are ever attempted, and since the max retries are never exhausted the batch is never dropped. The only way I see to solve this is by adding an expiry on batches and performing a cleanup of expired batches.
* Stale metadata, because NetworkClient.leastLoadedNode() returns a bad node and the client keeps retrying against it. Unless I am missing something here, I think this just indicates bad configuration. We could reduce the default TCP connection/socket read timeout so we fail fast, but I am not entirely sure we need to do anything in code to handle this case. The method already goes through all the nodes in the bootstrap list, as leastLoadedNode() starts off with this.metadata.fetch().nodes() and tries to find a good node with the fewest outstanding requests.

> producer record can stay in RecordAccumulator forever if leader is no
> available
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-1788
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1788
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, producer
>    Affects Versions: 0.8.2
>            Reporter: Jun Rao
>            Assignee: Jun Rao
>              Labels: newbie++
>             Fix For: 0.8.3
>
>
> In the new producer, when a partition has no leader for a long time (e.g.,
> all replicas are down), the records for that partition will stay in the
> RecordAccumulator until the leader is available. This may cause the
> bufferpool to be full and the callback for the produced message to block for
> a long time.
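
To make the batch-expiry idea from the first bullet of the comment concrete, here is a minimal sketch of what such a cleanup could look like. It is not Kafka's RecordAccumulator: the class BatchExpirySketch, its Batch type, the maxAgeMs setting and the expireBatches() method are all hypothetical names chosen for illustration, and it assumes batches are appended with non-decreasing timestamps.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for an accumulator; not Kafka's actual RecordAccumulator. */
class BatchExpirySketch {

    /** A queued batch stamped with the time it was created. */
    static final class Batch {
        final String partition;
        final long createdMs;

        Batch(String partition, long createdMs) {
            this.partition = partition;
            this.createdMs = createdMs;
        }
    }

    private final Deque<Batch> queue = new ArrayDeque<>();
    private final long maxAgeMs; // hypothetical expiry setting, not a real Kafka config

    BatchExpirySketch(long maxAgeMs) {
        this.maxAgeMs = maxAgeMs;
    }

    /** Queues a new batch; assumes nowMs never decreases between calls. */
    void append(String partition, long nowMs) {
        queue.addLast(new Batch(partition, nowMs));
    }

    /**
     * Removes and returns every batch older than maxAgeMs. Because batches
     * are appended in time order, the oldest one is always at the head, so
     * the scan can stop at the first batch that is still young enough.
     */
    List<Batch> expireBatches(long nowMs) {
        List<Batch> expired = new ArrayList<>();
        while (!queue.isEmpty() && nowMs - queue.peekFirst().createdMs > maxAgeMs) {
            expired.add(queue.pollFirst());
        }
        return expired;
    }

    int pending() {
        return queue.size();
    }

    public static void main(String[] args) {
        BatchExpirySketch acc = new BatchExpirySketch(1000L);
        acc.append("topic-0", 0L);
        acc.append("topic-0", 500L);
        List<Batch> expired = acc.expireBatches(1200L);
        System.out.println("expired=" + expired.size() + " pending=" + acc.pending());
    }
}
```

In a real producer the caller would presumably fail the callbacks of the expired batches with a timeout error and return their memory to the buffer pool, so that a partition with no leader cannot fill the buffer indefinitely.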