[ https://issues.apache.org/jira/browse/KAFKA-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940748#comment-13940748 ]
Jun Rao commented on KAFKA-1303:
--------------------------------

The old producer works as follows. The producer always uses the brokers specified in metadata.broker.list for issuing metadata requests. The socket connections for sending metadata requests are separate from those used for sending produce requests. metadata.broker.list can be configured with a vip or a list of brokers. In either case, it is the client's responsibility to make sure that there is at least 1 live broker in metadata.broker.list. The benefit of this approach is that metadata requests are never blocked behind produce requests, which reduces the probability of failed produce requests due to stale metadata.

The new producer uses metadata.broker.list only when sending the very first metadata request. After that, it uses the cluster info returned in the metadata response for issuing subsequent metadata and produce requests. The client is still responsible for making sure that there is at least 1 live broker in metadata.broker.list. Otherwise, the producer won't work after a restart. This approach has the potential benefit of using fewer socket connections and balancing the metadata requests among more brokers after cluster expansion. However, the current implementation has the downside that metadata requests can be queued behind produce requests.

My feeling is that the approach in the old producer gives a better tradeoff. Metadata requests are cheap and cluster expansion is rare, so load balancing metadata requests among new brokers is not that critical. To implement this behavior in the new producer, we can keep the metadata brokers in Cluster and only use those brokers for issuing metadata requests. To reduce the number of sockets, we can close the socket either immediately after the metadata response is received or after the socket has been idle for some time.
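To make the proposal concrete, here is a rough sketch of the bookkeeping it implies: metadata requests rotate over the configured metadata brokers only, on connections that produce requests never share, and idle metadata connections are closed to keep the socket count down. Everything here is hypothetical and for illustration only. MetadataBrokerPool, acquire, closeIdle and the idle timeout parameter are made-up names, not part of the Kafka client, and no real sockets are opened.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch, not Kafka code: tracks dedicated metadata connections
 * to the brokers configured in metadata.broker.list.
 */
final class MetadataBrokerPool {
    private final List<String> metadataBrokers;   // assumed to come from metadata.broker.list
    private final long idleTimeoutMs;             // assumed setting; 0 means "close right after use"
    private final Map<String, Long> lastUsedMs = new HashMap<>(); // open metadata connections
    private int next = 0;                         // round-robin cursor

    MetadataBrokerPool(List<String> metadataBrokers, long idleTimeoutMs) {
        if (metadataBrokers.isEmpty()) {
            throw new IllegalArgumentException("need at least one metadata broker");
        }
        this.metadataBrokers = new ArrayList<>(metadataBrokers);
        this.idleTimeoutMs = idleTimeoutMs;
    }

    /** Picks the next configured broker and marks its metadata connection as used at nowMs. */
    String acquire(long nowMs) {
        String broker = metadataBrokers.get(next);
        next = (next + 1) % metadataBrokers.size();
        lastUsedMs.put(broker, nowMs);
        return broker;
    }

    /** Closes metadata connections idle for at least idleTimeoutMs and returns their brokers. */
    List<String> closeIdle(long nowMs) {
        List<String> closed = new ArrayList<>();
        lastUsedMs.entrySet().removeIf(e -> {
            boolean idle = nowMs - e.getValue() >= idleTimeoutMs;
            if (idle) {
                closed.add(e.getKey());
            }
            return idle;
        });
        return closed;
    }

    /** Number of metadata connections currently considered open. */
    int openConnections() {
        return lastUsedMs.size();
    }
}
```

With an idle timeout of 0, calling closeIdle right after the metadata response arrives corresponds to the "close immediately" option; a larger timeout corresponds to the "close after being idle for some time" option.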
> metadata request in the new producer can be delayed
> ---------------------------------------------------
>
>                 Key: KAFKA-1303
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1303
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.2
>            Reporter: Jun Rao
>
> While debugging a system test, I observed the following.
> 1. A broker side configuration
> (replica.fetch.wait.max.ms=500,replica.fetch.min.bytes=4096) made the time to
> complete a produce request long (each taking about 500ms with ack=-1).
> 2. The producer client has a bunch of outstanding produce requests queued up
> on the brokers.
> 3. One of the brokers fails and we force updating the metadata.
> 4. The metadata request is queued up behind those outstanding producer
> requests.
> 5. By the time the metadata response comes back, some messages have failed
> all retries because of stale metadata.

--
This message was sent by Atlassian JIRA
(v6.2#6252)