[ https://issues.apache.org/jira/browse/KAFKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333041#comment-14333041 ]
Ewen Cheslack-Postava commented on KAFKA-1788:
----------------------------------------------

Ok, I'll try to clear up a few issues.

I think that just making sure NetworkClient.leastLoadedNode eventually returns all nodes isn't sufficient. I was just raising another case where this issue could occur. The reason it isn't sufficient for the original case is the type of situation [~Bmis13] raises: if you have a temporary network outage to a single broker (e.g. due to a firewall misconfiguration or a network partition), that broker may still correctly be listed as leader. If holding the data in the RecordAccumulator only affected data destined for that one broker, then as [~jkreps] points out, we could potentially get away with holding on to the messages indefinitely, since errors should manifest in other ways. (I think it's *better* to have the timeouts, but they're not strictly necessary.) However, since the RecordAccumulator is a shared resource, holding onto these messages also means you're going to block sending data to other brokers once your buffer fills up with data for the unreachable broker. Adding timeouts at least ensures messages for those other brokers will eventually get a chance to be sent, even if there are periods where they are rejected because the buffer is already full.

So [~parth.brahmbhatt], I think the approach you're taking in the patch is definitely the right thing to do, and I agree with [~Bmis13] that the error record metrics definitely should (eventually) be increasing.

More generally -- yes, pretty much everything that could block things up for a long time or indefinitely *should* have a timeout. In a lot of cases this is true even if the operation will eventually time out "naturally", e.g. due to a TCP timeout. It's better to have control over the timeout (even if we highly recommend using the default values) than to rely on settings from other systems, especially when they may be adjusted in unexpected ways outside of our control. This is a pervasive concern that we should keep an eye out for in new code, and we should file JIRAs as we find missing timeouts in existing code.

Given the above, I think the options for controlling memory usage may not be very good for some use cases. We've been saying people should use a single producer where possible, since it's very fast and you actually benefit from sharing the network thread: all data for all topic-partitions destined for the same broker can be collected into a single request. But it turns out that sharing the underlying resource (the buffer) can lead to starvation for some topic-partitions when it shouldn't really be necessary. Would it make sense to allow a per-topic, or even per-partition, limit on memory usage? The effect would be similar to fetch.message.max.bytes for the consumer, where your actual memory usage cap is n times the value, with n being the number of topic-partitions you're working with. It could also be per broker, but I think that leads to much less intuitive and harder to predict behavior. If people think that's a good idea we can file an additional issue for it.
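To make the producer-side effect concrete, here is a minimal sketch of how an application might observe the behavior discussed above once an accumulator timeout is in place. It assumes the timeout from the patch ends up surfacing as a TimeoutException in the send callback (rather than some other exception type), and the broker addresses, topic names, and config values are purely illustrative:

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.errors.TimeoutException;

public class ExpiryCallbackSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // illustrative addresses
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // buffer.memory is shared across ALL topic-partitions; records for an
        // unreachable leader sit in this buffer and can starve sends to healthy brokers.
        props.put("buffer.memory", "33554432");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // With an accumulator timeout, records that cannot be sent should
        // eventually fail the callback instead of staying buffered forever,
        // and the error-record metrics should increase accordingly.
        Callback onCompletion = new Callback() {
            @Override
            public void onCompletion(RecordMetadata metadata, Exception exception) {
                if (exception instanceof TimeoutException) {
                    // Record expired before it could be delivered (assumed exception type).
                    System.err.println("Record expired in the accumulator: " + exception.getMessage());
                } else if (exception != null) {
                    System.err.println("Send failed: " + exception);
                }
            }
        };

        // Hypothetical topics: one whose leader is unreachable, one on a healthy broker.
        producer.send(new ProducerRecord<>("topic-on-unreachable-leader", "k", "v"), onCompletion);
        // Without expiry, once the shared buffer fills with records for the dead
        // leader, this send to a healthy broker blocks or is rejected as well.
        producer.send(new ProducerRecord<>("topic-on-healthy-leader", "k", "v"), onCompletion);

        producer.close();
    }
}
{code}

The key point is that without expiry, the second send can block or be rejected solely because the shared buffer is full of records for the unreachable leader, which is exactly the starvation described above.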
> producer record can stay in RecordAccumulator forever if leader is not available
> ---------------------------------------------------------------------------------
>
>                 Key: KAFKA-1788
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1788
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, producer
>    Affects Versions: 0.8.2.0
>            Reporter: Jun Rao
>            Assignee: Jun Rao
>              Labels: newbie++
>             Fix For: 0.8.3
>
>         Attachments: KAFKA-1788.patch, KAFKA-1788_2015-01-06_13:42:37.patch, KAFKA-1788_2015-01-06_13:44:41.patch
>
>
> In the new producer, when a partition has no leader for a long time (e.g., all replicas are down), the records for that partition will stay in the RecordAccumulator until the leader is available. This may cause the bufferpool to be full and the callback for the produced message to block for a long time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)