[ https://issues.apache.org/jira/browse/KAFKA-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188628#comment-14188628 ]
Kyle Banker commented on KAFKA-1736:
------------------------------------

I'd like to propose one more idea: if we don't want to add another knob to
Kafka, why not just modify the partition assignment algorithm so that, when
the number of brokers is a multiple of the replication factor, you get
broker-level replication? When it's not, the arrangement will still be
basically random, as it is now. I don't see how that wouldn't be a better
default.

> Improve partition-broker assignment strategy for better availability in
> majority durability modes
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-1736
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1736
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.8.1.1
>            Reporter: Kyle Banker
>            Priority: Minor
>
> The current random strategy of partition-to-broker distribution, combined
> with a fairly typical use of min.isr and required.acks, results in a
> suboptimal level of availability.
>
> Specifically, if all of your topics have a replication factor of 3, and you
> use min.isr=2 and required.acks=all, then regardless of the number of
> brokers in the cluster, you can safely lose only 1 node. Losing more than 1
> node will, 95% of the time, result in the inability to write to at least
> one partition, thus rendering the cluster unavailable. As the total number
> of partitions increases, so does this probability.
>
> On the other hand, if partitions are distributed so that brokers are
> effectively replicas of each other, then the probability of unavailability
> when two nodes are lost is significantly decreased. This probability
> continues to decrease as the size of the cluster increases and, more
> significantly, it is constant with respect to the total number of
> partitions. The only requirement for getting these numbers with this
> strategy is that the number of brokers be a multiple of the replication
> factor.
>
> Here are the results of some simulations I've run. Each row gives the
> number of brokers, the number of partitions, the replication factor, and
> the probability that two randomly selected nodes will contain at least 1
> of the same partitions.
>
> With Random Partition Assignment
>
> Brokers / Partitions / Replication Factor / Probability
>  6 / 54 / 3 / .999
>  9 / 54 / 3 / .986
> 12 / 54 / 3 / .894
>
> With Broker-Replica-Style Partitioning
>
> Brokers / Partitions / Replication Factor / Probability
>  6 / 54 / 3 / .424
>  9 / 54 / 3 / .228
> 12 / 54 / 3 / .168
>
> Adopting this strategy will greatly increase availability for users wanting
> majority-style durability and should not change current behavior as long as
> leader partitions are assigned evenly. I don't know of any negative impact
> for other use cases, as in those cases the distribution will still be
> effectively random.
>
> Let me know if you'd like to see simulation code and whether a patch would
> be welcome.
>
> EDIT: Just to clarify, here's how the current partition assigner would
> assign 9 partitions with 3 replicas each to a 9-node cluster (partition
> number -> replica list):
> 0 = Some(List(2, 3, 4))
> 1 = Some(List(3, 4, 5))
> 2 = Some(List(4, 5, 6))
> 3 = Some(List(5, 6, 7))
> 4 = Some(List(6, 7, 8))
> 5 = Some(List(7, 8, 9))
> 6 = Some(List(8, 9, 1))
> 7 = Some(List(9, 1, 2))
> 8 = Some(List(1, 2, 3))
>
> Here's how I'm proposing they be assigned:
>
> 0 = Some(ArrayBuffer(8, 5, 2))
> 1 = Some(ArrayBuffer(8, 5, 2))
> 2 = Some(ArrayBuffer(8, 5, 2))
> 3 = Some(ArrayBuffer(7, 4, 1))
> 4 = Some(ArrayBuffer(7, 4, 1))
> 5 = Some(ArrayBuffer(7, 4, 1))
> 6 = Some(ArrayBuffer(6, 3, 0))
> 7 = Some(ArrayBuffer(6, 3, 0))
> 8 = Some(ArrayBuffer(6, 3, 0))
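To make the proposal concrete, here is a minimal sketch of the assignment rule described above. It is illustrative only: ProposedAssigner is an invented name, not a patch against Kafka's real assigner (which lives around kafka.admin.AdminUtils in this era), and the random branch merely stands in for the existing behavior. When the broker count is a multiple of the replication factor it produces broker-level replication, rotating the replica order within each group so that preferred leaders stay evenly assigned; otherwise it falls back to an effectively random placement.

{code:scala}
import scala.util.Random

object ProposedAssigner {
  val rand = new Random()

  // partition -> ordered replica list (first entry = preferred leader)
  def assign(brokers: Seq[Int], partitions: Int, rf: Int): Map[Int, Seq[Int]] = {
    if (brokers.size % rf == 0) {
      // Broker-level replication: carve the cluster into fixed groups
      // of rf brokers and give each partition one whole group.
      val groups = brokers.grouped(rf).toVector
      (0 until partitions).map { p =>
        val g = groups(p % groups.size)
        // Rotate within the group so preferred leaders spread evenly.
        val rot = (p / groups.size) % rf
        p -> (g.drop(rot) ++ g.take(rot))
      }.toMap
    } else {
      // Broker count not a multiple of rf: keep an effectively random
      // placement, as happens today.
      (0 until partitions).map { p =>
        p -> rand.shuffle(brokers.toList).take(rf)
      }.toMap
    }
  }

  def main(args: Array[String]): Unit = {
    // 9 brokers, 9 partitions, rf=3: three broker groups, each holding
    // three partitions, mirroring the proposed layout above.
    assign(0 until 9, 9, 3).toSeq.sortBy(_._1).foreach { case (p, rs) =>
      println(s"$p = $rs")
    }
  }
}
{code}

The broker IDs differ from the ticket's ArrayBuffer example, which uses the groups (8, 5, 2), (7, 4, 1), and (6, 3, 0), but the structure is the same: each partition's replica set is one of brokers/rf fixed broker groups.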
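Since the ticket mentions simulation code without attaching it, here is a rough Monte Carlo sketch of the quantity being measured; this is an assumed reconstruction, not the author's code, and OverlapSim and its helpers are invented names. It estimates, for each assignment style, the probability that two randomly chosen brokers host a replica of at least one common partition.

{code:scala}
import scala.util.Random

object OverlapSim {
  val rand = new Random()

  // Fully random placement: each partition picks rf distinct brokers.
  // (The real 0.8.x assigner is round-robin from a random offset, which
  // the ticket treats as effectively random for this purpose.)
  def randomAssignment(brokers: Int, partitions: Int, rf: Int): Seq[Set[Int]] =
    Seq.fill(partitions)(rand.shuffle((0 until brokers).toList).take(rf).toSet)

  // Broker-replica placement: fixed groups of rf brokers; each partition
  // is assigned to every broker in one group.
  def groupedAssignment(brokers: Int, partitions: Int, rf: Int): Seq[Set[Int]] = {
    val groups = (0 until brokers).grouped(rf).map(_.toSet).toVector
    (0 until partitions).map(p => groups(p % groups.size))
  }

  // Monte Carlo estimate of P(two randomly chosen brokers both hold a
  // replica of at least one common partition).
  def overlapProbability(assign: => Seq[Set[Int]], brokers: Int, trials: Int): Double = {
    val hits = (1 to trials).count { _ =>
      val replicaSets = assign // re-sampled each trial
      val picked = rand.shuffle((0 until brokers).toList).take(2)
      replicaSets.exists(rs => rs.contains(picked(0)) && rs.contains(picked(1)))
    }
    hits.toDouble / trials
  }

  def main(args: Array[String]): Unit = {
    for (brokers <- Seq(6, 9, 12)) {
      val pRandom  = overlapProbability(randomAssignment(brokers, 54, 3), brokers, 100000)
      val pGrouped = overlapProbability(groupedAssignment(brokers, 54, 3), brokers, 100000)
      println(f"$brokers%2d brokers, 54 partitions, rf=3: random=$pRandom%.3f grouped=$pGrouped%.3f")
    }
  }
}
{code}

With these fixed groups the grouped figure converges to (rf-1)/(brokers-1), since two brokers overlap exactly when they share a group; both styles land in the same neighborhood as the table above rather than on its exact values, because both placement models here only approximate the ticket's actual procedures.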