[ https://issues.apache.org/jira/browse/KAFKA-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189041#comment-14189041 ]

Neha Narkhede commented on KAFKA-1736:
--------------------------------------

[~kbanker] Your idea of introducing some sort of clumping is interesting and, 
as [~gwenshap] suggested, if the clumps were racks, it might work really well: 
a failed rack would not cause loss of availability, since no 2 replicas of a 
partition would share the same clump or rack. I also think, though, that we 
might have to experiment with varying clump sizes instead of just 3, so that 
we end up supporting topics with varying replication factors easily. And I 
wouldn't be opposed to allowing this replica assignment scheme to be added as 
an option. Before doing that, though, one thing to consider is the impact on 
leader placement, since the current strategy, in addition to just assigning 
replicas to brokers, also spreads the leaders evenly over the brokers in the 
cluster. It "staggers" the partition replicas on brokers such that, if leaders 
live on the preferred replica, you end up with an even leader placement, which 
in turn helps load balancing. 
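
To make the "staggering" concrete, here is a minimal, simplified sketch of the 
idea (in the spirit of AdminUtils.assignReplicasToBrokers, not the actual 
code): replica j of partition p lands on broker (startIndex + p + j) mod 
brokerCount, so the preferred replicas, and hence the leaders, rotate evenly 
around the cluster.

  def staggeredAssignment(nPartitions: Int, nBrokers: Int, rf: Int,
                          startIndex: Int = 0): Map[Int, Seq[Int]] =
    (0 until nPartitions).map { p =>
      // the first replica (j = 0) is the preferred leader; later replicas
      // are simply shifted by j
      p -> (0 until rf).map(j => (startIndex + p + j) % nBrokers)
    }.toMap

With 9 brokers, 9 partitions, rf=3 and startIndex=2 this produces replica 
lists like (2, 3, 4), (3, 4, 5), ..., so every broker is the preferred leader 
of exactly one partition.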

If we were to add pluggable replica placement strategies, we might have to 
expose them in the topics tool in addition to the reassignment tool. 

FYI - We have patches for rack aware replica placement (KAFKA-1215, KAFKA-1357)

> Improve partition-broker assignment strategy for better availability in 
> majority durability modes
> -----------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1736
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1736
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.8.1.1
>            Reporter: Kyle Banker
>            Priority: Minor
>         Attachments: Partitioner.scala
>
>
> The current random strategy of partition-to-broker distribution combined with 
> a fairly typical use of min.isr and request.acks results in a suboptimal 
> level of availability.
> Specifically, if all of your topics have a replication factor of 3, and you 
> use min.isr=2 and required.acks=all, then regardless of the number of 
> brokers in the cluster, you can safely lose only 1 node. Losing more than 1 
> node will, 95% of the time, result in the inability to write to at least one 
> partition, thus rendering the cluster unavailable. As the total number of 
> partitions increases, so does this probability.
> On the other hand, if partitions are distributed so that brokers are 
> effectively replicas of each other, then the probability of unavailability 
> when two nodes are lost is significantly decreased. This probability 
> continues to decrease as the size of the cluster increases and, more 
> significantly, this probability is constant with respect to the total number 
> of partitions. The only requirement for getting these numbers with this 
> strategy is that the number of brokers be a multiple of the replication 
> factor.
> Here are the results of some simulations I've run:
> With Random Partition Assignment
> Number of Brokers / Number of Partitions / Replication Factor / Probability 
> that two randomly selected nodes will contain at least 1 of the same 
> partitions
> 6  / 54 / 3 / .999
> 9  / 54 / 3 / .986
> 12 / 54 / 3 / .894
> Broker-Replica-Style Partitioning
> Number of Brokers / Number of Partitions / Replication Factor / Probability 
> that two randomly selected nodes will contain at least 1 of the same 
> partitions
> 6  / 54 / 3 / .424
> 9  / 54 / 3 / .228
> 12 / 54 / 3 / .168
> Adopting this strategy will greatly increase availability for users wanting 
> majority-style durability and should not change current behavior as long as 
> leader partitions are assigned evenly. I don't know of any negative impact 
> for other use cases, as in these cases, the distribution will still be 
> effectively random.
> Let me know if you'd like to see simulation code and whether a patch would be 
> welcome.
> EDIT: Just to clarify, here's how the current partition assigner would assign 
> 9 partitions with 3 replicas each to a 9-node cluster (partition number -> 
> set of replica brokers).
> 0 = Some(List(2, 3, 4))
> 1 = Some(List(3, 4, 5))
> 2 = Some(List(4, 5, 6))
> 3 = Some(List(5, 6, 7))
> 4 = Some(List(6, 7, 8))
> 5 = Some(List(7, 8, 9))
> 6 = Some(List(8, 9, 1))
> 7 = Some(List(9, 1, 2))
> 8 = Some(List(1, 2, 3))
> Here's how I'm proposing they be assigned:
> 0 = Some(ArrayBuffer(8, 5, 2))
> 1 = Some(ArrayBuffer(8, 5, 2))
> 2 = Some(ArrayBuffer(8, 5, 2))
> 3 = Some(ArrayBuffer(7, 4, 1))
> 4 = Some(ArrayBuffer(7, 4, 1))
> 5 = Some(ArrayBuffer(7, 4, 1))
> 6 = Some(ArrayBuffer(6, 3, 0))
> 7 = Some(ArrayBuffer(6, 3, 0))
> 8 = Some(ArrayBuffer(6, 3, 0))
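
To make the comparison above concrete, here is a rough, hypothetical sketch 
(not the attached Partitioner.scala) of the clumped layout together with a 
simulation along the lines of the one described: brokers are split into 
clumps of size rf, each partition's replica set is one whole clump (cycling 
over clumps here, which for this purpose is equivalent to the block grouping 
shown above), and the shared-partition probability is estimated by sampling.

  import scala.util.Random

  // Clumped layout: partitions cycle over broker groups of size rf, so two
  // brokers share a partition only if they belong to the same clump.
  def clumpedAssignment(nPartitions: Int, nBrokers: Int,
                        rf: Int): Map[Int, Seq[Int]] = {
    require(nBrokers % rf == 0,
      "broker count must be a multiple of the replication factor")
    val clumps = (0 until nBrokers).grouped(rf).toVector
    (0 until nPartitions).map(p => p -> clumps(p % clumps.size)).toMap
  }

  // Random layout: each partition picks rf distinct brokers at random.
  def randomAssignment(nPartitions: Int, nBrokers: Int, rf: Int,
                       rnd: Random): Map[Int, Seq[Int]] =
    (0 until nPartitions).map { p =>
      p -> rnd.shuffle((0 until nBrokers).toVector).take(rf)
    }.toMap

  // Monte Carlo estimate of the probability that two randomly chosen
  // brokers host at least one partition in common.
  def pSharedPartition(assignment: Map[Int, Seq[Int]], nBrokers: Int,
                       trials: Int, rnd: Random): Double = {
    val hits = (1 to trials).count { _ =>
      val pair = rnd.shuffle((0 until nBrokers).toVector).take(2)
      assignment.values.exists(rs => rs.contains(pair(0)) && rs.contains(pair(1)))
    }
    hits.toDouble / trials
  }

Under the clumped layout the probability is simply the chance that the two 
brokers fall in the same clump, (rf - 1) / (nBrokers - 1), i.e. about .4, .25 
and .18 for 6, 9 and 12 brokers at rf=3, which is roughly in line with the 
simulated figures above and, as noted, independent of the partition count.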



