[ https://issues.apache.org/jira/browse/KAFKA-188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14565723#comment-14565723 ]
chenshangan commented on KAFKA-188:
-----------------------------------

[~jkreps] 
"I recommend we instead leave this as it is for initial placement and implement 
"rebalancing" option that actively migrates partitions to balance data between 
directories. This is harder to implement but I think it is what you actually 
want."

Exactly, this is what I really want, but it's pretty hard to implement. And in 
our use case we seldom create a bunch of topics at the same time; topics are 
added day by day.

Common use cases:
1. A new Kafka cluster is set up, and lots of topics from another Kafka cluster 
or other systems dump data into it. The segment-count placement policy works 
well here because all topics start from zero, so segment counts stay consistent 
with partition counts.

2. An existing Kafka cluster where topics are added day by day. This is the 
ideal case; the segment-count policy works well.

3. An existing Kafka cluster where topics are added in a bunch. This might put 
all the new topics on the same least-loaded directory, which of course has bad 
consequences. But if the cluster is big enough, each broker has enough disks 
and capacity, and this is not a common pattern, the consequences will not be so 
serious. Users of this option should consider how to avoid such situations.

Above all, it's worth providing such an option. But if we can also implement a 
"rebalancing" option, it would be perfect.

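To make the segment-based placement above concrete, here is a rough Java sketch 
(not Kafka's actual LogManager code; the class name, paths and helpers are made 
up for illustration) of picking the data directory that currently holds the 
fewest log segments:

import java.io.File;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SegmentCountPlacement {

    // Count *.log segment files under every partition directory inside dataDir.
    static long countSegments(File dataDir) {
        File[] partitionDirs = dataDir.listFiles(File::isDirectory);
        if (partitionDirs == null) return 0;
        long segments = 0;
        for (File partitionDir : partitionDirs) {
            File[] logs = partitionDir.listFiles((dir, name) -> name.endsWith(".log"));
            if (logs != null) segments += logs.length;
        }
        return segments;
    }

    // Place the next new partition on the directory holding the fewest segments.
    static File pickDirectory(List<File> logDirs) {
        return logDirs.stream()
                .min(Comparator.comparingLong(SegmentCountPlacement::countSegments))
                .orElseThrow(() -> new IllegalArgumentException("no log dirs configured"));
    }

    public static void main(String[] args) {
        // Hypothetical log.dirs=/data1/kafka,/data2/kafka layout.
        List<File> logDirs = Arrays.asList(new File("/data1/kafka"), new File("/data2/kafka"));
        System.out.println("next partition goes to: " + pickDirectory(logDirs));
    }
}

Case 3 above is exactly the weak spot of such a greedy pick: when a bunch of 
empty topics is created at once, every directory looks almost equally empty, so 
the new partitions can skew toward one directory until real data arrives.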

> Support multiple data directories
> ---------------------------------
>
>                 Key: KAFKA-188
>                 URL: https://issues.apache.org/jira/browse/KAFKA-188
>             Project: Kafka
>          Issue Type: New Feature
>            Reporter: Jay Kreps
>            Assignee: Jay Kreps
>             Fix For: 0.8.0
>
>         Attachments: KAFKA-188-v2.patch, KAFKA-188-v3.patch, 
> KAFKA-188-v4.patch, KAFKA-188-v5.patch, KAFKA-188-v6.patch, 
> KAFKA-188-v7.patch, KAFKA-188-v8.patch, KAFKA-188.patch
>
>
> Currently we allow only a single data directory. This means that a multi-disk 
> configuration needs to be a RAID array or LVM volume or something like that 
> to be mounted as a single directory.
> For a high-throughput low-reliability configuration this would mean RAID0 
> striping. Common wisdom in Hadoop land has it that a JBOD setup that just 
> mounts each disk as a separate directory and does application-level balancing 
> over these results in about 30% write-improvement. For example see this claim 
> here:
>   http://old.nabble.com/Re%3A-RAID-vs.-JBOD-p21466110.html
> It is not clear to me why this would be the case--it seems the RAID 
> controller should be able to balance writes as well as the application so it 
> may depend on the details of the setup.
> Nonetheless this would be really easy to implement, all you need to do is add 
> multiple data directories and balance partition creation over these disks.
> One problem this might cause is if a particular topic is much larger than the 
> others it might unbalance the load across the disks. The partition->disk 
> assignment policy should probably attempt to evenly spread each topic to 
> avoid this, rather than just trying to keep the number of partitions balanced 
> between disks.
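
As a hedged illustration of the per-topic spreading suggested in the last 
paragraph of the description (again a made-up sketch, not the actual patch): 
rather than only balancing total partition counts, each topic's partitions 
could be assigned round-robin across the configured data directories, so a 
single very large topic cannot pile up on one disk.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerTopicSpreadPlacement {

    private final List<String> logDirs;                 // e.g. /data1/kafka, /data2/kafka, /data3/kafka
    private final Map<String, Integer> assignedPerTopic = new HashMap<>();

    PerTopicSpreadPlacement(List<String> logDirs) {
        this.logDirs = logDirs;
    }

    // Assign the next partition of `topic` round-robin over the data directories.
    String assign(String topic) {
        int partitionIndex = assignedPerTopic.merge(topic, 1, Integer::sum) - 1;
        // Offset the starting directory by a hash of the topic name so that
        // partition 0 of every topic does not always land on the first directory.
        int start = Math.floorMod(topic.hashCode(), logDirs.size());
        return logDirs.get((start + partitionIndex) % logDirs.size());
    }

    public static void main(String[] args) {
        PerTopicSpreadPlacement placement = new PerTopicSpreadPlacement(
                Arrays.asList("/data1/kafka", "/data2/kafka", "/data3/kafka"));
        for (int partition = 0; partition < 6; partition++)
            System.out.println("big-topic partition " + partition + " -> " + placement.assign("big-topic"));
    }
}

With three directories, the six partitions of "big-topic" end up two per 
directory, i.e. each topic is spread evenly, whereas balancing only the global 
partition count could still leave one huge topic concentrated on a single disk.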



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
