Thanks for the update. This proposal looks reasonable to me. One issue that needs a bit more thinking is what happens when a broker is moved from one rack to another. After the move, constraints on max-rack-replication could be violated. The question is what we should do when this happens. A less intrusive approach is to let the broker start and simply log a warning about the violation. Not sure if this is the best approach though.
Jun On Sun, Jan 26, 2014 at 10:07 AM, Joris VanRemoortere < jvanremoort...@tagged.com> wrote: > Thanks Jun, > > As you said, rack-aware assignment needs to be supported during topic > creation, adding of partitions, and partition re-assignment. > > In order to reduce the risk of using a different value of > max-rack-replication during different calls to those functions, I'd like to > carry / persist that information as part of the topic config. This way the > option is mandatory during topic creation, and optional partition > re-assign. > > Here is my proposal: > > Topic Create: > > - max-rack-replication mandatory > - store max-rack-replication as a topic config > > Partition Add: > > - Use the stored value of max-rack-replication to add another partition > using the same max-rack-replication value as the rest of the partitions > in > the topic > > Partition Re-assign: > > - If max-rack-replication is provided, treat this as an altered config. > This forces a re-validation of all replica assignments of all partitions > for the topic. If this succeeds, save the updated topic config. > - If max-rack-replication is not provided, used the saved value in the > topic config. > > Alter Topic: > > - If someone over-rides the max-rack-replication value, we re-validate > all replica assignments as with partition re-assignment. > > I think this strategy will be less error prone, making it impossible to > have multiple partitions in a topic using different max-rack-replication > values. Let me know if this sounds ok to you, > > Joris > > > On Sun, Jan 26, 2014 at 9:15 AM, Jun Rao <jun...@gmail.com> wrote: > > > Replica assignments can change during topic creation, adding partitions, > > and reassign partitions. So, it would be good to integrate rack-aware > > assignment in those cases. > > > > For compatibility, to do a rolling upgrade of a Kafka cluster means that > at > > a given point of time, some brokers could be registered in the old format > > and some others could be registered in the new format. Both the consumer > > and the controller ( > > > > > https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Controller+Internals > > ) > > read the broker registration in ZK. So, we need to make sure that the new > > code can read the old format and the old code can read the new format > too. > > For example, in Broker.createBroker(), if the code is looking for a field > > "rackid" but couldn't find it, we need to be able to provide a reasonable > > default. > > > > Also, take a look at the format of broker registration in ZK in > > > > > https://cwiki.apache.org/confluence/display/KAFKA/Kafka+data+structures+in+Zookeeper > > . > > Since we are changing the format, we should probably change the > > version > > from 1 to 2. However, we can make this change backward compatible. > > > > Thanks, > > > > Jun > > > > > > On Fri, Jan 24, 2014 at 1:57 PM, Joris VanRemoortere < > > jvanremoort...@tagged.com> wrote: > > > > > Working on the above. > > > > > > Since trunk currently allows Topic configs, and altering topics, does > it > > > make sense to you to make this a topic config so that it is kept in > sync > > > between topic alterations and partition adds? > > > > > > For compatibility: I can make the zookeeper state parsing backwards / > > > forwards compatible. I don't think I can make the readFrom / writeTo > > > ByteBuffer compatible since chained calls of these during api parsing > are > > > size dependent? Can you clarify exactly what you expect to be > compatible? > > > > > > I'm also not very familiar with the consequences of changing the broker > > > version in zookeeper registration. Can you please provide a reference > to > > > documentation for this? > > > > > > Thanks, > > > > > > Joris > > > > > > > > > On Fri, Jan 24, 2014 at 9:43 AM, Jun Rao (JIRA) <j...@apache.org> > wrote: > > > > > > > > > > > [ > > > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881193#comment-13881193 > > > ] > > > > > > > > Jun Rao commented on KAFKA-1215: > > > > -------------------------------- > > > > > > > > Thanks for the patch. Looks good over all. Some comments. > > > > > > > > 1. KafkaConfig: > > > > 1.1 We need a config for default max-rack-replication for auto topic > > > > creation. > > > > 1.2 rackId: We probably don't want to make this a required property. > > So, > > > > perhaps we can default it to 0? > > > > > > > > 2. AdminUtils.assignReplicasToBrokers(): > > > > 2.1 Could you add some comments on the rack-aware assignment > algorithm? > > > > 2.2 It's a bit weird for this method to take zkclient in the input. > We > > > > probably can pass in a list of Broker objects instead. > > > > > > > > 3. Unit tests: I suggest that we leave most existing tests intact by > > > > keeping the rackId default and add a new test for rack-aware > > assignment. > > > > > > > > 4. Compatibility test: It seems that the changes for the broker > format > > in > > > > ZK is backward compatible. Could you double check? For example, an > old > > > > reader (controller, consumer, etc) should be able to parse the broker > > > > registered in new format and a new reader should be able to parse the > > > > broker registered in the old format. Also, we probably should > increase > > > the > > > > version in the ZK registration for the broker. > > > > > > > > 5. Could you rebase to trunk? > > > > > > > > > Rack-Aware replica assignment option > > > > > ------------------------------------ > > > > > > > > > > Key: KAFKA-1215 > > > > > URL: > > https://issues.apache.org/jira/browse/KAFKA-1215 > > > > > Project: Kafka > > > > > Issue Type: Improvement > > > > > Components: replication > > > > > Affects Versions: 0.8.0, 0.8.1 > > > > > Reporter: Joris Van Remoortere > > > > > Assignee: Neha Narkhede > > > > > Fix For: 0.8.0, 0.8.1 > > > > > > > > > > Attachments: rack_aware_replica_assignment_v1.patch > > > > > > > > > > > > > > > Adding a rack-id to kafka config. This rack-id can be used during > > > > replica assignment by using the max-rack-replication argument in the > > > admin > > > > scripts (create topic, etc.). By default the original replication > > > > assignment algorithm is used because max-rack-replication defaults to > > -1. > > > > max-rack-replication > -1 is not honored if you are doing manual > > replica > > > > assignment (preffered). > > > > > If this looks good I can add some test cases specific to the > > rack-aware > > > > assignment. > > > > > I can also port this to trunk. We are currently running 0.8.0 in > > > > production and need this, so i wrote the patch against that. > > > > > > > > > > > > > > > > -- > > > > This message was sent by Atlassian JIRA > > > > (v6.1.5#6160) > > > > > > > > > >