Thanks for the update. This proposal looks reasonable to me.

One issue that needs a bit more thought is what happens when a broker is
moved from one rack to another. After the move, constraints on
max-rack-replication could be violated. The question is what we should do
when this happens. A less intrusive approach is to let the broker start and
simply log a warning about the violation. I'm not sure if this is the best
approach though.
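
A rough sketch of what that startup check could look like (the method and
parameter names below are hypothetical, not from the patch):

    // Hypothetical startup check: warn (rather than refuse to start) when a
    // topic's replica placement violates its max-rack-replication constraint,
    // e.g. after a broker has been moved to a different rack.
    def warnOnRackViolations(assignments: Map[(String, Int), Seq[Int]],
                             brokerRacks: Map[Int, String],
                             maxRackReplication: Int): Unit = {
      if (maxRackReplication > -1) {
        for (((topic, partition), replicas) <- assignments) {
          val perRack = replicas.groupBy(id => brokerRacks.getOrElse(id, "unknown"))
          for ((rack, rs) <- perRack if rs.size > maxRackReplication)
            // in Kafka this would go through the Logging trait's warn()
            println("WARN: partition [%s,%d] has %d replicas on rack %s, exceeding max-rack-replication=%d"
              .format(topic, partition, rs.size, rack, maxRackReplication))
        }
      }
    }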

Jun


On Sun, Jan 26, 2014 at 10:07 AM, Joris VanRemoortere <
jvanremoort...@tagged.com> wrote:

> Thanks Jun,
>
> As you said, rack-aware assignment needs to be supported during topic
> creation, adding of partitions, and partition re-assignment.
>
> In order to reduce the risk of using a different value of
> max-rack-replication across different calls to those functions, I'd like to
> carry / persist that information as part of the topic config. This way the
> option is mandatory during topic creation and optional during partition
> re-assignment.
>
> Here is my proposal:
>
> Topic Create:
>
>    - max-rack-replication mandatory
>    - store max-rack-replication as a topic config
>
> Partition Add:
>
>    - Use the stored value of max-rack-replication so the new partition is
>    assigned with the same constraint as the rest of the partitions in the
>    topic
>
> Partition Re-assign:
>
>    - If max-rack-replication is provided, treat this as an altered config.
>    This forces a re-validation of all replica assignments of all partitions
>    for the topic. If this succeeds, save the updated topic config.
>    - If max-rack-replication is not provided, use the saved value in the
>    topic config.
>
> Alter Topic:
>
>    - If someone overrides the max-rack-replication value, we re-validate
>    all replica assignments as with partition re-assignment.
>
> I think this strategy will be less error-prone, since it makes it impossible
> for multiple partitions of a topic to use different max-rack-replication
> values. Let me know if this sounds OK to you.
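>
> As a rough illustration of the config round-trip (the property name and
> helper methods below are hypothetical, not actual Kafka APIs):
>
>    // Hypothetical helpers for persisting max-rack-replication as a
>    // per-topic config and reading it back on partition add / re-assign.
>    import java.util.Properties
>
>    val MaxRackReplicationProp = "max.rack.replication"
>
>    def writeMaxRackReplication(props: Properties, value: Int): Properties = {
>      props.setProperty(MaxRackReplicationProp, value.toString)
>      props
>    }
>
>    // Returns -1 (constraint disabled) when the config was never set.
>    def readMaxRackReplication(props: Properties): Int =
>      Option(props.getProperty(MaxRackReplicationProp)).map(_.toInt).getOrElse(-1)
>
>    // Partition add: reuse readMaxRackReplication(...) so every partition of
>    // the topic shares one constraint. Re-assign / alter with an explicit
>    // value: re-validate all assignments, then writeMaxRackReplication(...).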
>
> Joris
>
>
> On Sun, Jan 26, 2014 at 9:15 AM, Jun Rao <jun...@gmail.com> wrote:
>
> > Replica assignments can change during topic creation, adding partitions,
> > and partition re-assignment. So, it would be good to integrate rack-aware
> > assignment in all of those cases.
> >
> > For compatibility: a rolling upgrade of a Kafka cluster means that at a
> > given point in time, some brokers could be registered in the old format
> > and others could be registered in the new format. Both the consumer and
> > the controller (
> > https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Controller+Internals
> > ) read the broker registration in ZK. So, we need to make sure that the
> > new code can read the old format and the old code can read the new format
> > too. For example, in Broker.createBroker(), if the code looks for a field
> > "rackid" but can't find it, we need to be able to provide a reasonable
> > default.
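> >
> > For illustration, a minimal sketch of that defaulting logic (the helper
> > and field handling below are assumptions, not the actual Broker code):
> >
> >   // Hypothetical: when the "rackid" field is absent (old-format
> >   // registration), fall back to a default so old and new brokers can
> >   // coexist during a rolling upgrade.
> >   def rackIdFrom(registration: Map[String, Any], defaultRackId: Int = 0): Int =
> >     registration.get("rackid").map(_.toString.toInt).getOrElse(defaultRackId)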
> >
> > Also, take a look at the format of broker registration in ZK:
> > https://cwiki.apache.org/confluence/display/KAFKA/Kafka+data+structures+in+Zookeeper
> > Since we are changing the format, we should probably change the version
> > from 1 to 2. However, we can make this change backward compatible.
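> >
> > For example (illustrative only; see the wiki page above for the
> > authoritative layout, and the "rackid" field here is the proposed
> > addition):
> >
> >   version 1: {"version":1,"host":"broker1","port":9092,"jmx_port":-1,"timestamp":"..."}
> >   version 2: {"version":2,"host":"broker1","port":9092,"jmx_port":-1,"timestamp":"...","rackid":0}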
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Fri, Jan 24, 2014 at 1:57 PM, Joris VanRemoortere <
> > jvanremoort...@tagged.com> wrote:
> >
> > > Working on the above.
> > >
> > > Since trunk currently supports topic configs and altering topics, does
> > > it make sense to you to make this a topic config so that it is kept in
> > > sync between topic alterations and partition adds?
> > >
> > > For compatibility: I can make the ZooKeeper state parsing backwards /
> > > forwards compatible. I don't think I can make the readFrom / writeTo
> > > ByteBuffer methods compatible, since chained calls to these during API
> > > parsing are size dependent. Can you clarify exactly what you expect to
> > > be compatible?
> > >
> > > I'm also not very familiar with the consequences of changing the broker
> > > version in the ZooKeeper registration. Can you please provide a
> > > reference to documentation for this?
> > >
> > > Thanks,
> > >
> > > Joris
> > >
> > >
> > > On Fri, Jan 24, 2014 at 9:43 AM, Jun Rao (JIRA) <j...@apache.org> wrote:
> > >
> > > >
> > > >     [
> > > > https://issues.apache.org/jira/browse/KAFKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881193#comment-13881193
> > > >     ]
> > > >
> > > > Jun Rao commented on KAFKA-1215:
> > > > --------------------------------
> > > >
> > > > Thanks for the patch. Looks good overall. Some comments.
> > > >
> > > > 1. KafkaConfig:
> > > > 1.1 We need a config for default max-rack-replication for auto topic
> > > > creation.
> > > > 1.2 rackId: We probably don't want to make this a required property.
> > > > So, perhaps we can default it to 0?
> > > >
> > > > 2. AdminUtils.assignReplicasToBrokers():
> > > > 2.1 Could you add some comments on the rack-aware assignment
> > > > algorithm?
> > > > 2.2 It's a bit weird for this method to take zkclient in the input.
> > > > We probably can pass in a list of Broker objects instead (see the
> > > > sketch after these comments).
> > > >
> > > > 3. Unit tests: I suggest that we leave most existing tests intact by
> > > > keeping the rackId default and add a new test for rack-aware
> > > > assignment.
> > > >
> > > > 4. Compatibility test: It seems that the change to the broker format
> > > > in ZK is backward compatible. Could you double check? For example, an
> > > > old reader (controller, consumer, etc.) should be able to parse a
> > > > broker registered in the new format, and a new reader should be able
> > > > to parse a broker registered in the old format. Also, we probably
> > > > should increase the version in the ZK registration for the broker.
> > > >
> > > > 5. Could you rebase to trunk?
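> > > >
> > > > For 2.1/2.2, a rough sketch of the shape that could take (the case
> > > > class and method below are illustrative only, not the patch's actual
> > > > code):
> > > >
> > > >   // Illustrative greedy assignment: walk the broker list and skip any
> > > >   // broker whose rack already holds maxRackReplication replicas of
> > > >   // this partition (-1 disables the constraint).
> > > >   case class Broker(id: Int, rack: String)
> > > >
> > > >   def assignReplicas(brokers: Seq[Broker], replicationFactor: Int,
> > > >                      maxRackReplication: Int): Seq[Int] = {
> > > >     val chosen = scala.collection.mutable.ArrayBuffer[Broker]()
> > > >     for (b <- brokers if chosen.size < replicationFactor) {
> > > >       val onSameRack = chosen.count(_.rack == b.rack)
> > > >       if (maxRackReplication < 0 || onSameRack < maxRackReplication)
> > > >         chosen += b
> > > >     }
> > > >     chosen.map(_.id).toList
> > > >   }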
> > > >
> > > > > Rack-Aware replica assignment option
> > > > > ------------------------------------
> > > > >
> > > > >                 Key: KAFKA-1215
> > > > >                 URL: https://issues.apache.org/jira/browse/KAFKA-1215
> > > > >             Project: Kafka
> > > > >          Issue Type: Improvement
> > > > >          Components: replication
> > > > >    Affects Versions: 0.8.0, 0.8.1
> > > > >            Reporter: Joris Van Remoortere
> > > > >            Assignee: Neha Narkhede
> > > > >             Fix For: 0.8.0, 0.8.1
> > > > >
> > > > >         Attachments: rack_aware_replica_assignment_v1.patch
> > > > >
> > > > >
> > > > > Adding a rack-id to the Kafka config. This rack-id can be used
> > > > during replica assignment via the max-rack-replication argument in
> > > > the admin scripts (create topic, etc.). By default the original
> > > > replica assignment algorithm is used, because max-rack-replication
> > > > defaults to -1. max-rack-replication > -1 is not honored if you are
> > > > doing manual replica assignment (preferred).
> > > > > If this looks good I can add some test cases specific to the
> > > > rack-aware assignment.
> > > > > I can also port this to trunk. We are currently running 0.8.0 in
> > > > production and need this, so I wrote the patch against that.
> > > >
> > > >
> > > >
> > > > --
> > > > This message was sent by Atlassian JIRA
> > > > (v6.1.5#6160)
> > > >
> > >
> >
>
