Thanks for the detailed explanations, Dong. That makes sense to me.

Guozhang

On Sun, Jan 22, 2017 at 4:00 PM, Dong Lin <lindon...@gmail.com> wrote:

> Hey Guozhang,
>
> Thanks for the review! Yes, we have considered this approach and briefly
> explained why we don't do it in the rejected alternatives section. Here are
> my concerns with this approach in more detail:
>
> - This approach introduces tight coupling between Kafka's logical leader
> election and the broker's local file/OS config. My intuition is that this
> tight coupling may make future development a bit harder, and we should try
> to avoid it. Note that we only use logical information (e.g. partition,
> broker id) in ZooKeeper and the controller as of now.
>
> - Encoding the log directory in the replica identifier requires many more
> changes in the code. In addition to changing the znode data format in
> ZooKeeper, we probably need to update every protocol that touches replica
> id, such as StopReplicaRequest, ListOffsetRequest, LeaderAndIsrResponse and
> so on. Many Java classes need to be changed as well to recognize the log
> directory in the replica identifier. Arguably it is still possible to use
> broker id without log directory to identify a replica in some protocols and
> Java classes, under the assumption that no two replicas of the same
> partition can reside on the same broker. But we need to think carefully
> about each protocol and Java class, and the result may be error-prone and
> controversial. For simplicity of the discussion and code review, I prefer
> to only do this if there is a strong benefit to this design.
>
> - The current approach in the KIP makes it easier to move replicas between
> log directories on the same broker because that operation can be completely
> hidden from the controller and other brokers. On the other hand, if we were
> to move a replica between disks in the suggested approach, the broker would
> need to write to some notification ZooKeeper path after the movement is
> completed so that the broker can send LeaderAndIsrRequest to get the new
> replica identifier, update its cache and write to the znode
> /brokers/topics/[topic]/partitions/[partitionId]/state.
>
> Dong
>
>
> On Sun, Jan 22, 2017 at 10:50 AM, Guozhang Wang <wangg...@gmail.com>
> wrote:
>
> > Hello Dong,
> >
> > Thanks for the very well written KIP. I had a general thought on the ZK
> > path management, wondering if the following alternative would work:
> >
> > 1. Bump up versions in "/brokers/topics/[topic]" and
> > "/brokers/topics/[topic]/partitions/[partitionId]/state"
> > to 2, in which the replica id is no longer an int but a string.
> >
> > 2. Bump up versions in "/brokers/ids/[brokerId]" to add another field:
> >
> > { "fields":
> >     [ {"name": "version", "type": "int", "doc": "version id"},
> >       {"name": "host", "type": "string",
> >        "doc": "ip address or host name of the broker"},
> >       {"name": "port", "type": "int", "doc": "port of the broker"},
> >       {"name": "jmx_port", "type": "int", "doc": "port for jmx"},
> >       {"name": "log_dirs",
> >        "type": {"type": "array",
> >                 "items": "int",
> >                 "doc": "an array of the id of the log dirs in broker"}
> >       }
> >     ]
> > }
> >
> > 3. The replica id can now either be a string-typed integer, indicating
> > that all partitions on the broker are still treated as failed or not as a
> > whole, i.e. no support needed for JBOD; or be a string typed
> > "[brokerID]-[dirID]", which brokers / controllers can still parse to
> > determine which broker is hosting this replica: in this case the
> > management of replicas is finer grained, no longer at the broker level
> > (i.e. if broker dies all replicas go offline) but at the broker-dir level.
> >
> > 4. When one of a broker's dirs fails, it can modify its
> > "/brokers/ids/[brokerId]" registry and remove the dir id; the controller,
> > already listening on this path, can then be notified and run the replica
> > assignment accordingly, where the replica id is computed as above.
> >
> >
> > By doing this, the controller can also naturally reassign replicas between
> > dirs within the same broker.
> >
> >
> > Guozhang
> >
> >
> > On Thu, Jan 12, 2017 at 6:25 PM, Ismael Juma <ism...@juma.me.uk> wrote:
> >
> > > Thanks for the KIP. Just wanted to quickly say that it's great to see
> > > proposals for improving JBOD (KIP-113 too). More feedback soon, hopefully.
> > >
> > > Ismael
> > >
> > > On Thu, Jan 12, 2017 at 6:46 PM, Dong Lin <lindon...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > We created KIP-112: Handle disk failure for JBOD. Please find the KIP
> > > > wiki at the link
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD.
> > > >
> > > > This KIP is related to KIP-113
> > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories>:
> > > > Support replicas movement between log directories. Both are needed in
> > > > order to support JBOD in Kafka. Please help review the KIP. Your feedback
> > > > is appreciated!
> > > >
> > > > Thanks,
> > > > Dong
> > > >
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-- Guozhang
