Thanks for the detailed explanations Dong. That makes sense to me.
Guozhang On Sun, Jan 22, 2017 at 4:00 PM, Dong Lin <lindon...@gmail.com> wrote: > Hey Guozhang, > > Thanks for the review! Yes we have considered this approach and briefly > explained why we don't do it in the rejected alternative section. Here is > my concern with this approach in more detail: > > - This approach introduces tight coupling between kafka's logical leader > election with broker's local file OS config. My intuition is that this > tight coupling may make future development a bit harder and we should try > to avoid that. Note that we only use logical information (e.g. partition, > broker id) in the zookeeper and controller as of now. > > - Encoding log directory in the replica identifier requires much more > change in the code. In addition to changing znode data format in zookeeper, > we probably need to update every protocol that touches replica id, such as > StopReplicaRequest, ListOffsetRequest, LeaderAndIsrResponse and so on. Many > Java classes need to be changes as well to recognize log directory in > replica identifier. Arguably it is still possible to use broker id without > log directory to identify replica in some protocols and Java classes under > the assumption that no two replicas of the same partition can reside on the > same broker. But we need to think carefully for each protocol and Java > class and the result may be error prone and controversial. For simplicity > of the discussion and code review, I prefer to only do this if there is > strong benefit of this design. > > - Current approach in the KIP make it easier to move replicas between > replicas on the same broker because that operation can be completely hidden > from controller and other brokers. On the other hand, if we were to move > replica between disk in the suggested approach, broker needs to write to > some notification zookeeper path after movement is completed so that broker > can send LeaderAndIsrRequest to get the new replica identifier, update it > cache and write to znode /brokers/topics/[topic]/partitions/[partitionId]/ > state. > > Dong > > > On Sun, Jan 22, 2017 at 10:50 AM, Guozhang Wang <wangg...@gmail.com> > wrote: > > > Hello Dong, > > > > Thanks for the very well written KIP. I had a general thought on the ZK > > path management, wondering if the following alternative would work: > > > > 1. Bump up versions in "brokers/topics/[topic]" and > > "/brokers/topics/[topic]/partitions/[partitionId]/state" > > to 2, in which the replica id is no longer an int but a string. > > > > 2. Bump up versions in "/brokers/ids/[brokerId]" to add another field: > > > > { "fields": > > [ {"name": "version", "type": "int", "doc": "version id"}, > > {"name": "host", "type": "string", "doc": "ip address or host name > of > > the broker"}, > > {"name": "port", "type": "int", "doc": "port of the broker"}, > > {"name": "jmx_port", "type": "int", "doc": "port for jmx"} > > {"name": "log_dirs", > > "type": {"type": "array", > > "items": "int", > > "doc": "an array of the id of the log dirs in broker"} > > }, > > ] > > } > > > > 3. The replica id can now either be an string-typed integer indicating > that > > all partitions on the broker still treated as failed or not as a whole, > > i.e. no support needed for JBOD; or be a string typed > "[brokerID]-[dirID]", > > in which brokers / controllers can still parse to determine which broker > is > > hosting this replica: in this case the management of replicas is finer > > grained, no longer at the broker level (i.e. if broker dies all replicas > go > > offline) but broker-dir level. > > > > 4. When broker had one of the dir failed, it can modify its " > > /brokers/ids/[brokerId]" registry and remove the dir id, controller > already > > listening on this path can then be notified and run the replica > assignment > > accordingly where replica id is computed as above. > > > > > > By doing this controller can also naturally reassign replicas between > dirs > > within the same broker. > > > > > > Guozhang > > > > > > On Thu, Jan 12, 2017 at 6:25 PM, Ismael Juma <ism...@juma.me.uk> wrote: > > > > > Thanks for the KIP. Just wanted to quickly say that it's great to see > > > proposals for improving JBOD (KIP-113 too). More feedback soon, > > hopefully. > > > > > > Ismael > > > > > > On Thu, Jan 12, 2017 at 6:46 PM, Dong Lin <lindon...@gmail.com> wrote: > > > > > > > Hi all, > > > > > > > > We created KIP-112: Handle disk failure for JBOD. Please find the KIP > > > wiki > > > > in the link https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > > 112%3A+Handle+disk+failure+for+JBOD. > > > > > > > > This KIP is related to KIP-113 > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > > 113%3A+Support+replicas+movement+between+log+directories>: > > > > Support replicas movement between log directories. They are needed in > > > order > > > > to support JBOD in Kafka. Please help review the KIP. You feedback is > > > > appreciated! > > > > > > > > Thanks, > > > > Dong > > > > > > > > > > > > > > > -- > > -- Guozhang > > > -- -- Guozhang