Hey Guozhang, Thanks for the review! Yes we have considered this approach and briefly explained why we don't do it in the rejected alternative section. Here is my concern with this approach in more detail:
- This approach introduces tight coupling between kafka's logical leader election with broker's local file OS config. My intuition is that this tight coupling may make future development a bit harder and we should try to avoid that. Note that we only use logical information (e.g. partition, broker id) in the zookeeper and controller as of now. - Encoding log directory in the replica identifier requires much more change in the code. In addition to changing znode data format in zookeeper, we probably need to update every protocol that touches replica id, such as StopReplicaRequest, ListOffsetRequest, LeaderAndIsrResponse and so on. Many Java classes need to be changes as well to recognize log directory in replica identifier. Arguably it is still possible to use broker id without log directory to identify replica in some protocols and Java classes under the assumption that no two replicas of the same partition can reside on the same broker. But we need to think carefully for each protocol and Java class and the result may be error prone and controversial. For simplicity of the discussion and code review, I prefer to only do this if there is strong benefit of this design. - Current approach in the KIP make it easier to move replicas between replicas on the same broker because that operation can be completely hidden from controller and other brokers. On the other hand, if we were to move replica between disk in the suggested approach, broker needs to write to some notification zookeeper path after movement is completed so that broker can send LeaderAndIsrRequest to get the new replica identifier, update it cache and write to znode /brokers/topics/[topic]/partitions/[partitionId]/ state. Dong On Sun, Jan 22, 2017 at 10:50 AM, Guozhang Wang <wangg...@gmail.com> wrote: > Hello Dong, > > Thanks for the very well written KIP. I had a general thought on the ZK > path management, wondering if the following alternative would work: > > 1. Bump up versions in "brokers/topics/[topic]" and > "/brokers/topics/[topic]/partitions/[partitionId]/state" > to 2, in which the replica id is no longer an int but a string. > > 2. Bump up versions in "/brokers/ids/[brokerId]" to add another field: > > { "fields": > [ {"name": "version", "type": "int", "doc": "version id"}, > {"name": "host", "type": "string", "doc": "ip address or host name of > the broker"}, > {"name": "port", "type": "int", "doc": "port of the broker"}, > {"name": "jmx_port", "type": "int", "doc": "port for jmx"} > {"name": "log_dirs", > "type": {"type": "array", > "items": "int", > "doc": "an array of the id of the log dirs in broker"} > }, > ] > } > > 3. The replica id can now either be an string-typed integer indicating that > all partitions on the broker still treated as failed or not as a whole, > i.e. no support needed for JBOD; or be a string typed "[brokerID]-[dirID]", > in which brokers / controllers can still parse to determine which broker is > hosting this replica: in this case the management of replicas is finer > grained, no longer at the broker level (i.e. if broker dies all replicas go > offline) but broker-dir level. > > 4. When broker had one of the dir failed, it can modify its " > /brokers/ids/[brokerId]" registry and remove the dir id, controller already > listening on this path can then be notified and run the replica assignment > accordingly where replica id is computed as above. > > > By doing this controller can also naturally reassign replicas between dirs > within the same broker. > > > Guozhang > > > On Thu, Jan 12, 2017 at 6:25 PM, Ismael Juma <ism...@juma.me.uk> wrote: > > > Thanks for the KIP. Just wanted to quickly say that it's great to see > > proposals for improving JBOD (KIP-113 too). More feedback soon, > hopefully. > > > > Ismael > > > > On Thu, Jan 12, 2017 at 6:46 PM, Dong Lin <lindon...@gmail.com> wrote: > > > > > Hi all, > > > > > > We created KIP-112: Handle disk failure for JBOD. Please find the KIP > > wiki > > > in the link https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > 112%3A+Handle+disk+failure+for+JBOD. > > > > > > This KIP is related to KIP-113 > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > 113%3A+Support+replicas+movement+between+log+directories>: > > > Support replicas movement between log directories. They are needed in > > order > > > to support JBOD in Kafka. Please help review the KIP. You feedback is > > > appreciated! > > > > > > Thanks, > > > Dong > > > > > > > > > -- > -- Guozhang >