Hi Jason, Apologies for the delay in this reply.
Thank you for having having a look at this KIP and sharing your suggestions. > 1. (nit): Instead of "storage id," maybe we could call it "directory id"? > It seems a little clear since each log dir gets a unique id. I agree, "directory id" is a more suitable name. I've updated the KIP accordingly. > 2. Rather than introducing a new RPC to communicate offline directories, > would it be reasonable to add it to BrokerHeartbeat? My thinking is that we > can let broker registration include the complete list of log dirs. The > heartbeat can then be used to communicate online/offline status of each log > dir. Brokers may startup with offline log directories. If brokers only enumerate log directories in the `BrokerRegistration` request the controller will have to wait for the first heartbeat to make a decision about which log directories are usable in each broker. It would be better to avoid having to model this in-between phase, where brokers are registered but the status of each log directory is still undetermined. For this reason, I think `BrokerRegistration` should always enumerate both online and offline log directories. As to whether we should use a dedicated RPC to communicate online/offline status, or extend `BrokerHeartbeat`, I don't think either approach is clearly better over the other. I can see good reasons to extend `BrokerHeartbeat`: - We avoid having an extra RPC with a very narrow and specific functionality. - There is overlap in the handling of the request with the current handling for `BrokerHeartbeat` in that it already a) changes the broker registration, and b) generates leader and ISR updates. When a log directory comes offline, both a) and b) would also need to happen. - Continuously signalling the directory status can end up leading to higher resiliency in some eventual failures scenarios which could cause the one time RPC notification to be missed or incorrectly processed by the controller. But I can also see reasons why we might prefer a separate RPC: - If the broker includes a list of all online/offline log directories in every heartbeat request it creates unnecessary verbosity, it's not a very compact form of communication. We can mitigate this by having the broker send only changes in log directory status, compared with the broker's view of metadata, rather than the full list, but that also means a bit more complexity. - It ties `broker.heartbeat.interval.ms` with the time to recovery from failed log directories. Perhaps this is OK, because it's already tied with the recovery time from failed brokers. - It creates new space for bad requests. A second heartbeat could indicate a previously offline log directory as being online, which is currently not possible, as log directories can only come back online with a broker restart. Perhaps, this might actually be useful in the future, if we ever want to supporting bringing log directories back online without a broker restart. So while I'm not sure it is the necessarily the best approach, yes, I think it would be reasonable to use `BrokerHeartbeat` instead of a new RPC. There is currently no "one-off" RPC to change non heartbeat-related aspects of the broker registration, if there were, I think extending that request would be best. What do you think about introducing an RPC to modify the broker registration, rather than a dedicated RPC to signal offline dirs? Such RPC could become useful later, as/if the broker registration evolves. > 3. I think we have an opportunity to consolidate normal partition > reassignment and log dir reassignment here. Since we are modifying the > `Assignment` to consist of both replica id and storage id, couldn't we make > a similar change to the `AlterPartitionReassignment` API? Basically that > would let us treat all reassignments as moving from one log dir to another. > The case handled by `AlterReplicaLogDir` currently would just become a > special case where the replica id happens to be the same. This might also > let us get rid of all the custom logic surrounding JBOD, such as the > additional fetcher, which has proven difficult to maintain. I really like this idea. It's not a well known feature but KIP-113 introduced support for moving partition replicas across brokers into specific log directories, by combining requests of `AlterReplicaLogDir` and `AlterPartitionReassignment` and having the broker remember the target directory as a preferred log dir. [1][2][3] But the usability isn't great, as it involves multiple requests and expecting errors as part of a successful operation. It would be really nice to extend `AlterPartitionReassignment` to support specifying log directories and perhaps deprecate `AlterReplicaLogDir`. So it would be great to consolidate these operations. We'd have to decide whether to break the current behavior - where a partition replica can be moved to a different log directory while the broker is offline without issues - or find a way to distinguish when log directory assignment indicated in the metadata can be corrected (because the partition was moved by an operator while the broker was offline) vs when it must be followed by the broker (because a user called `AlterPartitionReassignment`). Either way, my inclination is that this work is better suited for a follow-up KIP, as the scope for this one already feels quite large. Let me know what you think. Thanks, -- Igor [1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories [2] https://github.com/apache/kafka/commit/adefc8ea076354e07839f0319fee1fba52343b91#diff-66379edf0ca0d6a674194b6613f1b4cd5cac0c8eee366e67ada68975d43b3387 [3] https://lists.apache.org/thread/h0tb3hgrvwl36xvoj2nzb096of25v6mm