Jun,

1. Hm, it looks like I didn't take this case into account in the KIP. I see
your point. Why don't we do the same thing as with partition reassignment -
let's set up a new listener (or reuse ReassignedPartitionsIsrChangeListener)
that checks whether the brokers that requested a partition restart have
caught up (i.e. are back in the ISR) and updates the zk node
/restart_partitions to remove the replicas that no longer need a restart.
This should be done instead of step 4) (Controller deletes the zk node).
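Roughly something like this (just a sketch against the plain ZooKeeper
client; the class name, the payload format of /restart_partitions and the
crude ISR check are made-up placeholders, not actual controller code):

import org.apache.zookeeper.ZooKeeper

// Sketch only: RestartRequest, the /restart_partitions payload and the crude
// ISR check below are illustrative assumptions, not Kafka controller internals.
case class RestartRequest(topic: String, partition: Int, brokerId: Int)

class RestartedPartitionsPruner(zk: ZooKeeper, restartPath: String = "/restart_partitions") {

  // Placeholder check: a real version would parse the JSON "isr" array from
  // /brokers/topics/<topic>/partitions/<partition>/state.
  private def isBackInIsr(req: RestartRequest): Boolean = {
    val statePath = s"/brokers/topics/${req.topic}/partitions/${req.partition}/state"
    val state = new String(zk.getData(statePath, false, null), "UTF-8")
    state.contains(req.brokerId.toString) // crude string check, sketch only
  }

  // To be called on ISR changes (analogous to ReassignedPartitionsIsrChangeListener):
  // drop the entries whose broker has caught up, delete the node once it's empty.
  def prune(pending: Seq[RestartRequest]): Unit = {
    val stillPending = pending.filterNot(isBackInIsr)
    if (stillPending.isEmpty)
      zk.delete(restartPath, -1)
    else
      zk.setData(restartPath, encode(stillPending), -1)
  }

  // Assumed encoding of /restart_partitions; the KIP would define the real format.
  private def encode(reqs: Seq[RestartRequest]): Array[Byte] =
    reqs.map(r => s"${r.topic}:${r.partition}:${r.brokerId}").mkString(",").getBytes("UTF-8")
}

Such a listener would be registered on ISR changes the same way
ReassignedPartitionsIsrChangeListener is today, and would call prune() with
whatever entries are currently stored under /restart_partitions.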
2. No, the intent actually _is_ to make replicas auto-repair. There are two
parts to it: 1) catch IO exceptions so that the whole broker doesn't crash;
2) request a partition restart on IO errors - re-fetch the lost partitions
through the mechanism described in the KIP. At least, I believe, this is our
goal, otherwise there are no real benefits. (A rough sketch of what the
broker-side handling could look like is at the very end of this mail, below
the quoted text.)

Thanks,
Andrii Biletskyi

On Sat, Apr 11, 2015 at 2:20 AM, Jun Rao <j...@confluent.io> wrote:

> Andrii,
>
> 1. I was wondering what happens if the controller fails over after step 4).
> Since the ZK node is gone, how does the controller know those replicas
> failed due to disk failures? Otherwise, the controller will assume those
> replicas are alive again.
>
> 2. Just to clarify. In the proposal, those failed replicas will not be
> auto-repaired and those affected partitions will just be running in
> under-replicated mode, right? To repair the failed replicas, the admin
> still needs to stop the broker?
>
> Thanks,
>
> Jun
>
>
> On Fri, Apr 10, 2015 at 10:29 AM, Andrii Biletskyi <
> andrii.bilets...@stealth.ly> wrote:
>
> > Todd, Jun,
> >
> > Thanks for the comments.
> >
> > I agree we might want to change the "fair" on-disk partition assignment
> > in the scope of these changes. I'm open to suggestions; I didn't bring
> > it up here because of the facts that Todd mentioned - there is still no
> > clear understanding of who should be responsible for the assignment -
> > the broker or the controller.
> >
> > 1. Yes, the way the broker initiates a partition restart should be
> > discussed. But I don't understand the problem with controller failover.
> > The intended workflow is the following:
> > 0) On error the broker removes the partitions from ReplicaManager and
> > LogManager
> > 1) Broker creates the zk node
> > 2) Controller picks it up and re-generates leaders and followers for
> > the partitions
> > 3) Controller sends new LeaderAndIsr and UpdateMetadata requests to the
> > cluster
> > 4) Controller deletes the zk node
> > Now, if the controller fails between 3) and 4), yes, the controller
> > will send L&I requests twice, but the broker which requested the
> > partition restart will "ignore" the second one because the partition
> > would have been created at that point - while handling the "first" L&I
> > request.
> >
> > 2. The main benefit, from my perspective, is that currently any file IO
> > error means the broker halts and you have to remove the disk and
> > restart the broker; with this KIP, on an IO error we simply reject that
> > single request (or any action during which the file IO error occurred),
> > the broker detects the affected partitions and silently restarts them,
> > while normally handling other requests at the same time (of course, if
> > those are not related to the broken disk).
> >
> > 3. I agree, the lack of tools to perform such operational commands
> > won't let us fully leverage a JBOD architecture. That's why I think we
> > should design it in such a way that implementing such tools is a simple
> > thing to do. But before that it'd be good to understand whether we are
> > on the right path in general.
> >
> > Thanks,
> > Andrii Biletskyi
> >
> > On Fri, Apr 10, 2015 at 6:56 PM, Jun Rao <j...@confluent.io> wrote:
> >
> > > Andrii,
> > >
> > > Thanks for writing up the proposal. A few thoughts on this.
> > >
> > > 1. Your proposal is to have the broker notify the controller about
> > > failed replicas. We need to think through this a bit more. The
> > > controller may fail later. During the controller failover, it needs
> > > to be able to detect those failed replicas again. Otherwise, it may
> > > revert some of the decisions that it has made earlier. In the current
> > > proposal, it seems that the info about the failed replicas will be
> > > lost during controller failover?
> > >
> > > 2. Overall, it's not very clear to me what benefit this proposal
> > > provides. The proposal seems to detect failed disks and then just
> > > mark the associated replicas as offline. How do we bring those
> > > replicas online again? Do we have to stop the broker and either fix
> > > the failed disk or remove it from the configured log dirs? If so,
> > > there will still be a downtime for the broker. The changes in the
> > > proposal are non-trivial. So, we need to be certain that we get some
> > > significant benefits.
> > >
> > > 3. As Todd pointed out, it will be worth thinking through other
> > > issues related to JBOD.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Apr 9, 2015 at 5:36 AM, Andrii Biletskyi <
> > > andrii.bilets...@stealth.ly> wrote:
> > >
> > > > Hi,
> > > >
> > > > Let me start a discussion thread for KIP-18 - JBOD Support.
> > > >
> > > > Link to the wiki:
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-18+-+JBOD+Support
> > > >
> > > > Thanks,
> > > > Andrii Biletskyi
> > > >
> > >
> >
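
As promised above, here is a rough sketch of the broker-side handling from
point 2 (the trait below stands in for the real ReplicaManager/LogManager
calls, and the zk payload format is made up for illustration - none of this
is the final KIP-18 design):

import java.io.IOException
import org.apache.zookeeper.{CreateMode, KeeperException, ZooDefs, ZooKeeper}

// Sketch only: LocalPartitionOps is a placeholder for the real broker internals.
trait LocalPartitionOps {
  def partitionsOnSameDisk(topic: String, partition: Int): Seq[(String, Int)]
  def removeLocally(topic: String, partition: Int): Unit // drop from replica/log managers
}

class IoErrorHandler(zk: ZooKeeper, local: LocalPartitionOps, brokerId: Int,
                     restartPath: String = "/restart_partitions") {

  // Wrap a per-partition operation (append, fetch, flush, ...): instead of halting
  // the broker on an IOException, fail just this operation and request a restart of
  // the partitions that live on the broken disk.
  def protect[T](topic: String, partition: Int)(op: => T): Option[T] =
    try Some(op)
    catch {
      case _: IOException =>
        val affected = local.partitionsOnSameDisk(topic, partition)
        affected.foreach { case (t, p) => local.removeLocally(t, p) } // step 0) of the workflow
        requestRestart(affected)                                      // step 1) of the workflow
        None                                                          // this request fails, the broker lives on
    }

  // Record the affected partitions under /restart_partitions for the controller to pick up.
  private def requestRestart(partitions: Seq[(String, Int)]): Unit = {
    val payload = partitions.map { case (t, p) => s"$t:$p:$brokerId" }.mkString(",").getBytes("UTF-8")
    try {
      zk.create(restartPath, payload, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
    } catch {
      case _: KeeperException.NodeExistsException =>
        // Node already exists (e.g. another disk failed first): a real implementation
        // would merge with the existing payload under a conditional setData instead.
        zk.setData(restartPath, payload, -1)
    }
  }
}

Usage would be roughly ioHandler.protect(topic, partition) { log.append(messages) }
around any per-partition disk operation (log.append here is just a stand-in), so a
bad disk fails individual requests instead of halting the whole broker.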