Jun,

1. Hm, it looks like I didn't take this case into account in the KIP. I see
your point. Why don't we do the same thing as with partition reassignment -
let's set up a new listener (or reuse ReassignedPartitionsIsrChangeListener)
that checks whether the brokers that requested a partition restart have
caught up (i.e. are back in the ISR) and updates the zk node
/restart_partitions to remove the replicas that no longer need a restart.
This should be done instead of step 4) (Controller deletes the zk node).
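Roughly something like this (just a sketch against the plain ZooKeeper
client; the class name, the payload format of /restart_partitions and the
crude ISR check are made-up placeholders, not actual controller code):

import org.apache.zookeeper.ZooKeeper

// Sketch only: RestartRequest, the /restart_partitions payload and the crude
// ISR check below are illustrative assumptions, not Kafka controller internals.
case class RestartRequest(topic: String, partition: Int, brokerId: Int)

class RestartedPartitionsPruner(zk: ZooKeeper, restartPath: String = "/restart_partitions") {

  // Placeholder check: a real version would parse the JSON "isr" array from
  // /brokers/topics/<topic>/partitions/<partition>/state.
  private def isBackInIsr(req: RestartRequest): Boolean = {
    val statePath = s"/brokers/topics/${req.topic}/partitions/${req.partition}/state"
    val state = new String(zk.getData(statePath, false, null), "UTF-8")
    state.contains(req.brokerId.toString) // crude string check, sketch only
  }

  // To be called on ISR changes (analogous to ReassignedPartitionsIsrChangeListener):
  // drop the entries whose broker has caught up, delete the node once it's empty.
  def prune(pending: Seq[RestartRequest]): Unit = {
    val stillPending = pending.filterNot(isBackInIsr)
    if (stillPending.isEmpty)
      zk.delete(restartPath, -1)
    else
      zk.setData(restartPath, encode(stillPending), -1)
  }

  // Assumed encoding of /restart_partitions; the KIP would define the real format.
  private def encode(reqs: Seq[RestartRequest]): Array[Byte] =
    reqs.map(r => s"${r.topic}:${r.partition}:${r.brokerId}").mkString(",").getBytes("UTF-8")
}

Such a listener would be registered on ISR changes the same way
ReassignedPartitionsIsrChangeListener is today, and would call prune() with
whatever entries are currently stored under /restart_partitions.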
2. No, the intent actually _is_ to make replicas auto-repair. There are two
parts to it: 1) catch IO exceptions so that the whole broker doesn't crash;
2) request a partition restart on IO errors - re-fetch the lost partitions
through the mechanism described in the KIP. At least, I believe, this is our
goal, otherwise there are no real benefits. (A rough sketch of what the
broker-side handling could look like is at the very end of this mail, below
the quoted text.)

Thanks,
Andrii Biletskyi

On Sat, Apr 11, 2015 at 2:20 AM, Jun Rao <j...@confluent.io> wrote:

> Andrii,
>
> 1. I was wondering what happens if the controller fails over after step 4).
> Since the ZK node is gone, how does the controller know those replicas
> failed due to disk failures? Otherwise, the controller will assume those
> replicas are alive again.
>
> 2. Just to clarify. In the proposal, those failed replicas will not be
> auto-repaired and those affected partitions will just be running in
> under-replicated mode, right? To repair the failed replicas, the admin
> still needs to stop the broker?
>
> Thanks,
>
> Jun
>
>
> On Fri, Apr 10, 2015 at 10:29 AM, Andrii Biletskyi <
> andrii.bilets...@stealth.ly> wrote:
>
> > Todd, Jun,
> >
> > Thanks for the comments.
> >
> > I agree we might want to change the "fair" on-disk partition assignment
> > in the scope of these changes. I'm open to suggestions; I didn't bring
> > it up here because of the facts that Todd mentioned - there is still no
> > clear understanding of who should be responsible for the assignment -
> > the broker or the controller.
> >
> > 1. Yes, the way the broker initiates a partition restart should be
> > discussed. But I don't understand the problem with controller failover.
> > The intended workflow is the following:
> > 0) On error the broker removes the partitions from ReplicaManager and
> > LogManager
> > 1) Broker creates the zk node
> > 2) Controller picks it up and re-generates leaders and followers for
> > the partitions
> > 3) Controller sends new LeaderAndIsr and UpdateMetadata requests to the
> > cluster
> > 4) Controller deletes the zk node
> > Now, if the controller fails between 3) and 4), yes, the controller
> > will send L&I requests twice, but the broker which requested the
> > partition restart will "ignore" the second one because the partition
> > would have been created at that point - while handling the "first" L&I
> > request.
> >
> > 2. The main benefit, from my perspective, is that currently any file IO
> > error means the broker halts and you have to remove the disk and
> > restart the broker; with this KIP, on an IO error we simply reject that
> > single request (or any action during which the file IO error occurred),
> > the broker detects the affected partitions and silently restarts them,
> > while normally handling other requests at the same time (of course, if
> > those are not related to the broken disk).
> >
> > 3. I agree, the lack of tools to perform such operational commands
> > won't let us fully leverage a JBOD architecture. That's why I think we
> > should design it in such a way that implementing such tools is a simple
> > thing to do. But before that it'd be good to understand whether we are
> > on the right path in general.
> >
> > Thanks,
> > Andrii Biletskyi
> >
> > On Fri, Apr 10, 2015 at 6:56 PM, Jun Rao <j...@confluent.io> wrote:
> >
> > > Andrii,
> > >
> > > Thanks for writing up the proposal. A few thoughts on this.
> > >
> > > 1. Your proposal is to have the broker notify the controller about
> > > failed replicas. We need to think through this a bit more. The
> > > controller may fail later. During the controller failover, it needs
> > > to be able to detect those failed replicas again. Otherwise, it may
> > > revert some of the decisions that it has made earlier. In the current
> > > proposal, it seems that the info about the failed replicas will be
> > > lost during controller failover?
> > >
> > > 2. Overall, it's not very clear to me what benefit this proposal
> > > provides. The proposal seems to detect failed disks and then just
> > > mark the associated replicas as offline. How do we bring those
> > > replicas online again? Do we have to stop the broker and either fix
> > > the failed disk or remove it from the configured log dirs? If so,
> > > there will still be a downtime for the broker. The changes in the
> > > proposal are non-trivial. So, we need to be certain that we get some
> > > significant benefits.
> > >
> > > 3. As Todd pointed out, it will be worth thinking through other
> > > issues related to JBOD.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Apr 9, 2015 at 5:36 AM, Andrii Biletskyi <
> > > andrii.bilets...@stealth.ly> wrote:
> > >
> > > > Hi,
> > > >
> > > > Let me start a discussion thread for KIP-18 - JBOD Support.
> > > >
> > > > Link to the wiki:
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-18+-+JBOD+Support
> > > >
> > > > Thanks,
> > > > Andrii Biletskyi
> > > >
> > >
> >
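
As promised above, here is a rough sketch of the broker-side handling from
point 2 (the trait below stands in for the real ReplicaManager/LogManager
calls, and the zk payload format is made up for illustration - none of this
is the final KIP-18 design):

import java.io.IOException
import org.apache.zookeeper.{CreateMode, KeeperException, ZooDefs, ZooKeeper}

// Sketch only: LocalPartitionOps is a placeholder for the real broker internals.
trait LocalPartitionOps {
  def partitionsOnSameDisk(topic: String, partition: Int): Seq[(String, Int)]
  def removeLocally(topic: String, partition: Int): Unit // drop from replica/log managers
}

class IoErrorHandler(zk: ZooKeeper, local: LocalPartitionOps, brokerId: Int,
                     restartPath: String = "/restart_partitions") {

  // Wrap a per-partition operation (append, fetch, flush, ...): instead of halting
  // the broker on an IOException, fail just this operation and request a restart of
  // the partitions that live on the broken disk.
  def protect[T](topic: String, partition: Int)(op: => T): Option[T] =
    try Some(op)
    catch {
      case _: IOException =>
        val affected = local.partitionsOnSameDisk(topic, partition)
        affected.foreach { case (t, p) => local.removeLocally(t, p) } // step 0) of the workflow
        requestRestart(affected)                                      // step 1) of the workflow
        None                                                          // this request fails, the broker lives on
    }

  // Record the affected partitions under /restart_partitions for the controller to pick up.
  private def requestRestart(partitions: Seq[(String, Int)]): Unit = {
    val payload = partitions.map { case (t, p) => s"$t:$p:$brokerId" }.mkString(",").getBytes("UTF-8")
    try {
      zk.create(restartPath, payload, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
    } catch {
      case _: KeeperException.NodeExistsException =>
        // Node already exists (e.g. another disk failed first): a real implementation
        // would merge with the existing payload under a conditional setData instead.
        zk.setData(restartPath, payload, -1)
    }
  }
}

Usage would be roughly ioHandler.protect(topic, partition) { log.append(messages) }
around any per-partition disk operation (log.append here is just a stand-in), so a
bad disk fails individual requests instead of halting the whole broker.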