Jun,

2. Yes, you are right. The idea is that different exceptions and
different error causes may require different actions. That's why,
instead of "generic" error handling logic, the proposal is to implement
a separate component - an ExceptionHandler - that encapsulates the
logic to take the right action depending on the exception. In your case
an IOException("No space left on device") will be thrown. We can
actually detect that the disk is full (match the exception message or
check free space programmatically) and take the right action. As you
proposed, this can be "bring the broker down" - okay, let's implement
it this way.
If there are cases where you want to react differently to this error -
no problem, let's add a config flag for it (e.g.
shutdownOnOutOfDiskSpace).
That's the general idea.
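
As a rough sketch (all names here are tentative, this is not actual
Kafka code, just to illustrate the idea):

import java.io.IOException

// Possible actions; the set can grow as we discover new error causes.
sealed trait ErrorAction
case object ShutdownBroker extends ErrorAction
case object RestartPartitions extends ErrorAction

class ExceptionHandler(shutdownOnOutOfDiskSpace: Boolean) {

  // Pick the action depending on the exception and its cause.
  def actionFor(e: Throwable): ErrorAction = e match {
    case io: IOException if isOutOfSpace(io) =>
      if (shutdownOnOutOfDiskSpace) ShutdownBroker else RestartPartitions
    case _: IOException => RestartPartitions
    case _              => ShutdownBroker // unknown cause - be conservative
  }

  // Match the exception message; alternatively we could check free
  // space programmatically via java.io.File#getUsableSpace.
  private def isOutOfSpace(e: IOException): Boolean =
    Option(e.getMessage).exists(_.contains("No space left on device"))
}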
Also, I would try to implement this KIP so that an admin could
dynamically add/remove disks (in that case we wouldn't need restarts at
all when running out of space), but that's the second step, I believe.

1. Coming back to this item. Will my solution work? If it's not clear,
I can add details.
It's good you brought up this situation. I think different edge cases
may affect the KIP design, so I'd rather discuss them at an early
stage.

Thanks,
Andrii Biletskyi

On Mon, Apr 13, 2015 at 6:16 AM, Jun Rao <j...@confluent.io> wrote:

> Andrii,
>
> 2. So the idea is to immediately start replicating those replicas on the
> failed directory to the other directories? An IOException can be caused by
> a disk running out of space. In this case, perhaps an admin may want to
> bring down the broker, free up some disk space and restart the broker? This
> introduces less data movement.
>
> Thanks,
>
> Jun
>
> On Sun, Apr 12, 2015 at 2:59 PM, Andrii Biletskyi <
> andrii.bilets...@stealth.ly> wrote:
>
> > Jun
> >
> > 1. Hmm, it looks like I didn't take this case into account in the
> > KIP. I see your point. Why don't we do the same thing as with
> > partition reassignment - let's set up a new listener (or reuse
> > ReassignedPartitionsIsrChangeListener) that checks whether the
> > replicas that requested partition restart have caught up (are back
> > in ISR) and updates the zk node /restart_partitions to remove the
> > irrelevant replicas. This would replace step 4) - "Controller
> > deletes zk node".
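> >
> > Roughly, the listener logic could look like this (a tentative
> > sketch, not actual Kafka code - the zk helpers are assumed):
> >
> > // Minimal zk helpers assumed for the sketch.
> > trait Zk {
> >   def restartPartitions(): Map[String, Set[Int]] // partition -> replicas
> >   def writeRestartPartitions(p: Map[String, Set[Int]]): Unit
> >   def isr(partition: String): Set[Int]
> > }
> >
> > class RestartPartitionsIsrListener(zk: Zk) {
> >   // Invoked on ISR change events while /restart_partitions is non-empty.
> >   def onIsrChange(): Unit = {
> >     val pending = zk.restartPartitions()
> >     // A replica that is back in ISR has caught up - drop it.
> >     val stillPending =
> >       pending.map { case (p, replicas) => p -> (replicas -- zk.isr(p)) }
> >              .filter { case (_, replicas) => replicas.nonEmpty }
> >     zk.writeRestartPartitions(stillPending)
> >   }
> > }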
> >
> > 2. No, the intent _is_ actually to make replicas auto-repaired.
> > There are two parts to it: 1) catch IO exceptions, so that the whole
> > broker doesn't crash; 2) request partition restart on IO errors -
> > re-fetch the lost partitions through the mechanism described in the
> > KIP.
> > At least, I believe, this is our goal - otherwise there are no
> > benefits.
> >
> > Thanks,
> > Andrii Biletskyi
> >
> >
> > On Sat, Apr 11, 2015 at 2:20 AM, Jun Rao <j...@confluent.io> wrote:
> >
> > > Andrii,
> > >
> > > 1. I was wondering what happens if the controller fails over
> > > after step 4). Since the ZK node is gone, how does the controller
> > > know which replicas failed due to disk failures? Otherwise, the
> > > controller will assume those replicas are alive again.
> > >
> > > 2. Just to clarify. In the proposal, those failed replicas will
> > > not be auto-repaired and the affected partitions will just be
> > > running in under-replicated mode, right? To repair the failed
> > > replicas, the admin still needs to stop the broker?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > >
> > > On Fri, Apr 10, 2015 at 10:29 AM, Andrii Biletskyi <
> > > andrii.bilets...@stealth.ly> wrote:
> > >
> > > > Todd, Jun,
> > > >
> > > > Thanks for the comments.
> > > >
> > > > I agree we might want to change the "fair" on-disk partition
> > > > assignment in the scope of these changes. I'm open to
> > > > suggestions; I didn't bring it up here because of the point Todd
> > > > mentioned - there is still no clear understanding of who should
> > > > be responsible for the assignment - the broker or the controller.
> > > >
> > > > 1. Yes, the way the broker initiates partition restart should
> > > > be discussed. But I don't understand the problem with controller
> > > > failover. The intended workflow is the following:
> > > > 0) On error, the broker removes the partitions from
> > > > ReplicaManager and LogManager
> > > > 1) The broker creates the zk node
> > > > 2) The controller picks it up and re-generates leaders and
> > > > followers for the partitions
> > > > 3) The controller sends new LeaderAndIsr and UpdateMetadata
> > > > requests to the cluster
> > > > 4) The controller deletes the zk node
> > > > Now, if the controller fails between 3) and 4), yes, the new
> > > > controller will send the LeaderAndIsr requests twice, but the
> > > > broker that requested partition restart will "ignore" the second
> > > > one, because the partition will already have been created at that
> > > > point - while handling the "first" LeaderAndIsr request.
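> > > >
> > > > A tentative sketch of that broker-side idempotency (made-up
> > > > names, not actual Kafka code):
> > > >
> > > > import scala.collection.mutable
> > > >
> > > > class ReplicaManagerSketch {
> > > >   private val partitions = mutable.Map.empty[String, AnyRef]
> > > >
> > > >   // Handling a LeaderAndIsr request for a single partition.
> > > >   def becomeLeaderOrFollower(partition: String): Unit = {
> > > >     if (!partitions.contains(partition))
> > > >       partitions(partition) = createReplica(partition)
> > > >     // else: the partition was already re-created while handling
> > > >     // the first LeaderAndIsr request, so the duplicate sent by
> > > >     // the new controller after failover is effectively a no-op.
> > > >   }
> > > >
> > > >   private def createReplica(p: String): AnyRef = new Object // placeholder
> > > > }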
> > > >
> > > > 2. The main benefit, from my perspective: currently any file IO
> > > > error means the broker halts, and you have to remove the disk
> > > > and restart the broker. With this KIP, on an IO error we simply
> > > > reject that single request (or whatever action the file IO error
> > > > occurred during), the broker detects the affected partitions and
> > > > silently restarts them, handling other requests normally at the
> > > > same time (of course, as long as those don't touch the broken
> > > > disk).
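> > > >
> > > > On the request handling side this could look roughly like the
> > > > following (again, a sketch with made-up names):
> > > >
> > > > import java.io.IOException
> > > >
> > > > class LogAppendSketch {
> > > >   // Append for a single produce request; an IO error fails only
> > > >   // this request instead of halting the whole broker.
> > > >   def append(partition: String, bytes: Array[Byte]): Either[String, Long] =
> > > >     try Right(doAppend(partition, bytes))
> > > >     catch {
> > > >       case e: IOException =>
> > > >         // Find the partitions on the broken disk and request
> > > >         // their restart via the zk mechanism from the KIP.
> > > >         requestPartitionRestart(partitionsOnSameDisk(partition))
> > > >         Left("Storage error: " + e.getMessage)
> > > >     }
> > > >
> > > >   // Placeholders standing in for the real broker logic.
> > > >   private def doAppend(p: String, b: Array[Byte]): Long = 0L
> > > >   private def partitionsOnSameDisk(p: String): Set[String] = Set(p)
> > > >   private def requestPartitionRestart(ps: Set[String]): Unit = ()
> > > > }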
> > > >
> > > > 3. I agree - without tools to perform such operational commands
> > > > we won't be able to fully leverage a JBOD architecture. That's
> > > > why I think we should design this in such a way that implementing
> > > > those tools is simple. But before that, it'd be good to
> > > > understand whether we are on the right path in general.
> > > >
> > > > Thanks,
> > > > Andrii Biletskyi
> > > >
> > > > On Fri, Apr 10, 2015 at 6:56 PM, Jun Rao <j...@confluent.io> wrote:
> > > >
> > > > > Andrii,
> > > > >
> > > > > Thanks for writing up the proposal. A few thoughts on this.
> > > > >
> > > > > 1. Your proposal is to have the broker notify the controller
> > > > > about failed replicas. We need to think through this a bit
> > > > > more. The controller may fail later. During the controller
> > > > > failover, it needs to be able to detect those failed replicas
> > > > > again. Otherwise, it may revert some of the decisions that it
> > > > > has made earlier. In the current proposal, it seems that the
> > > > > info about the failed replicas will be lost during controller
> > > > > failover?
> > > > >
> > > > > 2. Overall, it's not very clear to me what benefit this
> > > > > proposal provides. The proposal seems to detect failed disks
> > > > > and then just mark the associated replicas as offline. How do
> > > > > we bring those replicas online again? Do we have to stop the
> > > > > broker and either fix the failed disk or remove it from the
> > > > > configured log dirs? If so, there will still be a down time of
> > > > > the broker. The changes in the proposal are non-trivial, so we
> > > > > need to be certain that we get some significant benefits.
> > > > >
> > > > > 3. As Todd pointed out, it will be worth thinking through
> > > > > other issues related to JBOD.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, Apr 9, 2015 at 5:36 AM, Andrii Biletskyi <
> > > > > andrii.bilets...@stealth.ly> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Let me start the discussion thread for KIP-18 - JBOD Support.
> > > > > >
> > > > > > Link to wiki:
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-18+-+JBOD+Support
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Andrii Biletskyi
> > > > > >
> > > > >
> > > >
> > >
> >
>
