Hi, Dong,

Ok. We can keep LeaderAndIsrRequest as it is in the wiki.
Since we need both KIP-112 and KIP-113 to make a compelling case for JBOD, perhaps we should discuss KIP-113 before voting for both? I left some comments in the other thread.

Thanks,
Jun

On Wed, Mar 1, 2017 at 1:58 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Do you think it is OK to keep the existing wire protocol in the KIP? I am wondering if we can initiate a vote for this KIP.

Thanks,
Dong

On Tue, Feb 28, 2017 at 2:41 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

I just realized that StopReplicaRequest itself doesn't specify the replicaId in the wire protocol. Thus the controller would need to log the brokerId with the StopReplicaRequest in the log. It may therefore be reasonable for the controller to do the same with LeaderAndIsrRequest and only specify isNewReplica for the broker that receives the LeaderAndIsrRequest.

Thanks,
Dong

On Tue, Feb 28, 2017 at 2:14 PM, Dong Lin <lindon...@gmail.com> wrote:

Hi Jun,

Yeah, there is a tradeoff between the controller's implementation complexity and the wire-protocol complexity. I personally think it is more important to keep the wire protocol concise and only add information to it if necessary. It seems fine to add a little bit of complexity to the controller's implementation, e.g. logging the destination broker per LeaderAndIsrRequest. Becket also shares this opinion with me. Is the only purpose of doing so to make the controller log simpler?

And certainly, I have added Todd's comment in the wiki.

Thanks,
Dong

On Tue, Feb 28, 2017 at 1:37 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

52. What you suggested would work. However, I am thinking that it's probably simpler to just set isNewReplica at the replica level. That way, the LeaderAndIsrRequest can be created a bit more simply. When reading a LeaderAndIsrRequest in the controller log, it's easier to see which replicas are new without looking at which broker the request is intended for.

Could you also add those additional points from Todd on 1 broker per disk vs. JBOD vs. RAID5/6 to the KIP?

Thanks,

Hi, Todd,

Thanks for the feedback. That's very useful.

Jun

On Tue, Feb 28, 2017 at 10:25 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Certainly, I have asked Todd to reply to the thread. And I have updated the item in the wiki.

50. The full statement is "Broker assumes a log directory to be good after it starts, and mark log directory as bad once there is IOException when broker attempts to access (i.e. read or write) the log directory". This statement seems reasonable, right? If a log directory is actually bad, then the broker will first assume it is OK, try to read logs on this log directory, encounter IOException, and then mark it as bad.

51. My bad. I thought I removed it but I didn't. It is removed now.

52. I don't think so. The isNewReplica field in the LeaderAndIsrRequest is only relevant to the replica (i.e. broker) that receives the LeaderAndIsrRequest. There is no need to specify whether each replica is new inside LeaderAndIsrRequest.
In other words, if the controller sends LeaderAndIsrRequest to three different replicas of a given partition, the isNewReplica field can be different across these three requests.

Yeah, I would definitely want to start discussion on KIP-113 after we have reached agreement on KIP-112. I actually opened the KIP-113 discussion thread on 1/12 together with this thread. I have yet to add the ability to list offline directories in KIP-113, which we discussed in this thread.

Thanks for all your reviews! Is there further concern with the latest KIP?

Thanks!
Dong

On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

RAID6 is an improvement over RAID5 and can tolerate 2 disk failures. Eno's point is that the rebuild of RAID5/RAID6 requires reading more data compared with RAID10, which increases the probability of error during rebuild. This makes sense. In any case, do you think you could ask the SREs at LinkedIn to share their opinions on RAID5/RAID6?

Yes, when a replica is offline due to a bad disk, it makes sense to handle it immediately as if a StopReplicaRequest is received (i.e., the replica is no longer considered a leader and is removed from any replica fetcher thread). Could you add that detail in item 2 in the wiki?

50. The wiki says "Broker assumes a log directory to be good after it starts": a log directory actually could be bad during startup.

51. In item 4, the wiki says "The controller watches the path /log_dir_event_notification for new znode.". This doesn't seem to be needed now?

52. The isNewReplica field in LeaderAndIsrRequest should be for each replica inside the replicas field, right?

Other than those, the current KIP looks good to me. Do you want to start a separate discussion thread on KIP-113? I do have some comments there.

Thanks for working on this!

Jun

On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin <lindon...@gmail.com> wrote:

Hi Jun,

In addition to Eno's reference on why rebuild time with RAID-5 is more expensive, another concern is that RAID-5 will fail if more than one disk fails. JBOD still works with 1+ disk failure and has better performance with one disk failure. These seem like good arguments for using JBOD instead of RAID-5.

If a leader replica goes offline, the broker should first take all actions (i.e. remove the partition from the fetcher thread) as if it has received a StopReplicaRequest for this partition, because the replica can no longer work anyway. It will also respond with an error to any ProduceRequest and FetchRequest for the partition. The broker notifies the controller by writing a notification znode in ZK.
The controller learns the disk failure event from ZK, sends a LeaderAndIsrRequest, and receives a LeaderAndIsrResponse to learn that the replica is offline. The controller will then elect a new leader for this partition and send LeaderAndIsrRequest/MetadataUpdateRequest to the relevant brokers. The broker should stop adjusting the ISR for this partition as if the broker were already offline. I am not sure there is any inconsistency in the broker's behavior when it is leader or follower. Is there any concern with this approach?

Thanks for catching this. I have removed that reference from the KIP.

Hi Eno,

Thank you for providing the reference on RAID-5. At LinkedIn we have 10 disks per Kafka machine. It will not be a show-stopper operationally for LinkedIn if we have to deploy one-broker-per-disk. On the other hand, we previously discussed the advantage of JBOD vs. one-broker-per-disk or one-broker-per-machine. One-broker-per-disk suffers from the problems described in the KIP, and one-broker-per-machine increases the failure rate caused by disk failure by 10X. Since JBOD is strictly better than either of the two, it is also better than one-broker-per-multiple-disk, which is somewhere between one-broker-per-disk and one-broker-per-machine.

I personally think the benefits of the JBOD design are worth the implementation complexity it introduces. I would also argue that it is reasonable for Kafka to manage this low-level detail because Kafka is already exposing and managing the replication factor of its data. But whether the complexity is worthwhile can be subjective and I cannot prove my opinion. I am contributing a significant amount of time to this KIP because Kafka developers at LinkedIn believe it is useful and worth the effort. Yeah, it will be useful to see what everyone else thinks about it.

Thanks,
Dong

On Mon, Feb 27, 2017 at 1:16 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

For RAID5, I am not sure the rebuild cost is a big concern. If a disk fails, typically an admin has to bring down the broker, replace the failed disk with a new one, trigger the RAID rebuild, and bring up the broker. This way, there is no performance impact at runtime due to rebuild. The benefit is that a broker doesn't fail in a hard way when there is a disk failure and can be brought down in a controlled way for maintenance. While the broker is running with a failed disk, reads may be more expensive since they have to be computed from the parity. However, if most reads are from page cache, this may not be a big issue either.
So, it would be useful to do some tests on RAID5 before we completely rule it out.

Regarding whether to remove an offline replica from the fetcher thread immediately: what do we do when a failed replica is a leader? Do we do nothing, or mark the replica as not the leader immediately? Intuitively, it seems it's better if the broker acts consistently on a failed replica whether it's a leader or a follower. For ISR churns, I was just pointing out that if we don't send StopReplicaRequest to a broker to be shut down in a controlled way, then the leader will shrink the ISR, expand it, and shrink it again after the timeout.

The KIP seems to still reference "/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".

Thanks,

Jun

On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks for the suggestion. I think it is a good idea to not put the created flag in ZK and simply specify isNewReplica=true in the LeaderAndIsrRequest if the replica was in the NewReplica state. It will only fail the replica creation in the scenario where the controller fails after topic-creation/partition-reassignment/partition-number-change but before it actually sends out the LeaderAndIsrRequest while there is an ongoing disk failure, which should be pretty rare and acceptable. This should simplify the design of this KIP.

Regarding RAID-5, I think the concern with RAID-5/6 is not just about performance when there is no failure. For example, RAID-5 can support up to one disk failure and it takes time to rebuild the disk after one disk failure. RAID 5 implementations are susceptible to system failures because of trends regarding array rebuild time and the chance of drive failure during rebuild. There is no such performance degradation for JBOD, and JBOD can support multiple log directory failures without reducing the performance of good log directories. Would this be a reasonable reason for using JBOD instead of RAID-5/6?

Previously we discussed whether the broker should remove an offline replica from the replica fetcher thread. I still think it should do so instead of printing a lot of errors in the log4j log. We can still let the controller send StopReplicaRequest to the broker. I am not sure I understand why allowing the broker to remove an offline replica from the fetcher thread will increase churns in ISR.
Do you think this is a concern with this approach?

I have updated the KIP to remove the created flag from ZK and change the field name to isNewReplica. Can you check if there is any issue with the latest KIP? Thanks for your time!

Regards,
Dong

On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

Thanks for the reply.

Personally, I'd prefer not to write the created flag per replica in ZK. Your suggestion of disabling replica creation if there is a bad log directory on the broker could work. The only thing is that it may delay the creation of new replicas. I was thinking that an alternative is to extend LeaderAndIsrRequest by adding an isNewReplica field per replica. That field will be set when a replica is transitioning from the NewReplica state to the Online state. Then, when a broker receives a LeaderAndIsrRequest, if a replica is marked as the new replica, it will be created on a good log directory, if not already present. Otherwise, it only creates the replica if all log directories are good and the replica is not already present. This way, we don't delay the processing of new replicas in the common case.

I am ok with not persisting the offline replicas in ZK and just discovering them through the LeaderAndIsrRequest. It handles the cases when a broker starts up with bad log directories better. So, the additional overhead of rediscovering the offline replicas is justified.

Another high level question. The proposal rejected RAID5/6 since it adds additional I/Os. The main issue with RAID5 is that to write a block that doesn't match the RAID stripe size, we have to first read the old parity to compute the new one, which increases the number of I/Os (http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you have tested RAID5's performance by creating a file system whose block size matches the RAID stripe size (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/). This way, writing a block doesn't require a read first. A large block size may increase the amount of data writes, when the same block has to be written to disk multiple times.
However, this is probably ok in Kafka's use case since we batch the I/O flush already. As you can see, we will be adding some complexity to support JBOD in Kafka one way or another. If we can tune the performance of RAID5 to match that of RAID10, perhaps using RAID5 is a simpler solution.

Thanks,

Jun

On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

I don't think we should allow failed replicas to be re-created on the good disks. Say there are 2 disks and each of them is 51% loaded. If either disk fails and we allow its replicas to be re-created on the other disk, both disks will fail. Alternatively, we can disable replica creation if there is a bad disk on a broker. I personally think it is worth the additional complexity in the broker to store created replicas in ZK, so that we allow new replicas to be created on the broker even when there is a bad log directory. This approach won't add complexity in the controller. But I am fine with disabling replica creation when there is a bad log directory, if that is the only blocking issue for this KIP.

Whether we store created flags is independent of whether/how we store offline replicas. Per our previous discussion, do you think it is OK to not store offline replicas in ZK and to propagate the offline replicas from the broker to the controller via the LeaderAndIsrRequest?

Thanks,
Dong
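To make the disk-failure handling sequence discussed in this thread concrete, here is a minimal Java sketch of the broker-side steps Dong describes: assume a log directory is good at startup, mark it bad on an IOException, act as if a StopReplicaRequest had arrived for the affected replicas, and notify the controller through a ZK notification znode. All class and method names below are hypothetical stand-ins, not Kafka's actual internals.

    import java.io.IOException;
    import java.util.Set;
    import java.util.concurrent.CopyOnWriteArraySet;

    // Hypothetical sketch of the log-directory failure handling discussed above.
    public class LogDirFailureHandler {

        // Stand-ins for the broker components involved; not Kafka's real classes.
        interface ReplicaFetcherManager { void removeFetchersForPartitions(Set<TopicPartition> partitions); }
        interface ControllerNotifier { void writeLogDirEventNotification(int brokerId); }
        record TopicPartition(String topic, int partition) {}

        private final ReplicaFetcherManager fetcherManager;
        private final ControllerNotifier notifier;
        // A log directory is assumed good after startup; it is only added here
        // once an IOException is hit while reading or writing it.
        private final Set<String> offlineLogDirs = new CopyOnWriteArraySet<>();

        public LogDirFailureHandler(ReplicaFetcherManager fetcherManager, ControllerNotifier notifier) {
            this.fetcherManager = fetcherManager;
            this.notifier = notifier;
        }

        public void onIOException(int brokerId, String logDir,
                                  Set<TopicPartition> replicasOnDir, IOException cause) {
            // 1. Mark the log directory as bad (a real broker would also log 'cause').
            offlineLogDirs.add(logDir);

            // 2. Act as if a StopReplicaRequest had been received for the affected
            //    replicas: stop fetching, stop acting as leader, and let later
            //    Produce/Fetch requests for these partitions fail with an error.
            fetcherManager.removeFetchersForPartitions(replicasOnDir);

            // 3. Notify the controller via a notification znode (e.g. under
            //    /log_dir_event_notification). The controller then sends a
            //    LeaderAndIsrRequest, learns from the response which replicas are
            //    offline, and elects new leaders for those partitions.
            notifier.writeLogDirEventNotification(brokerId);
        }

        public boolean isOffline(String logDir) {
            return offlineLogDirs.contains(logDir);
        }
    }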
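Similarly, the replica-creation rule Jun proposes for a broker handling a LeaderAndIsrRequest with a per-replica isNewReplica field can be summarized in a short, self-contained sketch (illustrative names only, not the actual implementation):

    import java.util.List;

    // Hypothetical sketch of the proposed creation rule for a replica named in a
    // LeaderAndIsrRequest; not Kafka's actual code.
    public class ReplicaCreationRule {

        // Returns true if the broker should create the replica locally.
        static boolean shouldCreateReplica(boolean isNewReplica,
                                           boolean replicaAlreadyPresent,
                                           int totalLogDirs,
                                           int offlineLogDirs) {
            if (replicaAlreadyPresent) {
                return false; // already on disk, nothing to create
            }
            if (isNewReplica) {
                // A replica transitioning from NewReplica to Online is created on
                // any remaining good log directory.
                return offlineLogDirs < totalLogDirs;
            }
            // A replica that should already exist but is not found locally may
            // simply live on an offline directory, so it is only created when all
            // log directories are known to be good.
            return offlineLogDirs == 0;
        }

        public static void main(String[] args) {
            // Example: a broker with two log directories, one of which is offline.
            // A brand-new replica is still created on the good directory ...
            System.out.println(shouldCreateReplica(true, false, 2, 1));   // true
            // ... but a missing pre-existing replica is not re-created, since it
            // may be sitting on the offline directory.
            System.out.println(shouldCreateReplica(false, false, 2, 1));  // false
        }
    }

This captures why the proposal avoids delaying new-replica creation in the common case while still refusing to blindly re-create replicas that may live on a failed disk.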