Hey Jun,

Do you think it is OK to keep the existing wire protocol in the KIP? I am wondering if we can initiate a vote for this KIP.
Thanks,
Dong

On Tue, Feb 28, 2017 at 2:41 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

I just realized that StopReplicaRequest itself doesn't specify the replicaId in the wire protocol, so the controller would need to log the brokerId together with the StopReplicaRequest in its log. It may therefore be reasonable for the controller to do the same with LeaderAndIsrRequest and only specify isNewReplica for the broker that receives the LeaderAndIsrRequest.

Thanks,
Dong

On Tue, Feb 28, 2017 at 2:14 PM, Dong Lin <lindon...@gmail.com> wrote:

Hi Jun,

Yeah, there is a tradeoff between the controller's implementation complexity and the wire-protocol complexity. I personally think it is more important to keep the wire protocol concise and only add information to it when necessary. It seems fine to add a little complexity to the controller's implementation, e.g. logging the destination broker per LeaderAndIsrRequest. Becket also shares this opinion with me. Is the only purpose of doing so to make the controller log simpler?

And certainly, I have added Todd's comment to the wiki.

Thanks,
Dong

On Tue, Feb 28, 2017 at 1:37 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

52. What you suggested would work. However, I am thinking that it's probably simpler to just set isNewReplica at the replica level. That way, the LeaderAndIsrRequest can be created a bit more simply. When reading a LeaderAndIsrRequest in the controller log, it's easier to see which replicas are new without looking at which broker the request is intended for.

Could you also add those additional points from Todd on one broker per disk vs. JBOD vs. RAID5/6 to the KIP?

Thanks,

Hi, Todd,

Thanks for the feedback. That's very useful.

Jun

On Tue, Feb 28, 2017 at 10:25 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Certainly, I have added Todd to the thread so he can reply, and I have updated the item in the wiki.

50. The full statement is "Broker assumes a log directory to be good after it starts, and marks a log directory as bad once there is an IOException when the broker attempts to access (i.e. read or write) the log directory". This statement seems reasonable, right? If a log directory is actually bad, the broker will first assume it is OK, try to read logs on this log directory, encounter an IOException, and then mark it as bad.

51. My bad. I thought I had removed it but I didn't. It is removed now.

52. I don't think so. The isNewReplica field in the LeaderAndIsrRequest is only relevant to the replica (i.e. broker) that receives the LeaderAndIsrRequest. There is no need to specify whether each replica is new inside the LeaderAndIsrRequest. In other words, if the controller sends a LeaderAndIsrRequest to three different replicas of a given partition, the isNewReplica field can be different across these three requests.

Yeah, I would definitely want to start the discussion on KIP-113 after we have reached agreement on KIP-112. I actually opened the KIP-113 discussion thread on 1/12 together with this thread. I have yet to add the ability to list offline directories to KIP-113, which we discussed in this thread.

Thanks for all your reviews! Is there any further concern with the latest KIP?

Thanks!
Dong
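(To make item 52 concrete: a minimal sketch, with made-up names that are not from the KIP or the Kafka code base, of how the controller could fill in a per-destination-broker isNewReplica value, so the same partition can carry different values in the requests sent to its different replicas.)

import java.util.Set;

// Illustrative sketch only: the controller computes isNewReplica separately for
// each destination broker, so the flag describes only the replica hosted on the
// broker that receives the LeaderAndIsrRequest.
class IsNewReplicaSketch {

    // replicasInNewReplicaState: broker ids of replicas of this partition that the
    // controller is currently transitioning from the NewReplica state to Online.
    static boolean isNewReplicaFor(int destinationBrokerId,
                                   Set<Integer> replicasInNewReplicaState) {
        return replicasInNewReplicaState.contains(destinationBrokerId);
    }
}

For example, for a partition being newly assigned to broker 3, the request sent to broker 3 would carry isNewReplica=true while the requests sent to brokers 1 and 2 would carry isNewReplica=false.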
On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

RAID6 is an improvement over RAID5 and can tolerate two disk failures. Eno's point is that the rebuild of RAID5/RAID6 requires reading more data compared with RAID10, which increases the probability of error during the rebuild. This makes sense. In any case, do you think you could ask the SREs at LinkedIn to share their opinions on RAID5/RAID6?

Yes, when a replica is offline due to a bad disk, it makes sense to handle it immediately as if a StopReplicaRequest had been received (i.e., the replica is no longer considered a leader and is removed from any replica fetcher thread). Could you add that detail to item 2 in the wiki?

50. The wiki says "Broker assumes a log directory to be good after it starts": a log directory actually could be bad during startup.

51. In item 4, the wiki says "The controller watches the path /log_dir_event_notification for new znode." This doesn't seem to be needed now?

52. The isNewReplica field in LeaderAndIsrRequest should be for each replica inside the replicas field, right?

Other than those, the current KIP looks good to me. Do you want to start a separate discussion thread on KIP-113? I do have some comments there.

Thanks for working on this!

Jun

On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin <lindon...@gmail.com> wrote:

Hi Jun,

In addition to Eno's reference on why the rebuild time with RAID-5 is more expensive, another concern is that RAID-5 will fail if more than one disk fails. JBOD still works with one or more disk failures and has better performance with one disk failure. These seem like good arguments for using JBOD instead of RAID-5.

If a leader replica goes offline, the broker should first take all actions (i.e. remove the partition from the fetcher thread) as if it had received a StopReplicaRequest for this partition, because the replica can no longer work anyway. It will also respond with an error to any ProduceRequest and FetchRequest for the partition. The broker notifies the controller by writing a notification znode in ZK. The controller learns of the disk failure event from ZK, sends a LeaderAndIsrRequest and receives the LeaderAndIsrResponse to learn that the replica is offline. The controller will then elect a new leader for this partition and send LeaderAndIsrRequest/MetadataUpdateRequest to the relevant brokers. The broker should stop adjusting the ISR for this partition as if the broker were already offline. I am not sure there is any inconsistency in the broker's behavior when it is leader or follower. Is there any concern with this approach?
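(A minimal sketch of the broker-side part of this sequence, assuming placeholder interfaces and method names that are not from the KIP or the Kafka code base:)

import java.io.IOException;
import java.util.Set;

// Illustrative sketch: on the first IOException against a log directory, the broker
// stops acting on the affected replicas (as if a StopReplicaRequest had been
// received), starts failing produce/fetch requests for them, and notifies the
// controller through a notification znode so that new leaders can be elected.
class LogDirFailureSketch {

    interface ReplicaManager {
        Set<String> partitionsOnDir(String logDir);             // partitions whose logs live on this directory
        void removeFromFetcherThreads(Set<String> partitions);  // stop follower fetching for these partitions
        void markReplicasOffline(Set<String> partitions);       // subsequent produce/fetch requests get an error
    }

    interface ZkClient {
        void createSequentialNode(String path, String data);    // e.g. a child of /log_dir_event_notification
    }

    private final ReplicaManager replicaManager;
    private final ZkClient zkClient;
    private final int brokerId;

    LogDirFailureSketch(ReplicaManager replicaManager, ZkClient zkClient, int brokerId) {
        this.replicaManager = replicaManager;
        this.zkClient = zkClient;
        this.brokerId = brokerId;
    }

    // Called when any read or write against logDir throws an IOException.
    void onLogDirIOException(String logDir, IOException cause) {
        Set<String> affected = replicaManager.partitionsOnDir(logDir);
        replicaManager.removeFromFetcherThreads(affected);
        replicaManager.markReplicasOffline(affected);
        // Let the controller discover the failure and elect new leaders for the affected partitions.
        zkClient.createSequentialNode("/log_dir_event_notification/log_dir_event_", String.valueOf(brokerId));
    }
}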
Thanks for catching this. I have removed that reference from the KIP.

Hi Eno,

Thank you for providing the reference on RAID-5. At LinkedIn we have 10 disks per Kafka machine. It will not be a show-stopper operationally for LinkedIn if we have to deploy one broker per disk. On the other hand, we previously discussed the advantages of JBOD vs. one-broker-per-disk and one-broker-per-machine. One-broker-per-disk suffers from the problems described in the KIP, and one-broker-per-machine increases the failures caused by disk failure by 10X. Since JBOD is strictly better than either of the two, it is also better than one-broker-per-multiple-disks, which is somewhere between one-broker-per-disk and one-broker-per-machine.

I personally think the benefits of the JBOD design are worth the implementation complexity it introduces. I would also argue that it is reasonable for Kafka to manage this low-level detail, because Kafka already exposes and manages the replication factor of its data. But whether the complexity is worthwhile can be subjective and I cannot prove my opinion. I am contributing a significant amount of time to this KIP because Kafka developers at LinkedIn believe it is useful and worth the effort. Yeah, it will be useful to see what everyone else thinks about it.

Thanks,
Dong

On Mon, Feb 27, 2017 at 1:16 PM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

For RAID5, I am not sure the rebuild cost is a big concern. If a disk fails, typically an admin has to bring down the broker, replace the failed disk with a new one, trigger the RAID rebuild, and bring up the broker. This way, there is no performance impact at runtime due to the rebuild. The benefit is that a broker doesn't fail in a hard way when there is a disk failure and can be brought down in a controlled way for maintenance. While the broker is running with a failed disk, reads may be more expensive since they have to be computed from the parity. However, if most reads are from the page cache, this may not be a big issue either. So, it would be useful to do some tests on RAID5 before we completely rule it out.

Regarding whether to remove an offline replica from the fetcher thread immediately: what do we do when a failed replica is a leader? Do we do nothing or mark the replica as not the leader immediately? Intuitively, it seems it's better if the broker acts consistently on a failed replica whether it's a leader or a follower. For ISR churn, I was just pointing out that if we don't send StopReplicaRequest to a broker being shut down in a controlled way, then the leader will shrink the ISR, expand it, and shrink it again after the timeout.

The KIP seems to still reference "/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".
Thanks,

Jun

On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks for the suggestion. I think it is a good idea to not put the created flag in ZK and simply specify isNewReplica=true in the LeaderAndIsrRequest if the replica was in the NewReplica state. It will only fail the replica creation in the scenario where the controller fails after topic-creation/partition-reassignment/partition-number-change but before it actually sends out the LeaderAndIsrRequest while there is an ongoing disk failure, which should be pretty rare and acceptable. This should simplify the design of this KIP.

Regarding RAID-5, I think the concern with RAID-5/6 is not just about performance when there is no failure. For example, RAID-5 can support up to one disk failure and it takes time to rebuild the disk after one disk failure. RAID 5 implementations are susceptible to system failures because of trends regarding array rebuild time and the chance of drive failure during rebuild. There is no such performance degradation for JBOD, and JBOD can support multiple log directory failures without reducing the performance of the good log directories. Would this be a reasonable reason for using JBOD instead of RAID-5/6?

Previously we discussed whether the broker should remove an offline replica from the replica fetcher thread. I still think it should do so instead of printing a lot of errors in the log4j log. We can still let the controller send StopReplicaRequest to the broker. I am not sure I understand why allowing the broker to remove an offline replica from the fetcher thread will increase churn in the ISR. Do you think this is a concern with this approach?

I have updated the KIP to remove the created flag from ZK and change the field name to isNewReplica. Can you check if there is any issue with the latest KIP? Thanks for your time!

Regards,
Dong

On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

Thanks for the reply.

Personally, I'd prefer not to write the created flag per replica in ZK. Your suggestion of disabling replica creation if there is a bad log directory on the broker could work. The only thing is that it may delay the creation of new replicas. I was thinking that an alternative is to extend LeaderAndIsrRequest by adding an isNewReplica field per replica.
That field will be set when a replica is transitioning from the NewReplica state to the Online state. Then, when a broker receives a LeaderAndIsrRequest, if a replica is marked as a new replica, it will be created on a good log directory, if not already present. Otherwise, it only creates the replica if all log directories are good and the replica is not already present. This way, we don't delay the processing of new replicas in the common case.

I am ok with not persisting the offline replicas in ZK and just discovering them through the LeaderAndIsrRequest. It handles the case where a broker starts up with bad log directories better. So, the additional overhead of rediscovering the offline replicas is justified.

Another high-level question. The proposal rejected RAID5/6 since it adds additional I/Os. The main issue with RAID5 is that to write a block that doesn't match the RAID stripe size, we have to first read the old parity to compute the new one, which increases the number of I/Os (http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you have tested RAID5's performance by creating a file system whose block size matches the RAID stripe size (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/). This way, writing a block doesn't require a read first. A large block size may increase the amount of data written, when the same block has to be written to disk multiple times. However, this is probably ok in Kafka's use case since we batch the I/O flush already. As you can see, we will be adding some complexity to support JBOD in Kafka one way or another. If we can tune the performance of RAID5 to match that of RAID10, perhaps using RAID5 is a simpler solution.

Thanks,

Jun
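(A minimal sketch of the replica-creation rule described above, for a broker processing one partition entry of a LeaderAndIsrRequest; the class and parameter names are made up for illustration and are not from the KIP:)

import java.util.List;

// Illustrative sketch of the replica-creation decision discussed in this thread.
class ReplicaCreationSketch {

    // isNewReplica: the flag proposed above, set by the controller when the replica
    //               is transitioning from the NewReplica state to Online.
    // replicaAlreadyPresent: true if the log for this replica already exists locally.
    // goodLogDirs / allLogDirs: the currently usable log directories vs. all configured ones.
    static boolean shouldCreateReplica(boolean isNewReplica,
                                       boolean replicaAlreadyPresent,
                                       List<String> goodLogDirs,
                                       List<String> allLogDirs) {
        if (replicaAlreadyPresent) {
            return false; // nothing to create
        }
        if (isNewReplica) {
            // A brand-new replica can be placed on any good log directory.
            return !goodLogDirs.isEmpty();
        }
        // An existing replica that is not found locally might live on a bad log
        // directory, so only (re)create it when every log directory is good.
        return goodLogDirs.size() == allLogDirs.size();
    }
}

The point of the isNewReplica branch is that a genuinely new replica can be placed on any good log directory immediately, while a replica whose location is unknown is only created when no log directory is offline, so data that may still exist on a bad disk is not silently re-created elsewhere.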
On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

I don't think we should allow failed replicas to be re-created on the good disks. Say there are 2 disks and each of them is 51% loaded. If either disk fails and we allow replicas to be re-created on the other disk, both disks will fail. Alternatively, we can disable replica creation if there is a bad disk on a broker. I personally think it is worth the additional complexity in the broker of storing created replicas in ZK, so that we allow new replicas to be created on the broker even when there is a bad log directory. This approach won't add complexity in the controller. But I am fine with disabling replica creation when there is a bad log directory, if that is the only blocking issue for this KIP.

Whether we store created flags is independent of whether/how we store offline replicas. Per our previous discussion, do you think it is OK not to store offline replicas in ZK and to propagate the offline replicas from broker to controller via LeaderAndIsrRequest?

Thanks,
Dong