We have tested RAID 5/6 in the past (and recently) and found it to be
lacking. So, as noted, rebuild takes more time than RAID 10 because all the
disks need to be accessed to recalculate parity. In addition, there’s a
significant performance loss just in normal operations. It’s been a while
since I ran those tests, but it was in the 30-50% range - nothing to shrug
off. We didn’t even get to failure testing because of that.

Jun - to your question, we ran the tests with numerous combinations of
block sizes and FS parameters. The performance varied, but it was never
good enough to warrant more than a superficial look at using RAID 5/6. We
also tested both software RAID and hardware RAID.

As for the operational concerns around broker-per-disk and
broker-per-server, we’ve been talking about this internally. Running one
broker per disk adds a good bit of administrative overhead and complexity.
If you perform a one-by-one rolling bounce of the cluster, you’re talking
about a 10x increase in time. That means a cluster that restarts in 30
minutes now takes 5 hours. If you try to optimize this by shutting down
all the brokers on one host at a time, you can get close to the original
number, but you now have added operational complexity by having to
micro-manage the bounce. The broker count increase will percolate down to
the rest of the administrative domain as well - maintaining ports for all
the instances, monitoring more instances, managing configs, etc.
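
To make the bounce math concrete, here is a rough back-of-the-envelope
sketch (plain Java; the broker count and per-broker restart time are
hypothetical placeholders, not numbers from our clusters):

    // Rough bounce-time estimate for one-broker-per-server vs.
    // one-broker-per-disk. All inputs are hypothetical placeholders.
    public class BounceMath {
        public static void main(String[] args) {
            int servers = 30;                 // assumed cluster size
            int disksPerServer = 10;          // JBOD disks per host
            double minutesPerRestart = 1.0;   // assumed per-broker bounce time

            double perServer = servers * minutesPerRestart;
            double perDisk = servers * disksPerServer * minutesPerRestart;

            System.out.printf("one broker per server: %.0f minutes%n", perServer);
            System.out.printf("one broker per disk:   %.0f minutes (%.0fx longer)%n",
                    perDisk, perDisk / perServer);
        }
    }

With those placeholder numbers you get 30 minutes vs. 300 minutes, which is
the 10x (30 minutes to 5 hours) difference described above.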

You also have the overhead of running the extra processes - extra heap,
task switching, etc. We don’t have a problem with page cache really, since
the VM subsystem is fairly efficient about how it works. But just because
cache works doesn’t mean we’re not wasting other resources. And that gets
pushed downstream to clients as well, because they all have to maintain
more network connections and the resources that go along with them.

Running more brokers in a cluster also exposes you to more corner cases and
race conditions within the Kafka code: bugs in the brokers, bugs in the
controller, and more complexity in balancing load across the cluster
(though having to balance load across disks within a single JBOD broker
negates that advantage).

-Todd


On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao <j...@confluent.io> wrote:

> Hi, Dong,
>
> RAID6 is an improvement over RAID5 and can tolerate 2 disk failures. Eno's
> point is that the rebuild of RAID5/RAID6 requires reading more data
> compared with RAID10, which increases the probability of error during
> rebuild. This makes sense. In any case, do you think you could ask the SREs
> at LinkedIn to share their opinions on RAID5/RAID6?
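>
> As a rough illustration of that rebuild-read argument, here is a small
> sketch (the drive size, array width, and unrecoverable-read-error rate
> below are assumptions for illustration, not vendor figures):
>
>     // Probability of hitting at least one unrecoverable read error (URE)
>     // during a rebuild, as a function of how much data must be read.
>     // All inputs are illustrative assumptions.
>     public class RebuildRisk {
>         public static void main(String[] args) {
>             double driveBytes = 4e12;      // assumed 4 TB drives
>             int drivesInArray = 10;        // assumed array width
>             double urePerBit = 1e-15;      // assumed URE rate
>
>             // RAID10: rebuild copies the surviving mirror (one drive read).
>             double raid10Bits = driveBytes * 8;
>             // RAID5/6: rebuild reads every surviving drive to recompute
>             // the missing data from parity.
>             double raid56Bits = (drivesInArray - 1) * driveBytes * 8;
>
>             System.out.printf("P(>=1 URE), RAID10 rebuild:  %.3f%n",
>                     1 - Math.pow(1 - urePerBit, raid10Bits));
>             System.out.printf("P(>=1 URE), RAID5/6 rebuild: %.3f%n",
>                     1 - Math.pow(1 - urePerBit, raid56Bits));
>         }
>     }
>
> More data read during rebuild means a higher chance of tripping over an
> error before the rebuild completes.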
>
> Yes, when a replica is offline due to a bad disk, it makes sense to handle
> it immediately as if a StopReplicaRequest had been received (i.e., the
> replica is no longer considered a leader and is removed from any replica
> fetcher thread). Could you add that detail to item 2 in the wiki?
>
> 50. The wiki says "Broker assumes a log directory to be good after it
> starts": A log directory actually could be bad during startup.
>
> 51. In item 4, the wiki says "The controller watches the path
> /log_dir_event_notification for new znode.". This doesn't seem to be needed
> now?
>
> 52. The isNewReplica field in LeaderAndIsrRequest should be for each
> replica inside the replicas field, right?
>
> Other than those, the current KIP looks good to me. Do you want to start a
> separate discussion thread on KIP-113? I do have some comments there.
>
> Thanks for working on this!
>
> Jun
>
>
> On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin <lindon...@gmail.com> wrote:
>
> > Hi Jun,
> >
> > In addition to Eno's reference on why rebuild time with RAID-5 is more
> > expensive, another concern is that RAID-5 will fail if more than one
> > disk fails. JBOD still works with one or more disk failures and has
> > better performance with one disk failure. These seem like good arguments
> > for using JBOD instead of RAID-5.
> >
> > If a leader replica goes offline, the broker should first take all
> > actions (i.e., remove the partition from the fetcher thread) as if it
> > had received a StopReplicaRequest for this partition, because the
> > replica can no longer work anyway. It will also respond with an error to
> > any ProduceRequest and FetchRequest for the partition. The broker
> > notifies the controller by writing a notification znode in ZK. The
> > controller learns of the disk failure event from ZK, sends a
> > LeaderAndIsrRequest, and learns from the LeaderAndIsrResponse that the
> > replica is offline. The controller will then elect a new leader for this
> > partition and send LeaderAndIsrRequest/MetadataUpdateRequest to the
> > relevant brokers. The broker should stop adjusting the ISR for this
> > partition as if the broker were already offline. I am not sure there is
> > any inconsistency in the broker's behavior when it is leader or
> > follower. Is there any concern with this approach?
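> >
> > A minimal sketch of the broker-side part of that flow (all type and
> > method names below are hypothetical stand-ins, not Kafka's actual
> > internals):
> >
> >     // Sketch of how a broker could react when a replica's disk goes bad,
> >     // following the steps described above. Types are hypothetical.
> >     class OfflineReplicaSketch {
> >         interface FetcherManager { void removeFetcher(String topicPartition); }
> >         interface ZkNotifier { void writeLogDirEventNotification(int brokerId); }
> >
> >         private final FetcherManager fetcherManager;
> >         private final ZkNotifier zkNotifier;
> >         private final int brokerId;
> >
> >         OfflineReplicaSketch(FetcherManager f, ZkNotifier z, int brokerId) {
> >             this.fetcherManager = f;
> >             this.zkNotifier = z;
> >             this.brokerId = brokerId;
> >         }
> >
> >         void onDiskFailure(String topicPartition) {
> >             // 1. Act as if a StopReplicaRequest had been received: stop
> >             //    fetching and stop serving this partition.
> >             fetcherManager.removeFetcher(topicPartition);
> >             // 2. From here on the broker answers Produce/Fetch requests
> >             //    for this partition with an error and stops adjusting
> >             //    its ISR, as if the broker itself were offline.
> >             // 3. Notify the controller via a notification znode; the
> >             //    controller sends LeaderAndIsrRequest, learns from the
> >             //    response that the replica is offline, elects a new
> >             //    leader, and updates the relevant brokers.
> >             zkNotifier.writeLogDirEventNotification(brokerId);
> >         }
> >     }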
> >
> > Thanks for catching this. I have removed that reference from the KIP.
> >
> > Hi Eno,
> >
> > Thank you for providing the reference on RAID-5. At LinkedIn we have 10
> > disks per Kafka machine. It will not be a show-stopper operationally for
> > LinkedIn if we have to deploy one-broker-per-disk. On the other hand, we
> > previously discussed the advantages of JBOD vs. one-broker-per-disk or
> > one-broker-per-machine. One-broker-per-disk suffers from the problems
> > described in the KIP, and one-broker-per-machine increases the failures
> > caused by disk failure by 10X. Since JBOD is strictly better than either
> > of the two, it is also better than one-broker-per-multiple-disk, which
> > is somewhere between one-broker-per-disk and one-broker-per-machine.
> >
> > I personally think the benefits of the JBOD design are worth the
> > implementation complexity it introduces. I would also argue that it is
> > reasonable for Kafka to manage this low-level detail because Kafka
> > already exposes and manages the replication factor of its data. But
> > whether the complexity is worthwhile can be subjective and I cannot
> > prove my opinion. I am contributing a significant amount of time to this
> > KIP because Kafka developers at LinkedIn believe it is useful and worth
> > the effort. Yeah, it will be useful to see what everyone else thinks
> > about it.
> >
> >
> > Thanks,
> > Dong
> >
> >
> > On Mon, Feb 27, 2017 at 1:16 PM, Jun Rao <j...@confluent.io> wrote:
> >
> > > Hi, Dong,
> > >
> > > For RAID5, I am not sure the rebuild cost is a big concern. If a disk
> > > fails, typically an admin has to bring down the broker, replace the
> > > failed disk with a new one, trigger the RAID rebuild, and bring up the
> > > broker. This way, there is no performance impact at runtime due to
> > > rebuild. The benefit is that a broker doesn't fail in a hard way when
> > > there is a disk failure and can be brought down in a controlled way
> > > for maintenance. While the broker is running with a failed disk, reads
> > > may be more expensive since they have to be computed from the parity.
> > > However, if most reads are from page cache, this may not be a big
> > > issue either. So, it would be useful to do some tests on RAID5 before
> > > we completely rule it out.
> > >
> > > Regarding whether to remove an offline replica from the fetcher thread
> > > immediately: what do we do when a failed replica is a leader? Do we do
> > > nothing or mark the replica as not the leader immediately?
> > > Intuitively, it seems better if the broker acts consistently on a
> > > failed replica whether it's a leader or a follower. For ISR churns, I
> > > was just pointing out that if we don't send StopReplicaRequest to a
> > > broker to be shut down in a controlled way, then the leader will
> > > shrink the ISR, expand it, and shrink it again after the timeout.
> > >
> > > The KIP seems to still reference
> > > "/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindon...@gmail.com> wrote:
> > >
> > > > Hey Jun,
> > > >
> > > > Thanks for the suggestion. I think it is a good idea to not put the
> > > > created flag in ZK and simply specify isNewReplica=true in the
> > > > LeaderAndIsrRequest if the replica was in the NewReplica state. It
> > > > will only fail the replica creation in the scenario where the
> > > > controller fails after
> > > > topic-creation/partition-reassignment/partition-number-change but
> > > > before it actually sends out the LeaderAndIsrRequest while there is
> > > > an ongoing disk failure, which should be pretty rare and acceptable.
> > > > This should simplify the design of this KIP.
> > > >
> > > > Regarding RAID-5, I think the concern with RAID-5/6 is not just about
> > > > performance when there is no failure. For example, RAID-5 can support
> > > > up to one disk failure, and it takes time to rebuild the disk after
> > > > that failure. RAID 5 implementations are susceptible to system
> > > > failures because of trends regarding array rebuild time and the
> > > > chance of drive failure during rebuild. There is no such performance
> > > > degradation for JBOD, and JBOD can support multiple log directory
> > > > failures without reducing the performance of the good log
> > > > directories. Would this be a reasonable reason for using JBOD instead
> > > > of RAID-5/6?
> > > >
> > > > Previously we discussed whether the broker should remove an offline
> > > > replica from the replica fetcher thread. I still think it should do
> > > > so instead of printing a lot of errors in the log4j log. We can still
> > > > let the controller send StopReplicaRequest to the broker. I am not
> > > > sure I understand why allowing the broker to remove an offline
> > > > replica from the fetcher thread would increase churn in the ISR. Do
> > > > you think this is a concern with this approach?
> > > >
> > > > I have updated the KIP to remove the created flag from ZK and change
> > > > the field name to isNewReplica. Can you check if there is any issue
> > > > with the latest KIP? Thanks for your time!
> > > >
> > > > Regards,
> > > > Dong
> > > >
> > > >
> > > > On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io> wrote:
> > > >
> > > > > Hi, Dong,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > Personally, I'd prefer not to write the created flag per replica
> > > > > in ZK. Your suggestion of disabling replica creation if there is a
> > > > > bad log directory on the broker could work. The only thing is that
> > > > > it may delay the creation of new replicas. I was thinking that an
> > > > > alternative is to extend LeaderAndIsrRequest by adding an
> > > > > isNewReplica field per replica. That field will be set when a
> > > > > replica is transitioning from the NewReplica state to the Online
> > > > > state. Then, when a broker receives a LeaderAndIsrRequest, if a
> > > > > replica is marked as a new replica, it will be created on a good
> > > > > log directory, if not already present. Otherwise, it only creates
> > > > > the replica if all log directories are good and the replica is not
> > > > > already present. This way, we don't delay the processing of new
> > > > > replicas in the common case.
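> > > > >
> > > > > To make that decision rule concrete, a small sketch of the
> > > > > broker-side check (purely illustrative, not the actual
> > > > > implementation):
> > > > >
> > > > >     // Sketch: should the broker create a replica when processing a
> > > > >     // LeaderAndIsrRequest carrying the proposed isNewReplica flag?
> > > > >     // Illustrative only, not Kafka's actual code.
> > > > >     class ReplicaCreationSketch {
> > > > >         boolean shouldCreateReplica(boolean isNewReplica,
> > > > >                                     boolean alreadyPresent,
> > > > >                                     boolean allLogDirsGood) {
> > > > >             if (alreadyPresent) {
> > > > >                 return false;  // nothing to create
> > > > >             }
> > > > >             if (isNewReplica) {
> > > > >                 // Transitioning from NewReplica to Online: create it
> > > > >                 // on a good log directory even if some directories
> > > > >                 // on this broker are bad.
> > > > >                 return true;
> > > > >             }
> > > > >             // Otherwise only create it when all log directories are
> > > > >             // good, so a replica that may live on a bad disk is not
> > > > >             // silently re-created elsewhere.
> > > > >             return allLogDirsGood;
> > > > >         }
> > > > >     }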
> > > > >
> > > > > I am ok with not persisting the offline replicas in ZK and just
> > > > > discovering them through the LeaderAndIsrRequest. It handles the
> > > > > cases when a broker starts up with bad log directories better. So,
> > > > > the additional overhead of rediscovering the offline replicas is
> > > > > justified.
> > > > >
> > > > >
> > > > > Another high level question. The proposal rejected RAID5/6 since
> > > > > it adds additional I/Os. The main issue with RAID5 is that to write
> > > > > a block that doesn't match the RAID stripe size, we have to first
> > > > > read the old parity to compute the new one, which increases the
> > > > > number of I/Os (http://rickardnobel.se/raid-5-write-penalty/). I am
> > > > > wondering if you have tested RAID5's performance by creating a file
> > > > > system whose block size matches the RAID stripe size
> > > > > (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/).
> > > > > This way, writing a block doesn't require a read first. A large
> > > > > block size may increase the amount of data written, when the same
> > > > > block has to be written to disk multiple times. However, this is
> > > > > probably ok in Kafka's use case since we batch the I/O flush
> > > > > already. As you can see, we will be adding some complexity to
> > > > > support JBOD in Kafka one way or another. If we can tune the
> > > > > performance of RAID5 to match that of RAID10, perhaps using RAID5
> > > > > is a simpler solution.
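> > > > >
> > > > > For reference, the I/O counting behind that write penalty, as a
> > > > > tiny sketch (simplified read-modify-write vs. full-stripe write; it
> > > > > ignores caching and controller optimizations):
> > > > >
> > > > >     // Simplified count of physical I/Os for RAID5 writes, showing
> > > > >     // why stripe-aligned writes avoid the read-modify-write penalty.
> > > > >     public class Raid5WritePenalty {
> > > > >         public static void main(String[] args) {
> > > > >             int dataDisks = 9;  // e.g. 10-disk RAID5: 9 data + 1 parity
> > > > >
> > > > >             // Partial-stripe write of one block: read old data and
> > > > >             // old parity, write new data and new parity => 4 I/Os.
> > > > >             int partialStripeIos = 4;
> > > > >
> > > > >             // Full-stripe write: parity is computed from the new
> > > > >             // data, so no reads; write all data blocks plus parity.
> > > > >             double fullStripePerBlock = (dataDisks + 1) / (double) dataDisks;
> > > > >
> > > > >             System.out.println("partial-stripe: " + partialStripeIos
> > > > >                     + " I/Os per data block");
> > > > >             System.out.printf("full-stripe:    %.2f I/Os per data block%n",
> > > > >                     fullStripePerBlock);
> > > > >         }
> > > > >     }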
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > > On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hey Jun,
> > > > > >
> > > > > > I don't think we should allow failed replicas to be re-created
> > > > > > on the good disks. Say there are 2 disks and each of them is 51%
> > > > > > loaded. If either disk fails and we allow its replicas to be
> > > > > > re-created on the other disk, both disks will fail.
> > > > > > Alternatively, we can disable replica creation if there is a bad
> > > > > > disk on a broker. I personally think it is worth the additional
> > > > > > complexity in the broker to store created replicas in ZK so that
> > > > > > we allow new replicas to be created on the broker even when there
> > > > > > is a bad log directory. This approach won't add complexity in the
> > > > > > controller. But I am fine with disabling replica creation when
> > > > > > there is a bad log directory if that is the only blocking issue
> > > > > > for this KIP.
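> > > > > >
> > > > > > The capacity math behind the 51% example, just to spell it out (a
> > > > > > trivial sketch; the 51% figure is the one from above):
> > > > > >
> > > > > >     // With two disks each 51% full, re-creating the failed disk's
> > > > > >     // replicas on the survivor would need 102% of its capacity.
> > > > > >     public class RecreateOnSurvivorSketch {
> > > > > >         public static void main(String[] args) {
> > > > > >             double usagePerDisk = 0.51;  // 51% loaded, as above
> > > > > >             double survivorUsage = usagePerDisk + usagePerDisk;
> > > > > >             System.out.printf("survivor usage after re-creation: %.0f%%%n",
> > > > > >                     survivorUsage * 100);
> > > > > >             System.out.println("fits within capacity: "
> > > > > >                     + (survivorUsage <= 1.0));
> > > > > >         }
> > > > > >     }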
> > > > > >
> > > > > > Whether we store created flags is independent of whether/how we
> > > > > > store offline replicas. Per our previous discussion, do you think
> > > > > > it is OK not to store offline replicas in ZK and to propagate the
> > > > > > offline replicas from the broker to the controller via
> > > > > > LeaderAndIsrRequest?
> > > > > >
> > > > > > Thanks,
> > > > > > Dong
> > > > > >
> > > > >
> > > >
> > >
> >
>



-- 
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming



linkedin.com/in/toddpalino
