Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

Jun Rao Sat, 25 Feb 2017 09:12:52 -0800

Hi, Dong,

Thanks for the reply.

Personally, I'd prefer not to write the created flag per replica in ZK.
Your suggestion of disabling replica creation if there is a bad log
directory on the broker could work. The only thing is that it may delay the
creation of new replicas. I was thinking that an alternative is to extend
LeaderAndIsrRequest by adding an isNewReplica field per replica. That field
will be set when a replica is transitioning from the NewReplica state to the
Online state. Then, when a broker receives a LeaderAndIsrRequest, if a
replica is marked as the new replica, it will be created on a good log
directory, if not already present. Otherwise, it only creates the replica
if all log directories are good and the replica is not already present.
This way, we don't delay the processing of new replicas in the common case.
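
To make the rule above concrete, here is a rough sketch of the broker-side
decision. It is only an illustration: Broker, handle_replica, and the
"pick the first good directory" placement are hypothetical names and
assumptions, not actual Kafka code.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    # Hypothetical model of a broker's view of its log directories.
    good_dirs: list                                # log directories currently online
    bad_dirs: list                                 # log directories known to be offline
    replicas: dict = field(default_factory=dict)   # partition -> log dir holding it


def handle_replica(broker, partition, is_new_replica):
    """Return the log dir holding the replica, or None if creation is refused."""
    # Replica already present on a good directory: nothing to create.
    if partition in broker.replicas:
        return broker.replicas[partition]
    # No usable directory at all: cannot create anything.
    if not broker.good_dirs:
        return None
    # Flagged as new: safe to create on any good directory.
    # Not flagged: create only if every directory is good, because otherwise
    # the replica may already exist on the bad directory.
    if is_new_replica or not broker.bad_dirs:
        chosen = broker.good_dirs[0]  # placement policy is an assumption
        broker.replicas[partition] = chosen
        return chosen
    return None
```

With one good and one bad directory, a replica flagged as new is created on
the good directory, while an unflagged, unknown replica is not created.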

I am ok with not persisting the offline replicas in ZK and just discovering
them through the LeaderAndIsrRequest. It handles the cases when a broker
starts up with bad log directories better. So, the additional overhead of
rediscovering the offline replicas is justified.

Another high-level question. The proposal rejected RAID5/6 since it adds
additional I/Os. The main issue with RAID5 is that to write a block that
doesn't match the RAID stripe size, we have to first read the old parity to
compute the new one, which increases the number of I/Os (
http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you have
tested RAID5's performance by creating a file system whose block size
matches the RAID stripe size
(https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/).
This way, writing a block doesn't require a read first. A large block size
may increase the amount of data written, when the same block has to be
written to disk multiple times.
However, this is probably ok in Kafka's use case since we batch the I/O
flush already. As you can see, we will be adding some complexity to support
JBOD in Kafka one way or another. If we can tune the performance of RAID5
to match that of RAID10, perhaps using RAID5 is a simpler solution.
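
As a back-of-the-envelope illustration of the I/O counts involved, the
sketch below uses the simplified textbook model of RAID writes. The function
names are made up for this example, and the numbers are not measurements;
real arrays with caching or a different layout may behave differently.

```python
def raid5_small_write_ios(blocks):
    # Partial-stripe (read-modify-write) update, per block:
    # read old data + read old parity + write new data + write new parity.
    return 4 * blocks


def raid5_full_stripe_ios(stripes, disks):
    # Full-stripe write: parity is computed from the new data alone, so
    # each stripe costs (disks - 1) data writes + 1 parity write, no reads.
    return stripes * disks


def raid10_write_ios(blocks):
    # Each block is written to both sides of a mirror.
    return 2 * blocks


# Writing 3 data blocks on a 4-disk RAID5 (3 data + 1 parity per stripe).
small = raid5_small_write_ios(3)
full = raid5_full_stripe_ios(1, 4)
mirrored = raid10_write_ios(3)
```

Under this model, the same 3 blocks cost 12 I/Os as partial-stripe writes,
4 I/Os as one full-stripe write, and 6 I/Os on RAID10.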

Thanks,

Jun

On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com> wrote:

> Hey Jun,
>
> I don't think we should allow failed replicas to be re-created on the good
> disks. Say there are 2 disks and each of them is 51% loaded. If either disk
> fails and we allow its replicas to be re-created on the other disk, both
> disks will fail. Alternatively, we can disable replica creation if there is
> a bad disk on a broker. I personally think it is worth the additional
> complexity in the broker to store created replicas in ZK so that we allow
> new replicas to be created on the broker even when there is a bad log
> directory. This approach won't add complexity in the controller. But I am
> fine with disabling replica creation when there is a bad log directory if
> that is the only blocking issue for this KIP.
>
> Whether we store created flags is independent of whether/how we store
> offline replicas. Per our previous discussion, do you think it is OK not to
> store offline replicas in ZK and propagate the offline replicas from broker
> to controller via LeaderAndIsrRequest?
>
> Thanks,
> Dong
>
