RAID-10's code is much simpler (just stripe plus mirror), and recovery after a failure is much faster since it only has to read from the surviving mirror rather than reconstruct the data from several disks. Of course, the price paid is that mirroring is more expensive in terms of storage space.
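To make the space and rebuild trade-off concrete, here is a rough back-of-the-envelope sketch; the disk count and sizes are illustrative assumptions, not numbers from this thread:

    // Rough comparison of usable capacity and rebuild read volume,
    // assuming n identical disks of sizeTb each (illustrative numbers only).
    public class RaidComparison {
        public static void main(String[] args) {
            int n = 12;          // assumed number of disks in the array
            double sizeTb = 4.0; // assumed size of each disk in TB

            // RAID-10: half the raw capacity is usable; a rebuild reads one mirror.
            double raid10Usable = n * sizeTb / 2;
            double raid10RebuildReadTb = sizeTb;

            // RAID-6: two disks' worth of parity; a rebuild reads all surviving disks.
            double raid6Usable = (n - 2) * sizeTb;
            double raid6RebuildReadTb = (n - 1) * sizeTb;

            System.out.printf("RAID-10: usable %.1f TB, rebuild reads %.1f TB%n",
                    raid10Usable, raid10RebuildReadTb);
            System.out.printf("RAID-6:  usable %.1f TB, rebuild reads %.1f TB%n",
                    raid6Usable, raid6RebuildReadTb);
        }
    }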
E.g., see the discussion at https://community.spiceworks.com/topic/1155094-raid-10-and-raid-6-is-either-one-really-better-than-the-other. So yes, if you can afford the space, go for RAID-10. If utilising storage space well is what you care about, nothing beats utilising the JBOD disks one-by-one (while replicating at a higher level, as Kafka does). However, there is now more complexity for Kafka.

Dong, how many disks do you typically expect in a JBOD? 12, 24, or higher? Are we absolutely sure that running 2-3 brokers/JBOD is a show-stopper operationally? I guess that would increase the rolling restart time (more brokers), but it would be great if we could have a conclusive, strong argument against it. I don't have operational experience with Kafka, so I don't have a strong opinion, but is everyone else convinced?

Eno

On 27 Feb 2017, at 22:10, Jun Rao <j...@confluent.io> wrote:

Hi, Eno,

Thanks for the pointers. Doesn't RAID-10 have a similar issue during a rebuild? In both cases, all the data on the existing disks has to be read during the rebuild? RAID-10 seems to still be used widely.

Jun

On Mon, Feb 27, 2017 at 1:38 PM, Eno Thereska <eno.there...@gmail.com> wrote:

Unfortunately RAID-5/6 is not typically advised anymore due to failure issues, as Dong mentions, e.g.: http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/

Eno
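For context, the argument in that article is that unrecoverable read errors (UREs) become likely once a rebuild has to read tens of terabytes. A rough sketch of that arithmetic, using an assumed consumer-drive URE spec of one error per 1e14 bits (all numbers illustrative):

    // Expected unrecoverable read errors (UREs) while rebuilding an array,
    // assuming a URE rate of 1 per 1e14 bits read (a common consumer-drive spec).
    public class RebuildUre {
        public static void main(String[] args) {
            double ureRatePerBit = 1e-14;   // assumed URE specification
            double dataReadTb = 44.0;       // e.g. 11 surviving 4 TB disks in RAID-6
            double bitsRead = dataReadTb * 1e12 * 8;

            double expectedUres = bitsRead * ureRatePerBit;
            // Probability of hitting at least one URE during the rebuild.
            double pAtLeastOne = 1 - Math.pow(1 - ureRatePerBit, bitsRead);

            System.out.printf("Expected UREs: %.2f, P(>=1 URE): %.2f%n",
                    expectedUres, pAtLeastOne);
        }
    }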
On 27 Feb 2017, at 21:16, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

For RAID-5, I am not sure the rebuild cost is a big concern. If a disk fails, typically an admin has to bring down the broker, replace the failed disk with a new one, trigger the RAID rebuild, and bring up the broker. This way, there is no performance impact at runtime due to the rebuild. The benefit is that a broker doesn't fail in a hard way when there is a disk failure and can be brought down in a controlled way for maintenance. While the broker is running with a failed disk, reads may be more expensive since they have to be computed from the parity. However, if most reads are from the page cache, this may not be a big issue either. So, it would be useful to do some tests on RAID-5 before we completely rule it out.

Regarding whether to remove an offline replica from the fetcher thread immediately: what do we do when a failed replica is a leader? Do we do nothing, or mark the replica as not the leader immediately? Intuitively, it seems better if the broker acts consistently on a failed replica whether it's a leader or a follower. For ISR churn, I was just pointing out that if we don't send StopReplicaRequest to a broker being shut down in a controlled way, then the leader will shrink the ISR, expand it, and shrink it again after the timeout.

The KIP seems to still reference "/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".

Thanks,

Jun

On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks for the suggestion. I think it is a good idea not to put the created flag in ZK and simply specify isNewReplica=true in the LeaderAndIsrRequest if the replica was in the NewReplica state. It will only fail the replica creation in the scenario where the controller fails after a topic creation, partition reassignment, or partition-number change but before it actually sends out the LeaderAndIsrRequest while there is an ongoing disk failure, which should be pretty rare and acceptable. This should simplify the design of this KIP.

Regarding RAID-5, I think the concern with RAID-5/6 is not just about performance when there is no failure. For example, RAID-5 can support at most one disk failure, and it takes time to rebuild the disk after a failure. RAID-5 implementations are susceptible to system failures because of trends in array rebuild time and the chance of drive failure during a rebuild. There is no such performance degradation for JBOD, and JBOD can tolerate multiple log directory failures without reducing the performance of the good log directories. Would this be a reasonable reason for using JBOD instead of RAID-5/6?

Previously we discussed whether the broker should remove an offline replica from the replica fetcher thread. I still think it should do so instead of printing a lot of errors in the log4j log. We can still let the controller send StopReplicaRequest to the broker. I am not sure I understand why allowing the broker to remove an offline replica from the fetcher thread will increase churn in the ISR. Do you think this is a concern with this approach?

I have updated the KIP to remove the created flag from ZK and change the field name to isNewReplica. Can you check if there is any issue with the latest KIP? Thanks for your time!

Regards,
Dong

On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

Thanks for the reply.

Personally, I'd prefer not to write the created flag per replica in ZK. Your suggestion of disabling replica creation if there is a bad log directory on the broker could work. The only thing is that it may delay the creation of new replicas. I was thinking that an alternative is to extend LeaderAndIsrRequest by adding an isNewReplica field per replica. That field will be set when a replica is transitioning from the NewReplica state to the Online state. Then, when a broker receives a LeaderAndIsrRequest, if a replica is marked as a new replica, it will be created on a good log directory, if not already present. Otherwise, it only creates the replica if all log directories are good and the replica is not already present. This way, we don't delay the processing of new replicas in the common case.
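As a minimal sketch of that rule on the broker side (the class, method, and parameter names below are hypothetical, not taken from the KIP or the Kafka code):

    import java.util.List;

    // Hypothetical sketch of the replica-creation rule described above.
    public class ReplicaCreationRule {

        // present:      the replica already exists on a good log directory
        // isNewReplica: the controller marked it as NewReplica -> Online
        static boolean shouldCreate(boolean present,
                                    boolean isNewReplica,
                                    List<String> goodLogDirs,
                                    List<String> allLogDirs) {
            if (present) {
                return false; // already there, nothing to create
            }
            if (isNewReplica) {
                return true;  // brand-new replica: any good log directory will do
            }
            // An existing replica we cannot find: only re-create it when every log
            // directory is healthy, otherwise it probably lives on the failed disk.
            return goodLogDirs.size() == allLogDirs.size();
        }

        public static void main(String[] args) {
            List<String> all = List.of("/data1", "/data2");
            List<String> good = List.of("/data1");
            System.out.println(shouldCreate(false, true, good, all));  // true
            System.out.println(shouldCreate(false, false, good, all)); // false
        }
    }

The point of the flag is that a genuinely new replica can be placed on any healthy directory, while a missing existing replica is assumed to live on the failed disk and is left offline.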
I am ok with not persisting the offline replicas in ZK and just discovering them through the LeaderAndIsrRequest. It handles the case where a broker starts up with bad log directories better, so the additional overhead of rediscovering the offline replicas is justified.

Another high-level question: the proposal rejected RAID-5/6 since it adds additional I/Os. The main issue with RAID-5 is that to write a block that doesn't match the RAID stripe size, we have to first read the old parity to compute the new one, which increases the number of I/Os (http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you have tested RAID-5's performance by creating a file system whose block size matches the RAID stripe size (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/). This way, writing a block doesn't require a read first. A large block size may increase the amount of data written when the same block has to be written to disk multiple times. However, this is probably ok in Kafka's use case since we batch the I/O flush already. As you can see, we will be adding some complexity to support JBOD in Kafka one way or another. If we can tune the performance of RAID-5 to match that of RAID-10, perhaps using RAID-5 is a simpler solution.

Thanks,

Jun

On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

I don't think we should allow failed replicas to be re-created on the good disks. Say there are two disks and each of them is 51% loaded. If either disk fails and we allow its replicas to be re-created on the other disk, both disks will fail. Alternatively, we can disable replica creation if there is a bad disk on a broker. I personally think it is worth the additional complexity in the broker to store created replicas in ZK, so that we allow new replicas to be created on the broker even when there is a bad log directory. This approach won't add complexity in the controller. But I am fine with disabling replica creation when there is a bad log directory, if that is the only blocking issue for this KIP.

Whether we store created flags is independent of whether/how we store offline replicas. Per our previous discussion, do you think it is OK not to store offline replicas in ZK and to propagate the offline replicas from broker to controller via the LeaderAndIsrRequest?

Thanks,
Dong
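The two-disk example above amounts to a one-line calculation; a tiny sketch with a made-up disk size:

    // Why re-creating replicas from a failed disk onto the surviving one overflows
    // it, using the 51%-loaded example above (the disk size is made up).
    public class DiskOverflowExample {
        public static void main(String[] args) {
            double diskSizeTb = 4.0;
            double usedPerDiskTb = 0.51 * diskSizeTb;  // each disk 51% loaded

            // If disk 1 fails and its replicas are re-created on disk 2:
            double neededOnSurvivorTb = usedPerDiskTb + usedPerDiskTb;
            System.out.printf("Survivor needs %.2f TB but only has %.2f TB%n",
                    neededOnSurvivorTb, diskSizeTb);
            // 102% of one disk's capacity, so the second disk fills up and fails too.
        }
    }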