RAID-10's code is much simpler (just stripe plus mirror), and recovery after a failure is much faster since it only has to read from the surviving mirror rather than reconstruct the data from several disks. Of course, the price paid is that mirroring is more expensive in terms of storage space.
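To make the space and rebuild trade-off concrete, here is a rough back-of-the-envelope sketch; the disk count and sizes are illustrative assumptions, not numbers from this thread:

    // Rough comparison of usable capacity and rebuild read volume,
    // assuming n identical disks of sizeTb each (illustrative numbers only).
    public class RaidComparison {
        public static void main(String[] args) {
            int n = 12;          // assumed number of disks in the array
            double sizeTb = 4.0; // assumed size of each disk in TB

            // RAID-10: half the raw capacity is usable; a rebuild reads one mirror.
            double raid10Usable = n * sizeTb / 2;
            double raid10RebuildReadTb = sizeTb;

            // RAID-6: two disks' worth of parity; a rebuild reads all surviving disks.
            double raid6Usable = (n - 2) * sizeTb;
            double raid6RebuildReadTb = (n - 1) * sizeTb;

            System.out.printf("RAID-10: usable %.1f TB, rebuild reads %.1f TB%n",
                    raid10Usable, raid10RebuildReadTb);
            System.out.printf("RAID-6:  usable %.1f TB, rebuild reads %.1f TB%n",
                    raid6Usable, raid6RebuildReadTb);
        }
    }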
E.g., see the discussion at https://community.spiceworks.com/topic/1155094-raid-10-and-raid-6-is-either-one-really-better-than-the-other. So yes, if you can afford the space, go for RAID-10. If utilising storage space well is what you care about, nothing beats utilising the JBOD disks one-by-one (while replicating at a higher level, as Kafka does). However, there is now more complexity for Kafka.

Dong, how many disks do you typically expect in a JBOD? 12, 24, or higher? Are we absolutely sure that running 2-3 brokers/JBOD is a show-stopper operationally? I guess that would increase the rolling restart time (more brokers), but it would be great if we could have a conclusive, strong argument against it. I don't have operational experience with Kafka, so I don't have a strong opinion, but is everyone else convinced?

Eno

On 27 Feb 2017, at 22:10, Jun Rao <j...@confluent.io> wrote:

Hi, Eno,

Thanks for the pointers. Doesn't RAID-10 have a similar issue during a rebuild? In both cases, all the data on the existing disks has to be read during the rebuild? RAID-10 seems to still be used widely.

Jun

On Mon, Feb 27, 2017 at 1:38 PM, Eno Thereska <eno.there...@gmail.com> wrote:

Unfortunately RAID-5/6 is not typically advised anymore due to failure issues, as Dong mentions, e.g.: http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/

Eno
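For context, the argument in that article is that unrecoverable read errors (UREs) become likely once a rebuild has to read tens of terabytes. A rough sketch of that arithmetic, using an assumed consumer-drive URE spec of one error per 1e14 bits (all numbers illustrative):

    // Expected unrecoverable read errors (UREs) while rebuilding an array,
    // assuming a URE rate of 1 per 1e14 bits read (a common consumer-drive spec).
    public class RebuildUre {
        public static void main(String[] args) {
            double ureRatePerBit = 1e-14;   // assumed URE specification
            double dataReadTb = 44.0;       // e.g. 11 surviving 4 TB disks in RAID-6
            double bitsRead = dataReadTb * 1e12 * 8;

            double expectedUres = bitsRead * ureRatePerBit;
            // Probability of hitting at least one URE during the rebuild.
            double pAtLeastOne = 1 - Math.pow(1 - ureRatePerBit, bitsRead);

            System.out.printf("Expected UREs: %.2f, P(>=1 URE): %.2f%n",
                    expectedUres, pAtLeastOne);
        }
    }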
On 27 Feb 2017, at 21:16, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

For RAID-5, I am not sure the rebuild cost is a big concern. If a disk fails, typically an admin has to bring down the broker, replace the failed disk with a new one, trigger the RAID rebuild, and bring up the broker. This way, there is no performance impact at runtime due to the rebuild. The benefit is that a broker doesn't fail in a hard way when there is a disk failure and can be brought down in a controlled way for maintenance. While the broker is running with a failed disk, reads may be more expensive since they have to be computed from the parity. However, if most reads are from the page cache, this may not be a big issue either. So, it would be useful to do some tests on RAID-5 before we completely rule it out.

Regarding whether to remove an offline replica from the fetcher thread immediately: what do we do when a failed replica is a leader? Do we do nothing, or mark the replica as not the leader immediately? Intuitively, it seems better if the broker acts consistently on a failed replica whether it's a leader or a follower. For ISR churn, I was just pointing out that if we don't send StopReplicaRequest to a broker being shut down in a controlled way, then the leader will shrink the ISR, expand it, and shrink it again after the timeout.

The KIP seems to still reference "/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".

Thanks,

Jun

On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

Thanks for the suggestion. I think it is a good idea not to put the created flag in ZK and simply specify isNewReplica=true in the LeaderAndIsrRequest if the replica was in the NewReplica state. It will only fail the replica creation in the scenario where the controller fails after a topic creation, partition reassignment, or partition-number change but before it actually sends out the LeaderAndIsrRequest while there is an ongoing disk failure, which should be pretty rare and acceptable. This should simplify the design of this KIP.

Regarding RAID-5, I think the concern with RAID-5/6 is not just about performance when there is no failure. For example, RAID-5 can support at most one disk failure, and it takes time to rebuild the disk after a failure. RAID-5 implementations are susceptible to system failures because of trends in array rebuild time and the chance of drive failure during a rebuild. There is no such performance degradation for JBOD, and JBOD can tolerate multiple log directory failures without reducing the performance of the good log directories. Would this be a reasonable reason for using JBOD instead of RAID-5/6?

Previously we discussed whether the broker should remove an offline replica from the replica fetcher thread. I still think it should do so instead of printing a lot of errors in the log4j log. We can still let the controller send StopReplicaRequest to the broker. I am not sure I understand why allowing the broker to remove an offline replica from the fetcher thread will increase churn in the ISR. Do you think this is a concern with this approach?

I have updated the KIP to remove the created flag from ZK and change the field name to isNewReplica. Can you check if there is any issue with the latest KIP? Thanks for your time!

Regards,
Dong

On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io> wrote:

Hi, Dong,

Thanks for the reply.

Personally, I'd prefer not to write the created flag per replica in ZK. Your suggestion of disabling replica creation if there is a bad log directory on the broker could work. The only thing is that it may delay the creation of new replicas. I was thinking that an alternative is to extend LeaderAndIsrRequest by adding an isNewReplica field per replica. That field will be set when a replica is transitioning from the NewReplica state to the Online state. Then, when a broker receives a LeaderAndIsrRequest, if a replica is marked as a new replica, it will be created on a good log directory, if not already present. Otherwise, it only creates the replica if all log directories are good and the replica is not already present. This way, we don't delay the processing of new replicas in the common case.
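As a minimal sketch of that rule on the broker side (the class, method, and parameter names below are hypothetical, not taken from the KIP or the Kafka code):

    import java.util.List;

    // Hypothetical sketch of the replica-creation rule described above.
    public class ReplicaCreationRule {

        // present:      the replica already exists on a good log directory
        // isNewReplica: the controller marked it as NewReplica -> Online
        static boolean shouldCreate(boolean present,
                                    boolean isNewReplica,
                                    List<String> goodLogDirs,
                                    List<String> allLogDirs) {
            if (present) {
                return false; // already there, nothing to create
            }
            if (isNewReplica) {
                return true;  // brand-new replica: any good log directory will do
            }
            // An existing replica we cannot find: only re-create it when every log
            // directory is healthy, otherwise it probably lives on the failed disk.
            return goodLogDirs.size() == allLogDirs.size();
        }

        public static void main(String[] args) {
            List<String> all = List.of("/data1", "/data2");
            List<String> good = List.of("/data1");
            System.out.println(shouldCreate(false, true, good, all));  // true
            System.out.println(shouldCreate(false, false, good, all)); // false
        }
    }

The point of the flag is that a genuinely new replica can be placed on any healthy directory, while a missing existing replica is assumed to live on the failed disk and is left offline.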
I am ok with not persisting the offline replicas in ZK and just discovering them through the LeaderAndIsrRequest. It handles the case where a broker starts up with bad log directories better, so the additional overhead of rediscovering the offline replicas is justified.

Another high-level question: the proposal rejected RAID-5/6 since it adds additional I/Os. The main issue with RAID-5 is that to write a block that doesn't match the RAID stripe size, we have to first read the old parity to compute the new one, which increases the number of I/Os (http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you have tested RAID-5's performance by creating a file system whose block size matches the RAID stripe size (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/). This way, writing a block doesn't require a read first. A large block size may increase the amount of data written when the same block has to be written to disk multiple times. However, this is probably ok in Kafka's use case since we batch the I/O flush already. As you can see, we will be adding some complexity to support JBOD in Kafka one way or another. If we can tune the performance of RAID-5 to match that of RAID-10, perhaps using RAID-5 is a simpler solution.

Thanks,

Jun

On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com> wrote:

Hey Jun,

I don't think we should allow failed replicas to be re-created on the good disks. Say there are two disks and each of them is 51% loaded. If either disk fails and we allow its replicas to be re-created on the other disk, both disks will fail. Alternatively, we can disable replica creation if there is a bad disk on a broker. I personally think it is worth the additional complexity in the broker to store created replicas in ZK, so that we allow new replicas to be created on the broker even when there is a bad log directory. This approach won't add complexity in the controller. But I am fine with disabling replica creation when there is a bad log directory, if that is the only blocking issue for this KIP.

Whether we store created flags is independent of whether/how we store offline replicas. Per our previous discussion, do you think it is OK not to store offline replicas in ZK and to propagate the offline replicas from broker to controller via the LeaderAndIsrRequest?

Thanks,
Dong
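The two-disk example above amounts to a one-line calculation; a tiny sketch with a made-up disk size:

    // Why re-creating replicas from a failed disk onto the surviving one overflows
    // it, using the 51%-loaded example above (the disk size is made up).
    public class DiskOverflowExample {
        public static void main(String[] args) {
            double diskSizeTb = 4.0;
            double usedPerDiskTb = 0.51 * diskSizeTb;  // each disk 51% loaded

            // If disk 1 fails and its replicas are re-created on disk 2:
            double neededOnSurvivorTb = usedPerDiskTb + usedPerDiskTb;
            System.out.printf("Survivor needs %.2f TB but only has %.2f TB%n",
                    neededOnSurvivorTb, diskSizeTb);
            // 102% of one disk's capacity, so the second disk fills up and fails too.
        }
    }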