Unfortunately RAID-5/6 is not typically advised anymore due to failure issues, as Dong mentions; see e.g. http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/
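Roughly, the arithmetic behind that article, assuming the commonly quoted unrecoverable read error (URE) rate of about 1 in 10^14 bits for consumer SATA drives:

    10^14 bits  ~  1.25 x 10^13 bytes  ~  12.5 TB
    expected UREs per rebuild  ~  (bytes read during rebuild) x 8 / 10^14

Rebuilding, say, a 4 x 4 TB RAID-5 set has to read the ~12 TB on the three surviving drives, which gives an expected ~0.96 UREs, and a single unrecoverable error during the rebuild is enough to fail it.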
Eno

> On 27 Feb 2017, at 21:16, Jun Rao <j...@confluent.io> wrote:
>
> Hi, Dong,
>
> For RAID5, I am not sure the rebuild cost is a big concern. If a disk fails, typically an admin has to bring down the broker, replace the failed disk with a new one, trigger the RAID rebuild, and bring up the broker. This way, there is no performance impact at runtime due to the rebuild. The benefit is that a broker doesn't fail in a hard way when there is a disk failure and can be brought down in a controlled way for maintenance. While the broker is running with a failed disk, reads may be more expensive since they have to be computed from the parity. However, if most reads are served from the page cache, this may not be a big issue either. So, it would be useful to do some tests on RAID5 before we completely rule it out.
>
> Regarding whether to remove an offline replica from the fetcher thread immediately: what do we do when a failed replica is a leader? Do we do nothing, or mark the replica as not the leader immediately? Intuitively, it seems better if the broker acts consistently on a failed replica whether it's a leader or a follower. For ISR churn, I was just pointing out that if we don't send StopReplicaRequest to a broker that is being shut down in a controlled way, then the leader will shrink the ISR, expand it, and shrink it again after the timeout.
>
> The KIP seems to still reference "/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".
>
> Thanks,
>
> Jun
>
> On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindon...@gmail.com> wrote:
>
>> Hey Jun,
>>
>> Thanks for the suggestion. I think it is a good idea to not put the created flag in ZK and simply specify isNewReplica=true in LeaderAndIsrRequest if the replica was in the NewReplica state. It will only fail replica creation in the scenario where the controller fails after a topic creation, partition reassignment, or partition-count change but before it actually sends out the LeaderAndIsrRequest while there is an ongoing disk failure, which should be pretty rare and acceptable. This should simplify the design of this KIP.
>>
>> Regarding RAID-5, I think the concern with RAID-5/6 is not just about performance when there is no failure. For example, RAID-5 can tolerate at most one disk failure, and it takes time to rebuild the array after a disk failure. RAID-5 implementations are susceptible to system failures because of trends in array rebuild time and the chance of a drive failure during the rebuild. There is no such performance degradation for JBOD, and JBOD can tolerate multiple log directory failures without reducing the performance of the good log directories. Would this be a reasonable reason for using JBOD instead of RAID-5/6?
>>
>> Previously we discussed whether the broker should remove an offline replica from the replica fetcher thread. I still think it should, instead of printing a lot of errors in the log4j log. We can still let the controller send StopReplicaRequest to the broker. I am not sure I understand why allowing the broker to remove an offline replica from the fetcher thread would increase churn in the ISR. Do you think this is a concern with this approach?
>>
>> I have updated the KIP to remove the created flag from ZK and change the field name to isNewReplica. Can you check if there is any issue with the latest KIP? Thanks for your time!
>>
>> Regards,
>> Dong
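A minimal sketch of the broker-side rule being discussed here, using illustrative names (maybeCreateReplica, allLogDirsOnline, and so on) rather than Kafka's actual internal API:

    // Sketch only: how a broker might act on the proposed isNewReplica flag in a
    // LeaderAndIsrRequest. All names and types are hypothetical.
    public class ReplicaCreationSketch {

        /** Called for each partition carried in a LeaderAndIsrRequest. */
        void maybeCreateReplica(String topicPartition, boolean isNewReplica) {
            if (replicaExists(topicPartition)) {
                return; // already present on some log directory, nothing to do
            }
            if (isNewReplica) {
                // NewReplica -> Online transition: create the replica on any good
                // log directory, even if another directory is offline.
                createOnGoodLogDir(topicPartition);
            } else if (allLogDirsOnline()) {
                // Not a new replica and missing: only create it when every log
                // directory is healthy, since otherwise it may simply live on the
                // offline directory.
                createOnGoodLogDir(topicPartition);
            }
            // Otherwise, leave it alone and treat the replica as offline.
        }

        // Stubs standing in for broker state; real logic omitted.
        boolean replicaExists(String tp) { return false; }
        boolean allLogDirsOnline() { return true; }
        void createOnGoodLogDir(String tp) { /* pick an online dir and create */ }
    }

This matches both Dong's summary above and Jun's description below: only genuinely new replicas bypass the "all log directories must be good" check.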
>> On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io> wrote:
>>
>>> Hi, Dong,
>>>
>>> Thanks for the reply.
>>>
>>> Personally, I'd prefer not to write the created flag per replica in ZK. Your suggestion of disabling replica creation if there is a bad log directory on the broker could work. The only thing is that it may delay the creation of new replicas. I was thinking that an alternative is to extend LeaderAndIsrRequest by adding an isNewReplica field per replica. That field will be set when a replica is transitioning from the NewReplica state to the Online state. Then, when a broker receives a LeaderAndIsrRequest, if a replica is marked as a new replica, it will be created on a good log directory, if not already present. Otherwise, the broker only creates the replica if all log directories are good and the replica is not already present. This way, we don't delay the processing of new replicas in the common case.
>>>
>>> I am ok with not persisting the offline replicas in ZK and just discovering them through the LeaderAndIsrRequest. It handles the case where a broker starts up with bad log directories better, so the additional overhead of rediscovering the offline replicas is justified.
>>>
>>> Another high-level question: the proposal rejected RAID5/6 since it adds additional I/Os. The main issue with RAID5 is that to write a block that doesn't match the RAID stripe size, we have to first read the old parity to compute the new one, which increases the number of I/Os (http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you have tested RAID5's performance by creating a file system whose block size matches the RAID stripe size (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/). This way, writing a block doesn't require a read first. A large block size may increase the amount of data written when the same block has to be written to disk multiple times. However, this is probably ok in Kafka's use case since we batch the I/O flush already. As you can see, we will be adding some complexity to support JBOD in Kafka one way or another. If we can tune the performance of RAID5 to match that of RAID10, perhaps using RAID5 is a simpler solution.
>>>
>>> Thanks,
>>>
>>> Jun
>>>
>>> On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com> wrote:
>>>
>>>> Hey Jun,
>>>>
>>>> I don't think we should allow failed replicas to be re-created on the good disks. Say there are 2 disks and each of them is 51% loaded. If either disk fails and we allow its replicas to be re-created on the other disk, that disk will run out of space and fail as well. Alternatively, we can disable replica creation if there is a bad disk on a broker. I personally think it is worth the additional complexity in the broker to store created replicas in ZK so that we can allow new replicas to be created on the broker even when there is a bad log directory. This approach won't add complexity in the controller. But I am fine with disabling replica creation when there is a bad log directory if that is the only blocking issue for this KIP.
>>>>
>>>> Whether we store created flags is independent of whether/how we store offline replicas. Per our previous discussion, do you think it is OK not to store offline replicas in ZK and to propagate the offline replicas from broker to controller via LeaderAndIsrRequest?
>>>>
>>>> Thanks,
>>>> Dong
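To make the 51% example above concrete, a tiny illustration (the numbers and names are made up for this sketch, this is not Kafka code) of why automatically re-creating a failed disk's replicas on the surviving disk cascades:

    // Two disks, each 51% full. If disk 1 fails and its replicas are re-created
    // on disk 2, disk 2 would need roughly 102% of its capacity and fails too.
    public class JbodCapacityExample {
        public static void main(String[] args) {
            double diskCapacityGb = 1000.0;
            double usedPerDiskGb = 0.51 * diskCapacityGb; // 510 GB on each disk

            double disk2AfterRecreation = usedPerDiskGb + usedPerDiskGb; // 1020 GB
            System.out.printf("Disk 2 would need %.0f GB of its %.0f GB (%.0f%%)%n",
                    disk2AfterRecreation, diskCapacityGb,
                    100 * disk2AfterRecreation / diskCapacityGb);
        }
    }

Hence the position above: either disable replica creation while a log directory is bad, or track created replicas explicitly, rather than silently shifting the failed disk's load onto the remaining ones.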