Thanks Todd for the explanation.

Eno
> On 28 Feb 2017, at 18:15, Todd Palino <tpal...@gmail.com> wrote:
> 
> We have tested RAID 5/6 in the past (and recently) and found it to be
> lacking. So, as noted, rebuild takes more time than RAID 10 because all the
> disks need to be accessed to recalculate parity. In addition, there’s a
> significant performance loss just in normal operations. It’s been a while
> since I ran those tests, but it was in the 30-50% range - nothing to shrug
> off. We didn’t even get to failure testing because of that.
> 
> Jun - to your question, we ran the tests with numerous combinations of
> block sizes and FS parameters. The performance varied, but it was never
> good enough to warrant more than a superficial look at using RAID 5/6. We
> also tested both software RAID and hardware RAID.
> 
> As far as the operational concerns around broker-per-disk and
> broker-per-server, we’ve been talking about this internally. Running one
> broker per disk adds a good bit of administrative overhead and complexity.
> If you perform a one by one rolling bounce of the cluster, you’re talking
> about a 10x increase in time. That means a cluster that restarts in 30
> minutes now takes 5 hours. If you try and optimize this by shutting down
> all the brokers on one host at a time, you can get close to the original
> number, but you now have added operational complexity by having to
> micro-manage the bounce. The broker count increase will percolate down to
> the rest of the administrative domain as well - maintaining ports for all
> the instances, monitoring more instances, managing configs, etc.
> 
> You also have the overhead of running the extra processes - extra heap,
> task switching, etc. We don’t have a problem with page cache really, since
> the VM subsystem is fairly efficient about how it works. But just because
> cache works doesn’t mean we’re not wasting other resources. And that gets
> pushed downstream to clients as well, because they all have to maintain
> more network connections and the resources that go along with it.
> 
> Running more brokers in a cluster also exposes you to more corner cases and
> race conditions within the Kafka code: bugs in the brokers, bugs in the
> controller, and more complexity in balancing load across the cluster (though
> having to balance load across disks within a single JBOD broker takes back
> some of that advantage).
> 
> -Todd
> 
> 
> On Tue, Feb 28, 2017 at 9:40 AM, Jun Rao <j...@confluent.io> wrote:
> 
>> Hi, Dong,
>> 
>> RAID6 is an improvement over RAID5 and can tolerate two disk failures. Eno's
>> point is that the rebuild of RAID5/RAID6 requires reading more data
>> compared with RAID10, which increases the probability of error during
>> rebuild. This makes sense. In any case, do you think you could ask the SREs
>> at LinkedIn to share their opinions on RAID5/RAID6?
>> 
>> Yes, when a replica is offline due to a bad disk, it makes sense to handle
>> it immediately as if a StopReplicaRequest is received (i.e., replica is no
>> longer considered a leader and is removed from any replica fetcher thread).
>> Could you add that detail to item 2 in the wiki?
>> 
>> 50. The wiki says "Broker assumes a log directory to be good after it
>> starts": a log directory actually could be bad during startup.
>> 
>> 51. In item 4, the wiki says "The controller watches the path
>> /log_dir_event_notification for new znode." This doesn't seem to be needed
>> now?
>> 
>> 52. The isNewReplica field in LeaderAndIsrRequest should be for each
>> replica inside the replicas field, right?
>> 
>> Other than those, the current KIP looks good to me. Do you want to start a
>> separate discussion thread on KIP-113? I do have some comments there.
>> 
>> Thanks for working on this!
>> 
>> Jun
>> 
>> 
>> On Mon, Feb 27, 2017 at 5:51 PM, Dong Lin <lindon...@gmail.com> wrote:
>> 
>>> Hi Jun,
>>> 
>>> In addition to Eno's reference on why rebuild time with RAID-5 is more
>>> expensive, another concern is that RAID-5 fails entirely if more than one
>>> disk fails. JBOD still works with one or more disk failures and has better
>>> performance with one disk failure. These seem like good arguments for using
>>> JBOD instead of RAID-5.
>>> 
>>> If a leader replica goes offline, the broker should first take all actions
>>> (i.e. remove the partition from the fetcher thread) as if it had received a
>>> StopReplicaRequest for this partition, because the replica can no longer
>>> work anyway. It will also respond with an error to any ProduceRequest and
>>> FetchRequest for the partition. The broker notifies the controller by
>>> writing a notification znode in ZK. The controller learns of the disk
>>> failure event from ZK, sends a LeaderAndIsrRequest and learns from the
>>> LeaderAndIsrResponse that the replica is offline. The controller will then
>>> elect a new leader for this partition and send
>>> LeaderAndIsrRequest/MetadataUpdateRequest to the relevant brokers. The
>>> broker should stop adjusting the ISR for this partition as if the broker
>>> were already offline. I am not sure there is any inconsistency in the
>>> broker's behavior whether it is the leader or a follower. Is there any
>>> concern with this approach?
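>>>
>>> To make the broker-side handling above concrete, here is a rough Java-style
>>> sketch. All names (LogDirFailureHandler, ReplicaFetcherManager,
>>> ControllerNotifier) are made up for illustration and are not the actual
>>> Kafka internals:
>>>
>>> import java.util.Set;
>>>
>>> // Rough sketch only; the types below are hypothetical, not real Kafka classes.
>>> interface ReplicaFetcherManager {
>>>     void removeFetcherForPartition(String topicPartition);
>>> }
>>>
>>> interface ControllerNotifier {
>>>     // e.g. write a sequential znode under /log_dir_event_notification
>>>     void notifyLogDirFailure(int brokerId);
>>> }
>>>
>>> class LogDirFailureHandler {
>>>     private final int brokerId;
>>>     private final ReplicaFetcherManager fetcherManager;
>>>     private final ControllerNotifier notifier;
>>>     private final Set<String> offlinePartitions;
>>>
>>>     LogDirFailureHandler(int brokerId, ReplicaFetcherManager fetcherManager,
>>>                          ControllerNotifier notifier, Set<String> offlinePartitions) {
>>>         this.brokerId = brokerId;
>>>         this.fetcherManager = fetcherManager;
>>>         this.notifier = notifier;
>>>         this.offlinePartitions = offlinePartitions;
>>>     }
>>>
>>>     // Called when a log directory is detected as bad.
>>>     void onLogDirFailure(Set<String> affectedPartitions) {
>>>         for (String tp : affectedPartitions) {
>>>             // Behave as if a StopReplicaRequest had been received: stop the
>>>             // fetcher and stop adjusting the ISR for this partition.
>>>             fetcherManager.removeFetcherForPartition(tp);
>>>             // Subsequent ProduceRequest/FetchRequest for it get an error response.
>>>             offlinePartitions.add(tp);
>>>         }
>>>         // Tell the controller via ZK; it then sends LeaderAndIsrRequest, learns
>>>         // from the response that the replicas are offline, and elects new leaders.
>>>         notifier.notifyLogDirFailure(brokerId);
>>>     }
>>> }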
>>> 
>>> Thanks for catching this. I have removed that reference from the KIP.
>>> 
>>> Hi Eno,
>>> 
>>> Thank you for providing the RAID-5 reference. At LinkedIn we have 10 disks
>>> per Kafka machine. It will not be a show-stopper operationally for LinkedIn
>>> if we have to deploy one-broker-per-disk. On the other hand, we previously
>>> discussed the advantages of JBOD vs. one-broker-per-disk and
>>> one-broker-per-machine. One-broker-per-disk suffers from the problems
>>> described in the KIP, and one-broker-per-machine increases the broker
>>> failure rate caused by disk failure by 10x. Since JBOD is strictly better
>>> than either of the two, it is also better than one-broker-per-multiple-disks,
>>> which is somewhere between one-broker-per-disk and one-broker-per-machine.
>>> 
>>> I personally think the benefits of the JBOD design are worth the
>>> implementation complexity it introduces. I would also argue that it is
>>> reasonable for Kafka to manage this low-level detail because Kafka already
>>> exposes and manages the replication factor of its data. But whether the
>>> complexity is worthwhile can be subjective and I cannot prove my opinion. I
>>> am contributing a significant amount of time to this KIP because the Kafka
>>> developers at LinkedIn believe it is useful and worth the effort. Still, it
>>> will be useful to see what everyone else thinks about it.
>>> 
>>> 
>>> Thanks,
>>> Dong
>>> 
>>> 
>>> On Mon, Feb 27, 2017 at 1:16 PM, Jun Rao <j...@confluent.io> wrote:
>>> 
>>>> Hi, Dong,
>>>> 
>>>> For RAID5, I am not sure the rebuild cost is a big concern. If a disk
>>>> fails, typically an admin has to bring down the broker, replace the failed
>>>> disk with a new one, trigger the RAID rebuild, and bring up the broker.
>>>> This way, there is no performance impact at runtime due to rebuild. The
>>>> benefit is that a broker doesn't fail in a hard way when there is a disk
>>>> failure and can be brought down in a controlled way for maintenance. While
>>>> the broker is running with a failed disk, reads may be more expensive since
>>>> they have to be computed from the parity. However, if most reads are from
>>>> page cache, this may not be a big issue either. So, it would be useful to
>>>> do some tests on RAID5 before we completely rule it out.
>>>> 
>>>> Regarding whether to remove an offline replica from the fetcher thread
>>>> immediately: what do we do when a failed replica is a leader? Do we do
>>>> nothing or mark the replica as not the leader immediately? Intuitively, it
>>>> seems better if the broker acts consistently on a failed replica whether
>>>> it's a leader or a follower. For ISR churn, I was just pointing out that if
>>>> we don't send StopReplicaRequest to a broker to be shut down in a
>>>> controlled way, then the leader will shrink the ISR, expand it and shrink
>>>> it again after the timeout.
>>>> 
>>>> The KIP seems to still reference
>>>> "/broker/topics/[topic]/partitions/[partitionId]/controller_managed_state".
>>>> 
>>>> Thanks,
>>>> 
>>>> Jun
>>>> 
>>>> On Sat, Feb 25, 2017 at 7:49 PM, Dong Lin <lindon...@gmail.com> wrote:
>>>> 
>>>>> Hey Jun,
>>>>> 
>>>>> Thanks for the suggestion. I think it is a good idea to not put the
>>>>> created flag in ZK and simply specify isNewReplica=true in the
>>>>> LeaderAndIsrRequest if the replica was in the NewReplica state. It will
>>>>> only fail the replica creation in the scenario where the controller fails
>>>>> after topic creation/partition reassignment/partition-number change but
>>>>> before it actually sends out the LeaderAndIsrRequest while there is an
>>>>> ongoing disk failure, which should be pretty rare and acceptable. This
>>>>> should simplify the design of this KIP.
>>>>> 
>>>>> Regarding RAID-5, I think the concern with RAID-5/6 is not just about
>>>>> performance when there is no failure. For example, RAID-5 can tolerate at
>>>>> most one disk failure, and it takes time to rebuild the array after a disk
>>>>> failure. RAID-5 implementations are susceptible to system failures because
>>>>> of trends regarding array rebuild time and the chance of drive failure
>>>>> during rebuild. There is no such performance degradation for JBOD, and
>>>>> JBOD can support multiple log directory failures without reducing the
>>>>> performance of the good log directories. Would this be a reasonable reason
>>>>> for using JBOD instead of RAID-5/6?
>>>>> 
>>>>> Previously we discussed whether the broker should remove an offline
>>>>> replica from the replica fetcher thread. I still think it should do so
>>>>> instead of printing a lot of errors in the log4j log. We can still let the
>>>>> controller send StopReplicaRequest to the broker. I am not sure I
>>>>> understand why allowing the broker to remove an offline replica from the
>>>>> fetcher thread would increase churn in the ISR. Do you see a concern with
>>>>> this approach?
>>>>> 
>>>>> I have updated the KIP to remove the created flag from ZK and change the
>>>>> field name to isNewReplica. Can you check if there is any issue with the
>>>>> latest KIP? Thanks for your time!
>>>>> 
>>>>> Regards,
>>>>> Dong
>>>>> 
>>>>> 
>>>>> On Sat, Feb 25, 2017 at 9:11 AM, Jun Rao <j...@confluent.io> wrote:
>>>>> 
>>>>>> Hi, Dong,
>>>>>> 
>>>>>> Thanks for the reply.
>>>>>> 
>>>>>> Personally, I'd prefer not to write the created flag per replica in ZK.
>>>>>> Your suggestion of disabling replica creation if there is a bad log
>>>>>> directory on the broker could work. The only thing is that it may delay
>>>>>> the creation of new replicas. I was thinking that an alternative is to
>>>>>> extend LeaderAndIsrRequest by adding an isNewReplica field per replica.
>>>>>> That field will be set when a replica is transitioning from the
>>>>>> NewReplica state to the Online state. Then, when a broker receives a
>>>>>> LeaderAndIsrRequest, if a replica is marked as a new replica, it will be
>>>>>> created on a good log directory, if not already present. Otherwise, it
>>>>>> only creates the replica if all log directories are good and the replica
>>>>>> is not already present. This way, we don't delay the processing of new
>>>>>> replicas in the common case.
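>>>>>>
>>>>>> In rough pseudo-Java, the decision I have in mind is something like the
>>>>>> following (the class, method and parameter names are made up for
>>>>>> illustration; this is not meant to be the actual implementation):
>>>>>>
>>>>>> // Illustrative only: how a broker could decide whether to create a replica
>>>>>> // when processing a LeaderAndIsrRequest with a per-replica isNewReplica flag.
>>>>>> class ReplicaCreationDecision {
>>>>>>     static boolean shouldCreateLocalReplica(boolean isNewReplica,
>>>>>>                                             boolean replicaAlreadyPresent,
>>>>>>                                             boolean allLogDirsGood,
>>>>>>                                             boolean anyLogDirGood) {
>>>>>>         if (replicaAlreadyPresent) {
>>>>>>             return false; // nothing to create
>>>>>>         }
>>>>>>         if (isNewReplica) {
>>>>>>             // Transitioning from NewReplica to Online: create the replica on
>>>>>>             // any good log directory, even if another log directory is bad.
>>>>>>             return anyLogDirGood;
>>>>>>         }
>>>>>>         // Not a new replica: only create it when all log directories are good,
>>>>>>         // so we don't re-create a replica whose data sits on the bad disk.
>>>>>>         return allLogDirsGood;
>>>>>>     }
>>>>>> }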
>>>>>> 
>>>>>> I am ok with not persisting the offline replicas in ZK and just
>>>>>> discovering them through the LeaderAndIsrRequest. It handles the cases
>>>>>> when a broker starts up with bad log directories better. So, the
>>>>>> additional overhead of rediscovering the offline replicas is justified.
>>>>>> 
>>>>>> 
>>>>>> Another high-level question: the proposal rejected RAID5/6 since it adds
>>>>>> additional I/Os. The main issue with RAID5 is that to write a block that
>>>>>> doesn't match the RAID stripe size, we have to first read the old parity
>>>>>> to compute the new one, which increases the number of I/Os
>>>>>> (http://rickardnobel.se/raid-5-write-penalty/). I am wondering if you
>>>>>> have tested RAID5's performance by creating a file system whose block
>>>>>> size matches the RAID stripe size
>>>>>> (https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/).
>>>>>> This way, writing a block doesn't require a read first. A large block
>>>>>> size may increase the amount of data written when the same block has to
>>>>>> be written to disk multiple times. However, this is probably ok in
>>>>>> Kafka's use case since we batch the I/O flush already. As you can see, we
>>>>>> will be adding some complexity to support JBOD in Kafka one way or
>>>>>> another. If we can tune the performance of RAID5 to match that of RAID10,
>>>>>> perhaps using RAID5 is a simpler solution.
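>>>>>>
>>>>>> For reference, the rough I/O accounting behind that write penalty
>>>>>> (standard RAID-5 arithmetic, not numbers from any test here) is: a
>>>>>> sub-stripe write costs read old data + read old parity + write new data
>>>>>> + write new parity, i.e. 4 I/Os per logical write, while a full-stripe
>>>>>> write is just the (N-1) data block writes plus 1 parity write with no
>>>>>> reads. That is why aligning the file system block size with the stripe
>>>>>> size avoids the read-modify-write path.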
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Jun
>>>>>> 
>>>>>> 
>>>>>> On Fri, Feb 24, 2017 at 10:17 AM, Dong Lin <lindon...@gmail.com>
>>>> wrote:
>>>>>> 
>>>>>>> Hey Jun,
>>>>>>> 
>>>>>>> I don't think we should allow failed replicas to be re-created on the
>>>>>>> good disks. Say there are 2 disks and each of them is 51% loaded. If
>>>>>>> either disk fails and we allow its replicas to be re-created on the
>>>>>>> other disk, the surviving disk would need more than 100% of its
>>>>>>> capacity, so both disks will end up failing. Alternatively we can
>>>>>>> disable replica creation if there is a bad disk on a broker. I
>>>>>>> personally think it is worth the additional complexity in the broker to
>>>>>>> store created replicas in ZK so that we can allow new replicas to be
>>>>>>> created on the broker even when there is a bad log directory. This
>>>>>>> approach won't add complexity in the controller. But I am fine with
>>>>>>> disabling replica creation when there is a bad log directory, if that
>>>>>>> is the only blocking issue for this KIP.
>>>>>>> 
>>>>>>> Whether we store created flags is independent of whether/how we store
>>>>>>> offline replicas. Per our previous discussion, do you think it is OK to
>>>>>>> not store offline replicas in ZK and to propagate the offline replicas
>>>>>>> from the broker to the controller via the LeaderAndIsrRequest?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Dong
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> 
> 
> -- 
> *Todd Palino*
> Staff Site Reliability Engineer
> Data Infrastructure Streaming
> 
> 
> 
> linkedin.com/in/toddpalino
