Hey Jay, Sriram,

Great point. If I understand you right, you are suggesting that we can simply
use RAID-0 so that the load can be evenly distributed across disks, and that
even though a disk failure will bring down the entire broker, the reduced
availability compared to using KIP-112 and KIP-113 may be negligible. In other
words, it may be better to just accept the slightly reduced availability
instead of introducing the complexity of KIP-112 and KIP-113.
Let's assume the following:

- There are 30 brokers in a cluster and each broker has 10 disks.
- The replication factor is 3 and min.isr = 2.
- The annual disk failure rate is 2%, according to this
<https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/> blog.
- It takes 3 days to replace a disk.

Here is my calculation for the probability of data loss due to disk failure:

probability that a given disk fails in a given year: 2%
probability that a given disk is offline on a given day: 2% / 365 * 3
probability that a given broker is offline on a given day due to disk
failure: 2% / 365 * 3 * 10
probability that some broker is offline on a given day due to disk
failure: 2% / 365 * 3 * 10 * 30 = 5%
probability that any three brokers are offline on a given day due to disk
failure: 5% * 5% * 5% = 0.0125%
probability of data loss due to disk failure: 0.0125%

Here is my calculation for the probability of service unavailability due to
disk failure:

probability that a given disk fails in a given year: 2%
probability that a given disk is offline on a given day: 2% / 365 * 3
probability that a given broker is offline on a given day due to disk
failure: 2% / 365 * 3 * 10
probability that some broker is offline on a given day due to disk
failure: 2% / 365 * 3 * 10 * 30 = 5%
probability that any two brokers are offline on a given day due to disk
failure: 5% * 5% = 0.25%
probability of unavailability due to disk failure: 0.25%

Note that the unavailability due to disk failure would be unacceptably high in
this case, and the probability of data loss due to disk failure would be
higher than 0.01%. Neither is acceptable if Kafka is intended to achieve
four-nines availability.
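In case it helps to sanity-check the arithmetic, here is a small Python sketch
that reproduces the estimate above. It uses the same rough model as the
numbers above (independent failures, simply multiplying the daily per-broker
outage probabilities), and the variable names are just for illustration:

# Back-of-the-envelope estimate of data loss / unavailability due to disk
# failure, under the assumptions above: 30 brokers, 10 disks per broker,
# 2% annual disk failure rate, 3 days to replace a disk. Failures are
# treated as independent, matching the rough calculation in the text.

brokers = 30
disks_per_broker = 10
annual_disk_failure_rate = 0.02
days_to_replace = 3

# Probability that a given disk is offline on a given day.
p_disk_down = annual_disk_failure_rate / 365 * days_to_replace

# Probability that a given broker is down on a given day due to disk failure.
p_broker_down = p_disk_down * disks_per_broker

# Probability that some broker in the cluster is down on a given day.
p_any_broker_down = p_broker_down * brokers          # ~5%

# With replication factor 3, losing 3 brokers can mean data loss;
# with min.isr = 2, losing 2 brokers can mean unavailability.
p_data_loss = p_any_broker_down ** 3                 # ~0.0125%
p_unavailable = p_any_broker_down ** 2               # ~0.25%

print(f"P(some broker down on a given day): {p_any_broker_down:.2%}")
print(f"P(data loss, 3 brokers down):       {p_data_loss:.4%}")
print(f"P(unavailability, 2 brokers down):  {p_unavailable:.4%}")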
Thanks,
Dong

On Tue, Jun 6, 2017 at 11:26 PM, Jay Kreps <j...@confluent.io> wrote:

> I think Ram's point is that in-place failure handling is pretty complicated,
> and since this is meant to be a cost-saving feature, we should construct an
> argument for it grounded in data.
>
> Assume an annual failure rate of 1% (reasonable, but data is available
> online), and assume it takes 3 days to get the drive replaced. Say you have
> 10 drives per server. Then the expected downtime for each server is roughly
> 1% * 3 days * 10 = 0.3 days/year (this is slightly off since I'm ignoring
> the case of multiple failures, but I don't know that changes it much). So
> the savings from this feature is 0.3/365 = 0.08%. Say you have 1000 servers
> and they cost $3000/year fully loaded including power, the cost of the hw
> amortized over its life, etc. Then this feature saves you $3000 on your
> total server cost of $3m, which seems not very worthwhile compared to other
> optimizations...?
>
> Anyhow, not sure the arithmetic is right there, but I think that is the
> type of argument that would be helpful to think about the tradeoff in
> complexity.
>
> -Jay
>
> On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin <lindon...@gmail.com> wrote:
>
> > Hey Sriram,
> >
> > Thanks for taking time to review the KIP. Please see below my answers to
> > your questions:
> >
> > >1. Could you pick a hardware/Kafka configuration and go over what is the
> > >average disk/partition repair/restore time that we are targeting for a
> > >typical JBOD setup?
> >
> > We currently don't have this data. I think the disk/partition
> > repair/restore time depends on the availability of hardware, the response
> > time of the site-reliability engineers, the amount of data on the bad
> > disk, etc. These vary between companies and even clusters within the same
> > company, and it is probably hard to determine what the average situation
> > is.
> >
> > I am not very sure why we need this. Can you explain a bit why this data
> > is useful to evaluate the motivation and design of this KIP?
> >
> > >2. How often do we believe disks are going to fail (in your example
> > >configuration) and what do we gain by avoiding the network overhead and
> > >doing all the work of moving the replica within the broker to another
> > >disk instead of balancing it globally?
> >
> > I think the chance of disk failure depends mainly on the disk itself
> > rather than the broker configuration. I don't have this data now. I will
> > ask our SREs whether they know the mean time to failure for our disks.
> > What I was told by our SREs is that disk failure is the most common type
> > of hardware failure.
> >
> > When there is a disk failure, I think it is reasonable to move the
> > replica to another broker instead of another disk on the same broker. The
> > reason we want to move replicas within a broker is mainly to optimize
> > Kafka cluster performance when we balance load across disks.
> >
> > In comparison to balancing replicas globally, the benefits of moving
> > replicas within a broker are:
> >
> > 1) The movement is faster since it doesn't go through a socket or rely on
> > the available network bandwidth.
> > 2) There is much less impact on the replication traffic between brokers,
> > since it doesn't take up bandwidth between brokers. Depending on the
> > traffic pattern, we may need to balance load across disks frequently, and
> > it is necessary to prevent this operation from slowing down the existing
> > operations (e.g. produce, consume, replication) in the Kafka cluster.
> > 3) It gives us the opportunity to do automatic rebalancing between disks
> > on the same broker.
> >
> > >3. Even if we had to move the replica within the broker, why cannot we
> > >just treat it as another replica and have it go through the same
> > >replication code path that we have today? The downside here is obviously
> > >that you need to catch up from the leader but it is completely free!
> > >What do we think is the impact of the network overhead in this case?
> >
> > Good point. My initial proposal actually used the existing
> > ReplicaFetcherThread (i.e. the existing code path) to move replicas
> > between disks. However, I switched to using a separate thread pool after
> > discussion with Jun and Becket.
> >
> > The main argument for using a separate thread pool is to keep the design
> > simple and easy to reason about. There are a number of differences
> > between inter-broker replication and intra-broker replication that make
> > it cleaner to do them in separate code paths. I will list them below:
> >
> > - The throttling mechanism for inter-broker replication traffic and
> > intra-broker replication traffic is different. For example, we may want
> > to specify per-topic quotas for inter-broker replication traffic because
> > we may want some topics to be moved faster than others. But we don't care
> > about the priority of topics for intra-broker movement. So the current
> > proposal only allows the user to specify a per-broker quota for
> > intra-broker replication traffic.
> >
> > - The quota value for inter-broker replication traffic and intra-broker
> > replication traffic is different.
> > The available bandwidth for intra-broker replication can probably be much
> > higher than the bandwidth for inter-broker replication.
> >
> > - The ReplicaFetcherThread is per broker. Intuitively, the number of
> > threads doing intra-broker data movement should be related to the number
> > of disks in the broker, not the number of brokers in the cluster.
> >
> > - The leader replica has no ReplicaFetcherThread to start with. It seems
> > weird to start one just for intra-broker replication.
> >
> > Because of these differences, we think it is simpler to use a separate
> > thread pool and code path so that we can configure and throttle them
> > separately.
> >
> > >4. What are the chances that we will be able to identify another disk to
> > >balance within the broker instead of another disk on another broker? If
> > >we have 100's of machines, the probability of finding a better balance
> > >by choosing another broker is much higher than balancing within the
> > >broker. Could you add some info on how we are determining this?
> >
> > It is possible that we can find available space on a remote broker. The
> > benefit of allowing intra-broker replication is that, when there is
> > available space on both the current broker and a remote broker, the
> > rebalance can be completed faster with much less impact on the
> > inter-broker replication or the users' traffic. It is about taking
> > advantage of locality when balancing the load.
> >
> > >5. Finally, in a cloud setup where more users are going to leverage a
> > >shared filesystem (example, EBS in AWS), all this change is not of much
> > >gain since you don't need to balance between the volumes within the same
> > >broker.
> >
> > You are right. KIP-113 is useful only if the user uses JBOD. If the user
> > uses an extra storage layer of replication, such as RAID-10 or EBS, they
> > don't need KIP-112 or KIP-113. Note that the user will replicate data
> > more times than the replication factor of the Kafka topic if an extra
> > storage layer of replication is used.
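P.S. For comparison with the cost estimate in Jay's quoted message above, here
is the same style of sketch with the numbers from his message (1% annual drive
failure rate, 3 days to replace, 10 drives per server, 1000 servers at
$3000/year fully loaded). With those inputs it works out to roughly $2,500 per
year of avoided downtime cost, i.e. the same order of magnitude as the ~$3,000
figure he mentions:

# Rough savings estimate from the quoted message: how much server cost does
# tolerating single-disk failures (instead of losing the whole broker) save?
# All figures below are the ones stated in that message.

annual_drive_failure_rate = 0.01
days_to_replace = 3
drives_per_server = 10
servers = 1000
cost_per_server_per_year = 3000  # USD, fully loaded

# Expected downtime per server per year if any single drive failure takes
# the whole server down (ignoring overlapping failures, as in the message).
downtime_days_per_server = (annual_drive_failure_rate * days_to_replace
                            * drives_per_server)

# Fraction of server capacity lost, and the corresponding dollar value.
lost_fraction = downtime_days_per_server / 365
savings_per_year = lost_fraction * servers * cost_per_server_per_year

print(f"Expected downtime per server: {downtime_days_per_server:.2f} days/year")
print(f"Capacity lost: {lost_fraction:.3%}")
print(f"Estimated savings: ${savings_per_year:,.0f}/year out of "
      f"${servers * cost_per_server_per_year:,}/year total")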