All:

Solid state drives
Depreciation - straight-line vs accelerated
Cost of capital - present value vs future value of the dollar
Outright buy or loan from bank (interest rate of the loan)
Frequency of SATA drive failure
Cost of the server farm - office space, raised floors
Energy consumption of the server farm
Cost of racks (42U), networking - 10 GigE vs 1 GigE Ethernet
Primary data center vs disaster recovery (DR)
On-prem vs Google Cloud, Microsoft Azure, or Amazon AWS
Cost of labor on-prem - salaries of DevOps personnel

- M

-----Original Message-----
From: Jay Kreps [mailto:j...@confluent.io]
Sent: Wednesday, June 07, 2017 2:27 AM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-113: Support replicas movement between log directories

I think Ram's point is that handling in-place failure is pretty complicated,
and since this is meant to be a cost-saving feature, we should construct an
argument for it grounded in data.

Assume an annual failure rate of 1% (reasonable, but data is available
online), and assume it takes 3 days to get the drive replaced. Say you have
10 drives per server. Then the expected downtime for each server is roughly
1% * 3 days * 10 = 0.3 days/year (this is slightly off since I'm ignoring the
case of multiple failures, but I don't think that changes it much). So the
savings from this feature is 0.3/365 = 0.08%. Say you have 1000 servers and
they cost $3000/year fully loaded, including power, the cost of the hardware
amortized over its life, etc. Then this feature saves you about $3000 on your
total server cost of $3M, which seems not very worthwhile compared to other
optimizations...?

Anyhow, not sure the arithmetic is right there, but I think that is the type
of argument that would be helpful for thinking about the tradeoff in
complexity.

-Jay

On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin <lindon...@gmail.com> wrote:

> Hey Sriram,
>
> Thanks for taking the time to review the KIP. Please see below my answers
> to your questions:
>
> > 1. Could you pick a hardware/Kafka configuration and go over what is
> > the average disk/partition repair/restore time that we are targeting
> > for a typical JBOD setup?
>
> We currently don't have this data. I think the disk/partition
> repair/restore time depends on the availability of hardware, the response
> time of the site-reliability engineers, the amount of data on the bad
> disk, etc. These vary between companies, and even between clusters within
> the same company, so it is probably hard to determine what the average
> situation is.
>
> I am not sure why we need this. Can you explain a bit why this data is
> useful for evaluating the motivation and design of this KIP?
>
> > 2. How often do we believe disks are going to fail (in your example
> > configuration) and what do we gain by avoiding the network overhead
> > and doing all the work of moving the replica within the broker to
> > another disk instead of balancing it globally?
>
> I think the chance of disk failure depends mainly on the disk itself
> rather than on the broker configuration. I don't have this data now. I
> will ask our SREs whether they know the mean time to failure for our
> disks. What I was told by our SREs is that disk failure is the most common
> type of hardware failure.
>
> When there is a disk failure, I think it is reasonable to move the replica
> to another broker instead of to another disk on the same broker. The
> reason we want to move replicas within a broker is mainly to optimize
> Kafka cluster performance when we balance load across disks.
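(Stepping back to Jay's back-of-envelope estimate above for a moment: the
following is a minimal sketch of that arithmetic, using only the figures he
assumes in his message - 1% annual drive failure rate, 3-day replacement,
10 drives per server, 1000 servers at $3000/year fully loaded. The numbers
are illustrative assumptions, not measurements.)

# Back-of-envelope cost model for JBOD disk-failure downtime. All inputs are
# the assumed figures from Jay's email above, not measured data.
annual_failure_rate = 0.01     # probability a given drive fails in a year
replacement_days    = 3        # days until a failed drive is replaced
drives_per_server   = 10
servers             = 1000
cost_per_server     = 3000.0   # $/year, fully loaded (power, amortized hw, ...)

# Expected downtime per server per year, ignoring overlapping failures.
downtime_days = annual_failure_rate * replacement_days * drives_per_server

# Fraction of server capacity lost, and its fleet-wide dollar value.
lost_fraction = downtime_days / 365.0
savings = lost_fraction * servers * cost_per_server

print(f"expected downtime per server: {downtime_days:.2f} days/year")
print(f"capacity lost:                {lost_fraction:.3%}")
print(f"upper bound on savings:       ${savings:,.0f} of ${servers * cost_per_server:,.0f}/year")
# -> roughly 0.3 days/year, ~0.08%, and ~$2,500 out of $3,000,000/year.

Even doubling any single input keeps the figure in the low thousands of
dollars per year, which is the point of the argument.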
>
> In comparison to balancing replicas globally, the benefits of moving
> replicas within the broker are that:
>
> 1) The movement is faster, since it doesn't go through a socket or rely
> on the available network bandwidth.
> 2) There is much less impact on the replication traffic between brokers,
> since it doesn't take up inter-broker bandwidth. Depending on the traffic
> pattern, we may need to balance load across disks frequently, and it is
> necessary to prevent this operation from slowing down the existing
> operations (e.g. produce, consume, replication) in the Kafka cluster.
> 3) It gives us the opportunity to do automatic rebalancing between disks
> on the same broker.
>
> > 3. Even if we had to move the replica within the broker, why can't we
> > just treat it as another replica and have it go through the same
> > replication code path that we have today? The downside here is
> > obviously that you need to catch up from the leader, but it is
> > completely free! What do we think is the impact of the network overhead
> > in this case?
>
> Good point. My initial proposal actually used the existing
> ReplicaFetcherThread (i.e. the existing code path) to move replicas
> between disks. However, I switched to using a separate thread pool after
> discussion with Jun and Becket.
>
> The main argument for using a separate thread pool is to keep the design
> simple and easy to reason about. There are a number of differences
> between inter-broker replication and intra-broker replication which make
> it cleaner to do them in separate code paths. I will list them below:
>
> - The throttling mechanism for inter-broker replication traffic and
> intra-broker replication traffic is different. For example, we may want
> to specify a per-topic quota for inter-broker replication traffic because
> we may want some topics to be moved faster than others. But we don't care
> about the priority of topics for intra-broker movement. So the current
> proposal only allows the user to specify a per-broker quota for
> intra-broker replication traffic.
>
> - The quota value for inter-broker replication traffic and intra-broker
> replication traffic is different. The available bandwidth for
> intra-broker movement (a local disk-to-disk copy) can probably be much
> higher than the bandwidth we can afford for inter-broker replication.
>
> - ReplicaFetcherThreads are created per broker, so their number scales
> with the number of brokers. Intuitively, the number of threads doing
> intra-broker data movement should be related to the number of disks in
> the broker, not the number of brokers in the cluster.
>
> - The leader replica has no ReplicaFetcherThread to start with. It seems
> weird to start one just for intra-broker replication.
>
> Because of these differences, we think it is simpler to use a separate
> thread pool and code path so that we can configure and throttle them
> separately.
>
> > 4. What are the chances that we will be able to identify another disk
> > to balance within the broker instead of another disk on another broker?
> > If we have 100s of machines, the probability of finding a better
> > balance by choosing another broker is much higher than by balancing
> > within the broker. Could you add some info on how we are determining
> > this?
>
> It is possible that we can find available space on a remote broker. The
> benefit of allowing intra-broker replication is that, when there is
> available space on both the current broker and a remote broker, the
> rebalance can be completed faster and with much less impact on
> inter-broker replication and user traffic. It is about taking advantage
> of locality when balancing the load.
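(To make the speed/impact argument above concrete, a small sketch with
purely illustrative numbers - replica size, disk copy rate, and the share
of network allowed for replica movement are assumptions, not measurements
from any cluster.)

# Illustrative comparison: moving one replica disk-to-disk within a broker
# vs moving it over the network to another broker. All figures are assumed
# for the sake of the example.
replica_size_gb     = 500
disk_copy_mb_per_s  = 150   # sequential copy between two local disks
move_quota_mb_per_s = 30    # share of the NIC we allow replica movement,
                            # so produce/consume/replication aren't starved

def hours(size_gb, rate_mb_per_s):
    return size_gb * 1024 / rate_mb_per_s / 3600

print(f"intra-broker move: {hours(replica_size_gb, disk_copy_mb_per_s):.1f} h")
print(f"inter-broker move: {hours(replica_size_gb, move_quota_mb_per_s):.1f} h")
# -> roughly 0.9 h vs 4.7 h with these assumptions; the intra-broker copy
#    also leaves the network free for regular replication traffic.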
>
> > 5. Finally, in a cloud setup where more users are going to leverage a
> > shared filesystem (for example, EBS in AWS), all this change is not of
> > much gain since you don't need to balance between the volumes within
> > the same broker.
>
> You are right. KIP-113 is useful only if the user uses JBOD. If the user
> uses an extra storage layer of replication, such as RAID-10 or EBS, they
> don't need KIP-112 or KIP-113. Note that the user will replicate data
> more times than the replication factor of the Kafka topic if an extra
> storage layer of replication is used.
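(As a footnote to that last point, a minimal sketch of the storage
amplification being described: RAID-10's 2x mirroring is standard, while
the factor shown for EBS is an assumed placeholder, since AWS does not
publish how many internal copies a volume keeps.)

# Physical copies of each byte = Kafka replication factor x storage-layer
# replication factor.
kafka_rf = 3
storage_factor = {"JBOD": 1, "RAID-10": 2, "EBS (assumed)": 2}

for layer, factor in storage_factor.items():
    print(f"{layer:14s}: {kafka_rf * factor} physical copies per byte")
# -> JBOD stores exactly the 3 copies the topic asks for; the other setups
#    store more, which is the extra cost being pointed out above.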