All:

Solid state drives
Depreciation - straight-line vs accelerated
Cost of capital - present value vs future value of the dollar
Outright buy or loan from bank (interest rate of the loan)
Frequency of SATA drive failure
Cost of the server farm - office space, raised floors
Energy consumption of the server farm
Cost of racks (42U), networking - 10 GigE vs 1 GigE Ethernet
Primary data center vs disaster recovery (DR)
On-prem vs Google Cloud, Microsoft Azure, or Amazon AWS
Cost of labor on-prem - salaries of DevOps personnel

- M

-----Original Message-----
From: Jay Kreps [mailto:j...@confluent.io]
Sent: Wednesday, June 07, 2017 2:27 AM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-113: Support replicas movement between log directories

I think Ram's point is that handling in-place failure is pretty complicated,
and since this is meant to be a cost-saving feature, we should construct an
argument for it grounded in data.

Assume an annual failure rate of 1% (reasonable, but data is available
online), and assume it takes 3 days to get the drive replaced. Say you have
10 drives per server. Then the expected downtime for each server is roughly
1% * 3 days * 10 = 0.3 days/year (this is slightly off since I'm ignoring the
case of multiple failures, but I don't think that changes it much). So the
savings from this feature is 0.3/365 = 0.08%. Say you have 1000 servers and
they cost $3000/year fully loaded, including power, the cost of the hardware
amortized over its life, etc. Then this feature saves you about $3000 on your
total server cost of $3M, which seems not very worthwhile compared to other
optimizations...?

Anyhow, not sure the arithmetic is right there, but I think that is the type
of argument that would be helpful for thinking about the tradeoff in
complexity.

-Jay

On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin <lindon...@gmail.com> wrote:

> Hey Sriram,
>
> Thanks for taking the time to review the KIP. Please see below my answers
> to your questions:
>
> > 1. Could you pick a hardware/Kafka configuration and go over what is
> > the average disk/partition repair/restore time that we are targeting
> > for a typical JBOD setup?
>
> We currently don't have this data. I think the disk/partition
> repair/restore time depends on the availability of hardware, the response
> time of the site-reliability engineers, the amount of data on the bad
> disk, etc. These vary between companies, and even between clusters within
> the same company, so it is probably hard to determine what the average
> situation is.
>
> I am not sure why we need this. Can you explain a bit why this data is
> useful for evaluating the motivation and design of this KIP?
>
> > 2. How often do we believe disks are going to fail (in your example
> > configuration) and what do we gain by avoiding the network overhead
> > and doing all the work of moving the replica within the broker to
> > another disk instead of balancing it globally?
>
> I think the chance of disk failure depends mainly on the disk itself
> rather than on the broker configuration. I don't have this data now. I
> will ask our SREs whether they know the mean time to failure for our
> disks. What I was told by our SREs is that disk failure is the most common
> type of hardware failure.
>
> When there is a disk failure, I think it is reasonable to move the replica
> to another broker instead of to another disk on the same broker. The
> reason we want to move replicas within a broker is mainly to optimize
> Kafka cluster performance when we balance load across disks.
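(Stepping back to Jay's back-of-envelope estimate above for a moment: the
following is a minimal sketch of that arithmetic, using only the figures he
assumes in his message - 1% annual drive failure rate, 3-day replacement,
10 drives per server, 1000 servers at $3000/year fully loaded. The numbers
are illustrative assumptions, not measurements.)

# Back-of-envelope cost model for JBOD disk-failure downtime. All inputs are
# the assumed figures from Jay's email above, not measured data.
annual_failure_rate = 0.01     # probability a given drive fails in a year
replacement_days    = 3        # days until a failed drive is replaced
drives_per_server   = 10
servers             = 1000
cost_per_server     = 3000.0   # $/year, fully loaded (power, amortized hw, ...)

# Expected downtime per server per year, ignoring overlapping failures.
downtime_days = annual_failure_rate * replacement_days * drives_per_server

# Fraction of server capacity lost, and its fleet-wide dollar value.
lost_fraction = downtime_days / 365.0
savings = lost_fraction * servers * cost_per_server

print(f"expected downtime per server: {downtime_days:.2f} days/year")
print(f"capacity lost:                {lost_fraction:.3%}")
print(f"upper bound on savings:       ${savings:,.0f} of ${servers * cost_per_server:,.0f}/year")
# -> roughly 0.3 days/year, ~0.08%, and ~$2,500 out of $3,000,000/year.

Even doubling any single input keeps the figure in the low thousands of
dollars per year, which is the point of the argument.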
>
> In comparison to balancing replicas globally, the benefits of moving
> replicas within the broker are that:
>
> 1) The movement is faster, since it doesn't go through a socket or rely
> on the available network bandwidth.
> 2) There is much less impact on the replication traffic between brokers,
> since it doesn't take up inter-broker bandwidth. Depending on the traffic
> pattern, we may need to balance load across disks frequently, and it is
> necessary to prevent this operation from slowing down the existing
> operations (e.g. produce, consume, replication) in the Kafka cluster.
> 3) It gives us the opportunity to do automatic rebalancing between disks
> on the same broker.
>
> > 3. Even if we had to move the replica within the broker, why can't we
> > just treat it as another replica and have it go through the same
> > replication code path that we have today? The downside here is
> > obviously that you need to catch up from the leader, but it is
> > completely free! What do we think is the impact of the network overhead
> > in this case?
>
> Good point. My initial proposal actually used the existing
> ReplicaFetcherThread (i.e. the existing code path) to move replicas
> between disks. However, I switched to using a separate thread pool after
> discussion with Jun and Becket.
>
> The main argument for using a separate thread pool is to keep the design
> simple and easy to reason about. There are a number of differences
> between inter-broker replication and intra-broker replication which make
> it cleaner to do them in separate code paths. I will list them below:
>
> - The throttling mechanism for inter-broker replication traffic and
> intra-broker replication traffic is different. For example, we may want
> to specify a per-topic quota for inter-broker replication traffic because
> we may want some topics to be moved faster than others. But we don't care
> about the priority of topics for intra-broker movement. So the current
> proposal only allows the user to specify a per-broker quota for
> intra-broker replication traffic.
>
> - The quota value for inter-broker replication traffic and intra-broker
> replication traffic is different. The available bandwidth for
> intra-broker movement (a local disk-to-disk copy) can probably be much
> higher than the bandwidth we can afford for inter-broker replication.
>
> - ReplicaFetcherThreads are created per broker, so their number scales
> with the number of brokers. Intuitively, the number of threads doing
> intra-broker data movement should be related to the number of disks in
> the broker, not the number of brokers in the cluster.
>
> - The leader replica has no ReplicaFetcherThread to start with. It seems
> weird to start one just for intra-broker replication.
>
> Because of these differences, we think it is simpler to use a separate
> thread pool and code path so that we can configure and throttle them
> separately.
>
> > 4. What are the chances that we will be able to identify another disk
> > to balance within the broker instead of another disk on another broker?
> > If we have 100s of machines, the probability of finding a better
> > balance by choosing another broker is much higher than by balancing
> > within the broker. Could you add some info on how we are determining
> > this?
>
> It is possible that we can find available space on a remote broker. The
> benefit of allowing intra-broker replication is that, when there is
> available space on both the current broker and a remote broker, the
> rebalance can be completed faster and with much less impact on
> inter-broker replication and user traffic. It is about taking advantage
> of locality when balancing the load.
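(To make the speed/impact argument above concrete, a small sketch with
purely illustrative numbers - replica size, disk copy rate, and the share
of network allowed for replica movement are assumptions, not measurements
from any cluster.)

# Illustrative comparison: moving one replica disk-to-disk within a broker
# vs moving it over the network to another broker. All figures are assumed
# for the sake of the example.
replica_size_gb     = 500
disk_copy_mb_per_s  = 150   # sequential copy between two local disks
move_quota_mb_per_s = 30    # share of the NIC we allow replica movement,
                            # so produce/consume/replication aren't starved

def hours(size_gb, rate_mb_per_s):
    return size_gb * 1024 / rate_mb_per_s / 3600

print(f"intra-broker move: {hours(replica_size_gb, disk_copy_mb_per_s):.1f} h")
print(f"inter-broker move: {hours(replica_size_gb, move_quota_mb_per_s):.1f} h")
# -> roughly 0.9 h vs 4.7 h with these assumptions; the intra-broker copy
#    also leaves the network free for regular replication traffic.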
>
> > 5. Finally, in a cloud setup where more users are going to leverage a
> > shared filesystem (for example, EBS in AWS), all this change is not of
> > much gain since you don't need to balance between the volumes within
> > the same broker.
>
> You are right. KIP-113 is useful only if the user uses JBOD. If the user
> uses an extra storage layer of replication, such as RAID-10 or EBS, they
> don't need KIP-112 or KIP-113. Note that the user will replicate data
> more times than the replication factor of the Kafka topic if an extra
> storage layer of replication is used.
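(As a footnote to that last point, a minimal sketch of the storage
amplification being described: RAID-10's 2x mirroring is standard, while
the factor shown for EBS is an assumed placeholder, since AWS does not
publish how many internal copies a volume keeps.)

# Physical copies of each byte = Kafka replication factor x storage-layer
# replication factor.
kafka_rf = 3
storage_factor = {"JBOD": 1, "RAID-10": 2, "EBS (assumed)": 2}

for layer, factor in storage_factor.items():
    print(f"{layer:14s}: {kafka_rf * factor} physical copies per byte")
# -> JBOD stores exactly the 3 copies the topic asks for; the other setups
#    store more, which is the extra cost being pointed out above.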