Re: [DISCUSS] KIP-113: Support replicas movement between log directories

Jay Kreps Tue, 06 Jun 2017 23:27:07 -0700

I think Ram's point is that in place failure is pretty complicated, and
this is meant to be a cost saving feature, we should construct an argument
for it grounded in data.


Assume an annual failure rate of 1% (reasonable, but data is available
online), and assume it takes 3 days to get the drive replaced. Say you have
10 drives per server. Then the expected downtime for each server is roughly
1% * 3 days * 10 = 0.3 days/year (this is slightly off since I'm ignoring
the case of multiple failures, but I don't know that changes it much). So
the savings from this feature is 0.3/365 = 0.08%. Say you have 1000 servers
and they cost $3000/year fully loaded including power, the cost of the hw
amortized over it's life, etc. Then this feature saves you $3000 on your
total server cost of $3m which seems not very worthwhile compared to other
optimizations...?

Anyhow, not sure the arithmetic is right there, but i think that is the
type of argument that would be helpful to think about the tradeoff in
complexity.

-Jay



On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin <lindon...@gmail.com> wrote:

> Hey Sriram,
>
> Thanks for taking time to review the KIP. Please see below my answers to
> your questions:
>
> >1. Could you pick a hardware/Kafka configuration and go over what is the
> >average disk/partition repair/restore time that we are targeting for a
> >typical JBOD setup?
>
> We currently don't have this data. I think the disk/partition repair/store
> time depends on availability of hardware, the response time of
> site-reliability engineer, the amount of data on the bad disk etc. These
> vary between companies and even clusters within the same company and it is
> probably hard to determine what is the average situation.
>
> I am not very sure why we need this. Can you explain a bit why this data is
> useful to evaluate the motivation and design of this KIP?
>
> >2. How often do we believe disks are going to fail (in your example
> >configuration) and what do we gain by avoiding the network overhead and
> >doing all the work of moving the replica within the broker to another disk
> >instead of balancing it globally?
>
> I think the chance of disk failure depends mainly on the disk itself rather
> than the broker configuration. I don't have this data now. I will ask our
> SRE whether they know the mean-time-to-fail for our disk. What I was told
> by SRE is that disk failure is the most common type of hardware failure.
>
> When there is disk failure, I think it is reasonable to move replica to
> another broker instead of another disk on the same broker. The reason we
> want to move replica within broker is mainly to optimize the Kafka cluster
> performance when we balance load across disks.
>
> In comparison to balancing replicas globally, the benefit of moving replica
> within broker is that:
>
> 1) the movement is faster since it doesn't go through socket or rely on the
> available network bandwidth;
> 2) much less impact on the replication traffic between broker by not taking
> up bandwidth between brokers. Depending on the pattern of traffic, we may
> need to balance load across disk frequently and it is necessary to prevent
> this operation from slowing down the existing operation (e.g. produce,
> consume, replication) in the Kafka cluster.
> 3) It gives us opportunity to do automatic broker rebalance between disks
> on the same broker.
>
>
> >3. Even if we had to move the replica within the broker, why cannot we
> just
> >treat it as another replica and have it go through the same replication
> >code path that we have today? The downside here is obviously that you need
> >to catchup from the leader but it is completely free! What do we think is
> >the impact of the network overhead in this case?
>
> Good point. My initial proposal actually used the existing
> ReplicaFetcherThread (i.e. the existing code path) to move replica between
> disks. However, I switched to use separate thread pool after discussion
> with Jun and Becket.
>
> The main argument for using separate thread pool is to actually keep the
> design simply and easy to reason about. There are a number of difference
> between inter-broker replication and intra-broker replication which makes
> it cleaner to do them in separate code path. I will list them below:
>
> - The throttling mechanism for inter-broker replication traffic and
> intra-broker replication traffic is different. For example, we may want to
> specify per-topic quota for inter-broker replication traffic because we may
> want some topic to be moved faster than other topic. But we don't care
> about priority of topics for intra-broker movement. So the current proposal
> only allows user to specify per-broker quota for inter-broker replication
> traffic.
>
> - The quota value for inter-broker replication traffic and intra-broker
> replication traffic is different. The available bandwidth for inter-broker
> replication can probably be much higher than the bandwidth for inter-broker
> replication.
>
> - The ReplicaFetchThread is per broker. Intuitively, the number of threads
> doing intra broker data movement should be related to the number of disks
> in the broker, not the number of brokers in the cluster.
>
> - The leader replica has no ReplicaFetchThread to start with. It seems
> weird to
> start one just for intra-broker replication.
>
> Because of these difference, we think it is simpler to use separate thread
> pool and code path so that we can configure and throttle them separately.
>
>
> >4. What are the chances that we will be able to identify another disk to
> >balance within the broker instead of another disk on another broker? If we
> >have 100's of machines, the probability of finding a better balance by
> >choosing another broker is much higher than balancing within the broker.
> >Could you add some info on how we are determining this?
>
> It is possible that we can find available space on a remote broker. The
> benefit of allowing intra-broker replication is that, when there are
> available space in both the current broker and a remote broker, the
> rebalance can be completed faster with much less impact on the inter-broker
> replication or the users traffic. It is about taking advantage of locality
> when balance the load.
>
> >5. Finally, in a cloud setup where more users are going to leverage a
> >shared filesystem (example, EBS in AWS), all this change is not of much
> >gain since you don't need to balance between the volumes within the same
> >broker.
>
> You are right. This KIP-113 is useful only if user uses JBOD. If user uses
> an extra storage layer of replication, such as RAID-10 or EBS, they don't
> need KIP-112 or KIP-113. Note that user will replicate data more times than
> the replication factor of the Kafka topic if an extra storage layer of
> replication is used.
>

Re: [DISCUSS] KIP-113: Support replicas movement between log directories

Reply via email to