Hey Jay, Sriram,

Great point. If I understand you right, you are suggesting that we can simply
use RAID-0 so that the load can be evenly distributed across disks, and that
even though a disk failure will bring down the entire broker, the reduced
availability compared to using KIP-112 and KIP-113 may be negligible. In other
words, it may be better to just accept the slightly reduced availability
instead of introducing the complexity of KIP-112 and KIP-113.
Let's assume the following:

- There are 30 brokers in a cluster and each broker has 10 disks.
- The replication factor is 3 and min.isr = 2.
- The annual disk failure rate is 2%, according to this
<https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/> blog.
- It takes 3 days to replace a disk.

Here is my calculation for the probability of data loss due to disk failure:

probability that a given disk fails in a given year: 2%
probability that a given disk is offline on a given day: 2% / 365 * 3
probability that a given broker is offline on a given day due to disk
failure: 2% / 365 * 3 * 10
probability that some broker is offline on a given day due to disk
failure: 2% / 365 * 3 * 10 * 30 = 5%
probability that any three brokers are offline on a given day due to disk
failure: 5% * 5% * 5% = 0.0125%
probability of data loss due to disk failure: 0.0125%

Here is my calculation for the probability of service unavailability due to
disk failure:

probability that a given disk fails in a given year: 2%
probability that a given disk is offline on a given day: 2% / 365 * 3
probability that a given broker is offline on a given day due to disk
failure: 2% / 365 * 3 * 10
probability that some broker is offline on a given day due to disk
failure: 2% / 365 * 3 * 10 * 30 = 5%
probability that any two brokers are offline on a given day due to disk
failure: 5% * 5% = 0.25%
probability of unavailability due to disk failure: 0.25%

Note that the unavailability due to disk failure would be unacceptably high in
this case, and the probability of data loss due to disk failure would be
higher than 0.01%. Neither is acceptable if Kafka is intended to achieve
four-nines availability.
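In case it helps to sanity-check the arithmetic, here is a small Python sketch
that reproduces the estimate above. It uses the same rough model as the
numbers above (independent failures, simply multiplying the daily per-broker
outage probabilities), and the variable names are just for illustration:

# Back-of-the-envelope estimate of data loss / unavailability due to disk
# failure, under the assumptions above: 30 brokers, 10 disks per broker,
# 2% annual disk failure rate, 3 days to replace a disk. Failures are
# treated as independent, matching the rough calculation in the text.

brokers = 30
disks_per_broker = 10
annual_disk_failure_rate = 0.02
days_to_replace = 3

# Probability that a given disk is offline on a given day.
p_disk_down = annual_disk_failure_rate / 365 * days_to_replace

# Probability that a given broker is down on a given day due to disk failure.
p_broker_down = p_disk_down * disks_per_broker

# Probability that some broker in the cluster is down on a given day.
p_any_broker_down = p_broker_down * brokers          # ~5%

# With replication factor 3, losing 3 brokers can mean data loss;
# with min.isr = 2, losing 2 brokers can mean unavailability.
p_data_loss = p_any_broker_down ** 3                 # ~0.0125%
p_unavailable = p_any_broker_down ** 2               # ~0.25%

print(f"P(some broker down on a given day): {p_any_broker_down:.2%}")
print(f"P(data loss, 3 brokers down):       {p_data_loss:.4%}")
print(f"P(unavailability, 2 brokers down):  {p_unavailable:.4%}")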
Thanks,
Dong

On Tue, Jun 6, 2017 at 11:26 PM, Jay Kreps <j...@confluent.io> wrote:

> I think Ram's point is that in-place failure handling is pretty complicated,
> and since this is meant to be a cost-saving feature, we should construct an
> argument for it grounded in data.
>
> Assume an annual failure rate of 1% (reasonable, but data is available
> online), and assume it takes 3 days to get the drive replaced. Say you have
> 10 drives per server. Then the expected downtime for each server is roughly
> 1% * 3 days * 10 = 0.3 days/year (this is slightly off since I'm ignoring
> the case of multiple failures, but I don't know that changes it much). So
> the savings from this feature is 0.3/365 = 0.08%. Say you have 1000 servers
> and they cost $3000/year fully loaded including power, the cost of the hw
> amortized over its life, etc. Then this feature saves you $3000 on your
> total server cost of $3m, which seems not very worthwhile compared to other
> optimizations...?
>
> Anyhow, not sure the arithmetic is right there, but I think that is the
> type of argument that would be helpful to think about the tradeoff in
> complexity.
>
> -Jay
>
> On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin <lindon...@gmail.com> wrote:
>
> > Hey Sriram,
> >
> > Thanks for taking time to review the KIP. Please see below my answers to
> > your questions:
> >
> > >1. Could you pick a hardware/Kafka configuration and go over what is the
> > >average disk/partition repair/restore time that we are targeting for a
> > >typical JBOD setup?
> >
> > We currently don't have this data. I think the disk/partition
> > repair/restore time depends on the availability of hardware, the response
> > time of the site-reliability engineers, the amount of data on the bad
> > disk, etc. These vary between companies and even clusters within the same
> > company, and it is probably hard to determine what the average situation
> > is.
> >
> > I am not very sure why we need this. Can you explain a bit why this data
> > is useful to evaluate the motivation and design of this KIP?
> >
> > >2. How often do we believe disks are going to fail (in your example
> > >configuration) and what do we gain by avoiding the network overhead and
> > >doing all the work of moving the replica within the broker to another
> > >disk instead of balancing it globally?
> >
> > I think the chance of disk failure depends mainly on the disk itself
> > rather than the broker configuration. I don't have this data now. I will
> > ask our SREs whether they know the mean time to failure for our disks.
> > What I was told by our SREs is that disk failure is the most common type
> > of hardware failure.
> >
> > When there is a disk failure, I think it is reasonable to move the
> > replica to another broker instead of another disk on the same broker. The
> > reason we want to move replicas within a broker is mainly to optimize
> > Kafka cluster performance when we balance load across disks.
> >
> > In comparison to balancing replicas globally, the benefits of moving
> > replicas within a broker are:
> >
> > 1) The movement is faster since it doesn't go through a socket or rely on
> > the available network bandwidth.
> > 2) There is much less impact on the replication traffic between brokers,
> > since it doesn't take up bandwidth between brokers. Depending on the
> > traffic pattern, we may need to balance load across disks frequently, and
> > it is necessary to prevent this operation from slowing down the existing
> > operations (e.g. produce, consume, replication) in the Kafka cluster.
> > 3) It gives us the opportunity to do automatic rebalancing between disks
> > on the same broker.
> >
> > >3. Even if we had to move the replica within the broker, why cannot we
> > >just treat it as another replica and have it go through the same
> > >replication code path that we have today? The downside here is obviously
> > >that you need to catch up from the leader but it is completely free!
> > >What do we think is the impact of the network overhead in this case?
> >
> > Good point. My initial proposal actually used the existing
> > ReplicaFetcherThread (i.e. the existing code path) to move replicas
> > between disks. However, I switched to using a separate thread pool after
> > discussion with Jun and Becket.
> >
> > The main argument for using a separate thread pool is to keep the design
> > simple and easy to reason about. There are a number of differences
> > between inter-broker replication and intra-broker replication that make
> > it cleaner to do them in separate code paths. I will list them below:
> >
> > - The throttling mechanism for inter-broker replication traffic and
> > intra-broker replication traffic is different. For example, we may want
> > to specify per-topic quotas for inter-broker replication traffic because
> > we may want some topics to be moved faster than others. But we don't care
> > about the priority of topics for intra-broker movement. So the current
> > proposal only allows the user to specify a per-broker quota for
> > intra-broker replication traffic.
> >
> > - The quota value for inter-broker replication traffic and intra-broker
> > replication traffic is different.
> > The available bandwidth for intra-broker replication can probably be much
> > higher than the bandwidth for inter-broker replication.
> >
> > - The ReplicaFetcherThread is per broker. Intuitively, the number of
> > threads doing intra-broker data movement should be related to the number
> > of disks in the broker, not the number of brokers in the cluster.
> >
> > - The leader replica has no ReplicaFetcherThread to start with. It seems
> > weird to start one just for intra-broker replication.
> >
> > Because of these differences, we think it is simpler to use a separate
> > thread pool and code path so that we can configure and throttle them
> > separately.
> >
> > >4. What are the chances that we will be able to identify another disk to
> > >balance within the broker instead of another disk on another broker? If
> > >we have 100's of machines, the probability of finding a better balance
> > >by choosing another broker is much higher than balancing within the
> > >broker. Could you add some info on how we are determining this?
> >
> > It is possible that we can find available space on a remote broker. The
> > benefit of allowing intra-broker replication is that, when there is
> > available space on both the current broker and a remote broker, the
> > rebalance can be completed faster with much less impact on the
> > inter-broker replication or the users' traffic. It is about taking
> > advantage of locality when balancing the load.
> >
> > >5. Finally, in a cloud setup where more users are going to leverage a
> > >shared filesystem (example, EBS in AWS), all this change is not of much
> > >gain since you don't need to balance between the volumes within the same
> > >broker.
> >
> > You are right. KIP-113 is useful only if the user uses JBOD. If the user
> > uses an extra storage layer of replication, such as RAID-10 or EBS, they
> > don't need KIP-112 or KIP-113. Note that the user will replicate data
> > more times than the replication factor of the Kafka topic if an extra
> > storage layer of replication is used.
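P.S. For comparison with the cost estimate in Jay's quoted message above, here
is the same style of sketch with the numbers from his message (1% annual drive
failure rate, 3 days to replace, 10 drives per server, 1000 servers at
$3000/year fully loaded). With those inputs it works out to roughly $2,500 per
year of avoided downtime cost, i.e. the same order of magnitude as the ~$3,000
figure he mentions:

# Rough savings estimate from the quoted message: how much server cost does
# tolerating single-disk failures (instead of losing the whole broker) save?
# All figures below are the ones stated in that message.

annual_drive_failure_rate = 0.01
days_to_replace = 3
drives_per_server = 10
servers = 1000
cost_per_server_per_year = 3000  # USD, fully loaded

# Expected downtime per server per year if any single drive failure takes
# the whole server down (ignoring overlapping failures, as in the message).
downtime_days_per_server = (annual_drive_failure_rate * days_to_replace
                            * drives_per_server)

# Fraction of server capacity lost, and the corresponding dollar value.
lost_fraction = downtime_days_per_server / 365
savings_per_year = lost_fraction * servers * cost_per_server_per_year

print(f"Expected downtime per server: {downtime_days_per_server:.2f} days/year")
print(f"Capacity lost: {lost_fraction:.3%}")
print(f"Estimated savings: ${savings_per_year:,.0f}/year out of "
      f"${servers * cost_per_server_per_year:,}/year total")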