Hey Sriram,

Thanks for taking the time to review the KIP. Please see my answers to
your questions below:

>1. Could you pick a hardware/Kafka configuration and go over what is the
>average disk/partition repair/restore time that we are targeting for a
>typical JBOD setup?

We currently don't have this data. I think the disk/partition repair/restore
time depends on the availability of hardware, the response time of the
site-reliability engineers, the amount of data on the bad disk, etc. These
vary between companies, and even between clusters within the same company, so
it is probably hard to determine what the average situation is.

I am not sure why we need this data, though. Can you explain a bit more about
how it is useful for evaluating the motivation and design of this KIP?

>2. How often do we believe disks are going to fail (in your example
>configuration) and what do we gain by avoiding the network overhead and
>doing all the work of moving the replica within the broker to another disk
>instead of balancing it globally?

I think the chance of disk failure depends mainly on the disk itself rather
than the broker configuration. I don't have this data now. I will ask our
SREs whether they know the mean time to failure of our disks. What I have
been told by our SREs is that disk failure is the most common type of
hardware failure.

When there is a disk failure, I think it is reasonable to move the replica to
another broker instead of to another disk on the same broker. The reason we
want to move replicas within a broker is mainly to optimize Kafka cluster
performance when we balance load across disks.

In comparison to balancing replicas globally, the benefits of moving replicas
within a broker are:

1) The movement is faster, since it doesn't go through a socket or rely on
the available network bandwidth.
2) It has much less impact on the replication traffic between brokers, since
it doesn't take up inter-broker bandwidth. Depending on the traffic pattern,
we may need to balance load across disks frequently, and it is necessary to
prevent this operation from slowing down the existing operations (e.g.
produce, consume, replication) in the Kafka cluster.
3) It gives us the opportunity to do automatic rebalancing between disks
on the same broker.


>3. Even if we had to move the replica within the broker, why cannot we just
>treat it as another replica and have it go through the same replication
>code path that we have today? The downside here is obviously that you need
>to catchup from the leader but it is completely free! What do we think is
>the impact of the network overhead in this case?

Good point. My initial proposal actually used the existing
ReplicaFetcherThread (i.e. the existing code path) to move replicas between
disks. However, I switched to a separate thread pool after discussion
with Jun and Becket.

The main argument for using a separate thread pool is to keep the design
simple and easy to reason about. There are a number of differences between
inter-broker replication and intra-broker replication which make it cleaner
to handle them in separate code paths. I will list them below:

- The throttling mechanisms for inter-broker replication traffic and
intra-broker replication traffic are different. For example, we may want to
specify a per-topic quota for inter-broker replication traffic, because we may
want some topics to be moved faster than others. But we don't care about the
priority of topics for intra-broker movement. So the current proposal only
allows the user to specify a per-broker quota for intra-broker replication
traffic.

- The quota values for inter-broker replication traffic and intra-broker
replication traffic are different. The available bandwidth for intra-broker
replication can probably be much higher than the bandwidth for inter-broker
replication.

- ReplicaFetcherThreads are per broker. Intuitively, the number of threads
doing intra-broker data movement should be related to the number of disks
in the broker, not the number of brokers in the cluster.

- The leader replica has no ReplicaFetcherThread to start with. It seems weird
to start one just for intra-broker replication.

Because of these differences, we think it is simpler to use a separate thread
pool and code path, so that we can configure and throttle them separately.
(A rough sketch of what that could look like follows below.)
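
To make the separation concrete, here is a minimal sketch in Java of a
dedicated mover pool. This is only an illustration for this email, not the
actual broker code, and all class, method and config names (as well as the
example log.dirs values and the throttle constant) are made up:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only; none of these names come from the KIP or the broker code.
public class IntraBrokerMoverSketch {

    // A single per-broker quota (bytes/second) is enough for intra-broker movement,
    // since we don't prioritize topics for intra-broker movement.
    static final long INTRA_BROKER_THROTTLE_BYTES_PER_SEC = 100L * 1024 * 1024;

    public static void main(String[] args) {
        // log.dirs as configured on this broker (example values).
        List<String> logDirs = List.of("/data/disk1", "/data/disk2", "/data/disk3");

        // The pool is sized by the number of log directories on this broker, unlike
        // ReplicaFetcherThreads, whose count is tied to the brokers in the cluster.
        ExecutorService moverPool = Executors.newFixedThreadPool(logDirs.size());

        // Each task copies one replica's log segments from its current log dir to the
        // destination log dir, throttled to the per-broker rate above.
        moverPool.submit(() -> moveReplica("topicA-0", "/data/disk1", "/data/disk3"));

        moverPool.shutdown();
    }

    static void moveReplica(String topicPartition, String sourceDir, String destDir) {
        // Placeholder: a real implementation would copy the log segments and then
        // switch the partition to the destination log dir once the copy has caught up.
        System.out.printf("Moving %s from %s to %s at <= %d bytes/sec%n",
                topicPartition, sourceDir, destDir, INTRA_BROKER_THROTTLE_BYTES_PER_SEC);
    }
}

The point is just that both the pool size and the quota are broker-local
concepts, which is hard to express cleanly inside the existing
ReplicaFetcherThread code path.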


>4. What are the chances that we will be able to identify another disk to
>balance within the broker instead of another disk on another broker? If we
>have 100's of machines, the probability of finding a better balance by
>choosing another broker is much higher than balancing within the broker.
>Could you add some info on how we are determining this?

It is possible that we can find available space on a remote broker. The
benefit of allowing intra-broker replication is that, when there is
available space on both the current broker and a remote broker, the
rebalance can be completed faster and with much less impact on the
inter-broker replication or the users' traffic. It is about taking advantage
of locality when balancing the load.

>5. Finally, in a cloud setup where more users are going to leverage a
>shared filesystem (example, EBS in AWS), all this change is not of much
>gain since you don't need to balance between the volumes within the same
>broker.

You are right. KIP-113 is useful only if the user uses JBOD. If the user uses
an extra storage layer of replication, such as RAID-10 or EBS, they don't
need KIP-112 or KIP-113. Note that the user will end up replicating data more
times than the replication factor of the Kafka topic if an extra storage
layer of replication is used, as illustrated below.
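
For example (rough arithmetic only; the numbers below are illustrative, not
measurements from any cluster):

// Illustrative arithmetic; values are examples, not measurements.
public class CopyCountExample {
    public static void main(String[] args) {
        int kafkaReplicationFactor = 3;
        int storageLayerCopies = 2;   // e.g. RAID-10 mirroring (EBS also replicates internally)

        // JBOD: only Kafka replicates the data.
        int copiesWithJbod = kafkaReplicationFactor;                                    // 3
        // An extra storage layer multiplies the number of physical copies.
        int copiesWithStorageReplication = kafkaReplicationFactor * storageLayerCopies; // 6

        System.out.println("JBOD copies: " + copiesWithJbod
                + ", RAID-10/EBS copies: " + copiesWithStorageReplication);
    }
}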
