Ivan Babrou created KAFKA-6414:
----------------------------------

             Summary: Inverse replication for replicas that are far behind
                 Key: KAFKA-6414
                 URL: https://issues.apache.org/jira/browse/KAFKA-6414
             Project: Kafka
          Issue Type: Bug
          Components: replication
            Reporter: Ivan Babrou
Let's suppose the following starting point:

* 1 topic
* 1 partition
* 1 reader
* 24h retention period
* leader outbound bandwidth is 3x its inbound bandwidth (1x replication + 1x reader + 1x slack = total outbound)

In this scenario, when a replica fails and needs to be brought back from scratch, it can catch up at 2x inbound bandwidth (1x regular replication + 1x slack used). At 2x catch-up speed the replica will reach the point where the leader is now in 24h / 2x = 12h. However, in those 12h the oldest 12h of the topic will fall off the retention cliff and be deleted. There's absolutely no use for this data: it will never be read from the replica in any scenario. And this is not even counting the fact that we still need to replicate 12h more of data that accumulated since the time we started (see the arithmetic sketch below).

My suggestion is to refill sufficiently out-of-sync replicas backwards from the tip: newest segments first, oldest segments last. Then we can stop when we hit the retention cliff and replicate far less data. The lower the ratio of catch-up bandwidth to inbound bandwidth, the higher the returns would be. This will also set a hard cap on recovery time: it will be no higher than the retention period if catch-up speed is >1x (if it's less, you're forever out of ISR anyway).

What exactly "sufficiently out of sync" means in terms of lag is a topic for debate. The default segment size is 1GiB; I'd say that being more than one full segment behind probably warrants this (see the sketch below).

As of now, the solution for slow recovery appears to be to reduce retention to speed up recovery, which doesn't seem very friendly.
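To make the numbers above concrete, here is a back-of-the-envelope model of the catch-up arithmetic. It is a sketch assuming steady rates and the hypothetical figures from this report (24h retention, 2x catch-up bandwidth), not anything derived from Kafka code:

{code:java}
// Models a full rebuild of a failed replica under constant rates.
// All figures are the hypothetical ones from this report.
public class CatchUpMath {
    public static void main(String[] args) {
        double retentionH = 24.0; // retention period, hours
        double r = 2.0;           // catch-up bandwidth / inbound bandwidth

        // Forward (oldest-first) rebuild: the backlog starts at one full
        // retention window and shrinks at (r - 1) hours of log per wall
        // hour, so the geometric series (12h + 6h + 3h + ...) sums to:
        double forwardH = retentionH / (r - 1);

        // The first pass over the original backlog takes retention / r
        // hours; everything older than (retention - retention / r) at the
        // start expires during that pass, so this much copied data is
        // deleted without ever being readable:
        double wastedH = retentionH / r;

        // Inverse (newest-first) rebuild: we never fetch more than one
        // retention window of data, so the copy is bounded by retention / r.
        double inverseH = retentionH / r;

        System.out.printf("forward rebuild: %.0fh total, %.0fh of copied data expires unread%n",
                forwardH, wastedH);
        System.out.printf("inverse rebuild: <= %.0fh%n", inverseH);
    }
}
{code}

With r = 2 this prints 24h total for the forward rebuild (12h to reach the leader's starting position, then the accumulated 12h, 6h, and so on) against at most 12h for the inverse one, and the gap widens as r approaches 1.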
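The refill itself could look roughly like the following. This is a minimal illustrative sketch: Segment, Fetcher, and the lag accounting are hypothetical stand-ins invented here, not Kafka APIs, and the real replica fetcher works nothing like this today. The point is only the newest-first ordering, the retention-cliff cutoff, and the "more than one full segment behind" trigger:

{code:java}
import java.util.Comparator;
import java.util.List;

class InverseRefillSketch {
    static final long SEGMENT_BYTES = 1L << 30; // default 1 GiB segment

    // Hypothetical stand-ins, not Kafka types.
    record Segment(long baseOffset, long lastTimestampMs) {}

    interface Fetcher {
        void copySegmentFromLeader(Segment s);
    }

    static void refill(List<Segment> leaderSegments, long replicaLagBytes,
                       long retentionMs, long nowMs, Fetcher fetcher) {
        // Only flip to inverse mode when the replica is "sufficiently out
        // of sync" -- here, more than one full segment behind.
        if (replicaLagBytes <= SEGMENT_BYTES) {
            return; // plain forward replication is fine
        }
        long cliffMs = nowMs - retentionMs;
        leaderSegments.stream()
                // newest segments first, oldest last
                .sorted(Comparator.comparingLong(Segment::baseOffset).reversed())
                // stop at the retention cliff: anything older would be
                // deleted before any consumer could read it from us
                .takeWhile(s -> s.lastTimestampMs >= cliffMs)
                .forEach(fetcher::copySegmentFromLeader);
    }
}
{code}

A real implementation would also have to re-evaluate the cliff while the copy runs (retention keeps moving) and pick up new segments appearing at the tip in the meantime; both details are omitted here.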