[ https://issues.apache.org/jira/browse/KAFKA-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359925#comment-14359925 ]
Aditya Auradkar commented on KAFKA-1546:
----------------------------------------

I ran a bunch of tests on my patch for KAFKA-1546. I started a cluster and used the ProducerPerformance class to throw a ton of load.

1. Verify that the replica stays in ISR for a large volume of messages. Generated lots of load with small messages and very high throughput. I noticed that the replica did not fall out of ISR. The previous solution would have fluctuated in and out of ISR.

    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test 50000000000 100 -1 acks=1 bootstrap.servers=localhost:9092 buffer.memory=67108864 batch.size=8196

2. Stuck follower - Generated some load and paused the follower process using SIGSTOP. I raised the zk session timeout so the process stayed registered with ZK but did not send a fetch request for 'n' seconds. This threw it out of ISR as expected.

3. Lagging follower - I was able to do this by reducing the max fetch size on the follower instance. This made it impossible for the follower to catch up, causing it to be removed from ISR.

4. I also simulated the case where the follower was down for a long time and the leader had accumulated a significant amount of data. On starting the follower, it stayed out of ISR until it caught up to the log end offset.

> Automate replica lag tuning
> ---------------------------
>
>                 Key: KAFKA-1546
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1546
>             Project: Kafka
>          Issue Type: Improvement
>          Components: replication
>    Affects Versions: 0.8.0, 0.8.1, 0.8.1.1
>            Reporter: Neha Narkhede
>            Assignee: Aditya Auradkar
>              Labels: newbie++
>             Fix For: 0.8.3
>
>         Attachments: KAFKA-1546.patch, KAFKA-1546_2015-03-11_18:48:09.patch,
> KAFKA-1546_2015-03-12_13:42:01.patch
>
>
> Currently, there is no good way to tune the replica lag configs to
> automatically account for high and low volume topics on the same cluster.
> For the low-volume topic it will take a very long time to detect a lagging
> replica, and for the high-volume topic it will have false-positives.
> One approach to making this easier would be to have the configuration
> be something like replica.lag.max.ms and translate this into a number
> of messages dynamically based on the throughput of the partition.
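
A time-based lag bound like the one discussed in this issue can be sketched roughly as follows. This is only an illustration and not the KAFKA-1546 patch: rather than translating the bound into a message count, it assumes the leader remembers, per follower, the last time a fetch reached the leader's log end offset, and treats the follower as out of ISR once that moment is more than a configured bound in the past. The names FollowerState, record_fetch, is_in_sync and max_lag_ms are hypothetical and chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class FollowerState:
    """Hypothetical per-follower bookkeeping kept by the leader."""
    # Last time (ms) a fetch from this follower reached the leader's log end offset.
    last_caught_up_ms: int


def record_fetch(state, fetch_offset, leader_log_end_offset, now_ms):
    """Refresh the caught-up timestamp only when the fetch reaches the log end offset."""
    if fetch_offset >= leader_log_end_offset:
        state.last_caught_up_ms = now_ms


def is_in_sync(state, now_ms, max_lag_ms):
    """Assumed rule: in sync if the follower was caught up within the last max_lag_ms."""
    return now_ms - state.last_caught_up_ms <= max_lag_ms
```

Under this assumed rule, a stuck follower that sends no fetches and a lagging follower that fetches but never reaches the log end offset both stop refreshing the timestamp, so both drop out after max_lag_ms regardless of how many messages the partition receives. A follower that restarts after a long outage stays out until one of its fetches reaches the log end offset.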