[ 
https://issues.apache.org/jira/browse/KAFKA-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359925#comment-14359925
 ] 

Aditya Auradkar commented on KAFKA-1546:
----------------------------------------

I ran several tests on my patch for KAFKA-1546. I started a cluster and used 
the ProducerPerformance class to generate heavy load.

1. Verify that the replica stays in ISR under a large volume of messages. 
Generated lots of load with small messages and very high throughput. The 
replica did not fall out of ISR. With the previous solution it would have 
fluctuated in and out of ISR.
    bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance \
        test 50000000000 100 -1 acks=1 bootstrap.servers=localhost:9092 \
        buffer.memory=67108864 batch.size=8196

2. Stuck follower - Generated some load and paused the follower process using 
SIGSTOP. I raised the ZK session timeout so that the paused process stayed 
registered with ZK while sending no fetch requests for 'n' seconds. It was 
removed from ISR as expected.

3. Lagging follower - I was able to do this by reducing the max fetch size 
on the follower instance. This made it impossible for the follower to catch 
up, causing it to be removed from ISR.

4. I also simulated the case where the follower was down for a long time and 
the leader had accumulated a significant amount of data. On starting the 
follower, it stayed out of ISR until it caught up to the log end offset.
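
The behaviour exercised by the four cases above can be sketched roughly as
follows. This is a minimal illustration in Python, not the code in the patch:
FollowerState, on_fetch, check_isr and max_lag_ms are hypothetical names, and
the sketch assumes the leader simply records the last time each follower had
fetched up to the leader's log end offset.

```python
from dataclasses import dataclass


@dataclass
class FollowerState:
    """Hypothetical per-follower bookkeeping kept by the leader."""
    last_caught_up_ms: int  # last time the follower reached the log end offset
    in_isr: bool = True


def on_fetch(state, fetch_offset, leader_log_end_offset, now_ms):
    """Record a fetch; only a fetch at the log end offset counts as caught up."""
    if fetch_offset >= leader_log_end_offset:
        state.last_caught_up_ms = now_ms
        state.in_isr = True  # a caught-up follower (re)joins ISR


def check_isr(state, now_ms, max_lag_ms):
    """Remove the follower from ISR if it has not caught up within max_lag_ms."""
    if state.in_isr and now_ms - state.last_caught_up_ms > max_lag_ms:
        state.in_isr = False
    return state.in_isr


# Assumed values for illustration only (times in milliseconds).
follower = FollowerState(last_caught_up_ms=0)
on_fetch(follower, 100, 100, now_ms=1_000)        # caught up: stays in ISR
check_isr(follower, now_ms=5_000, max_lag_ms=10_000)
on_fetch(follower, 50, 100, now_ms=9_000)         # behind: timestamp not updated
check_isr(follower, now_ms=11_001, max_lag_ms=10_000)  # stuck or lagging: removed
on_fetch(follower, 100, 100, now_ms=13_000)       # caught up again: rejoins
```

Because the check depends only on elapsed time and not on how many messages
the follower is behind, a follower that keeps reaching the log end offset
stays in ISR regardless of throughput (case 1), while a paused or permanently
lagging follower is removed once max_lag_ms elapses (cases 2 and 3), and a
restarted follower rejoins only after it catches up (case 4).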


> Automate replica lag tuning
> ---------------------------
>
>                 Key: KAFKA-1546
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1546
>             Project: Kafka
>          Issue Type: Improvement
>          Components: replication
>    Affects Versions: 0.8.0, 0.8.1, 0.8.1.1
>            Reporter: Neha Narkhede
>            Assignee: Aditya Auradkar
>              Labels: newbie++
>             Fix For: 0.8.3
>
>         Attachments: KAFKA-1546.patch, KAFKA-1546_2015-03-11_18:48:09.patch, 
> KAFKA-1546_2015-03-12_13:42:01.patch
>
>
> Currently, there is no good way to tune the replica lag configs to 
> automatically account for high and low volume topics on the same cluster. 
> For the low-volume topic it will take a very long time to detect a lagging
> replica, and for the high-volume topic it will have false-positives.
> One approach to making this easier would be to have the configuration
> be something like replica.lag.max.ms and translate this into a number
> of messages dynamically based on the throughput of the partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
