[jira] [Updated] (KAFKA-13367) Performance Degradation during introducing Network Delay

Thomas Heinze (Jira) Tue, 12 Oct 2021 02:12:04 -0700


     [ 
https://issues.apache.org/jira/browse/KAFKA-13367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Thomas Heinze updated KAFKA-13367:
----------------------------------
    Description: 
Hi Kafka community,

 

We are running a few chaos experiments to observe Kafka's behaviour during 
data center issues. To simulate a slow network, we run the following command 
on two of our six brokers (the brokers are spread across 3 AZs on AWS; both 
affected brokers are in the same AZ):
{code:bash}
tc qdisc add dev eth0 root netem delay <x>ms
{code}
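
For reproducibility, here is a minimal sketch of how the delay could be parameterised and later removed again. The helper names are hypothetical and not part of our actual tooling; `eth0` is taken from the command above, and the 1000/2000/5000 ms values are the delays used in the experiments described in this report. The script only builds and prints the commands; running them requires root on the broker.

```python
import shlex
from typing import List


def netem_add_cmd(delay_ms: int, dev: str = "eth0") -> List[str]:
    """Build the tc command that adds a fixed netem delay on `dev`."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    # tc expects the value and unit as one token, e.g. "1000ms".
    return ["tc", "qdisc", "add", "dev", dev, "root", "netem",
            "delay", f"{delay_ms}ms"]


def netem_del_cmd(dev: str = "eth0") -> List[str]:
    """Build the tc command that removes the netem qdisc again."""
    return ["tc", "qdisc", "del", "dev", dev, "root", "netem"]


if __name__ == "__main__":
    for delay in (1000, 2000, 5000):
        print(shlex.join(netem_add_cmd(delay)))
    print(shlex.join(netem_del_cmd()))
```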
 
At the same time, Kafka producers write roughly 4k messages per second to a 
topic with 10 partitions, replication factor 3 and min.insync.replicas=2. We 
observe the following:
 * *Introducing a 1000 ms delay (07:10 to 07:20 in the attached picture)*: The 
producers see significantly higher response times and the throughput drops to 
2k messages per second.
 * *Introducing a 2000 ms delay (07:25 to 07:35 in the attached picture)*: The 
producer response times increase further and the throughput drops to 300 
messages per second.
 * *Introducing a 5000 ms delay*: The Kafka cluster removes the slow brokers 
from the list of active replicas and the incoming message rate on the 
remaining brokers increases. This is the expected behaviour imho.
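
One way to reason about these observations is a toy model. It assumes the producers use acks=all (this report does not state the acks setting), so a produce request is only acknowledged once every follower currently in the ISR has replicated the record. The broker ids and latency values below are invented for illustration, and the function is not Kafka code.

```python
from typing import Dict, Set


def ack_latency_ms(follower_latency_ms: Dict[int, float],
                   isr_followers: Set[int]) -> float:
    """Toy model: time until all in-sync followers have replicated a record.

    Only followers that are currently in the ISR are waited for, so the
    slowest in-sync follower determines the latency.
    """
    return max((follower_latency_ms[b] for b in isr_followers), default=0.0)


# Hypothetical broker ids: 2 is a fast follower, 3 sits behind the netem delay.
latency = {2: 5.0, 3: 2000.0}

print(ack_latency_ms(latency, {2, 3}))  # slow follower still in the ISR: 2000.0
print(ack_latency_ms(latency, {2}))     # slow follower removed from the ISR: 5.0
```

Under this assumption, the producer latency tracks the delayed broker for as long as it stays in the ISR, and recovers once it is removed.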

Which parameters influence this behaviour? How can I make Kafka remove the 
slow brokers for smaller delays as well, as it does with the 5000 ms delay? 
We would like to guarantee roughly a certain throughput even if one AZ is 
very slow.

I already tried setting "replica.lag.time.max.ms" to very small values, but 
then I only observe that Kafka constantly adds the replicas on the slow nodes 
to the ISR and removes them again.
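
The add/remove cycle can be pictured with a simplified sketch. It assumes a follower counts as out of sync once it has not caught up with the leader for longer than "replica.lag.time.max.ms", and rejoins as soon as it catches up again. The catch-up interval and all other numbers are invented for illustration; this is not the broker's actual implementation.

```python
from typing import List, Tuple


def is_out_of_sync(now_ms: int, last_caught_up_ms: int, max_lag_ms: int) -> bool:
    """Simplified lag check: the follower has been behind for too long."""
    return now_ms - last_caught_up_ms > max_lag_ms


def isr_timeline(catch_up_interval_ms: int, max_lag_ms: int,
                 horizon_ms: int, step_ms: int = 500) -> List[Tuple[int, bool]]:
    """Return (time, in_sync) samples for a follower that only manages to
    catch up once every `catch_up_interval_ms` (hypothetical behaviour)."""
    samples = []
    for now in range(0, horizon_ms + 1, step_ms):
        last_caught_up = (now // catch_up_interval_ms) * catch_up_interval_ms
        samples.append((now, not is_out_of_sync(now, last_caught_up, max_lag_ms)))
    return samples


# Follower catches up every 2000 ms, lag limit set to a small 1000 ms.
for now, in_sync in isr_timeline(2000, 1000, 4000):
    print(now, "in ISR" if in_sync else "out of ISR")
```

With a lag limit below the catch-up interval, the follower alternates between the two states in this model, which resembles the flapping we see; with a limit well above the interval it never leaves the ISR.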

 

 

 

  was:
Hi Kafka community,

 

we are running a few chaos experiments to simulate Kafka's behaviour during 
issues in the data center. To simulate a slow network we run the following 
command on two out of six brokers (the brokers are spread across 3 AZs on AWS, 
we run the command on two brokers in the same AZ):
{code:java}
tc qdisc add dev eth0 root netem delay x ms 
 {code}
 
 At the same time we are running some Kafka producers inserting roughly 4k 
messages per second to a Kafka topic with 10 partitions and 3 replicas and 
using min-isr=2. What we observe is the following:
 * *Introducing a 1000 ms delay (7:10 to 7:20 in the attached picture)*: The 
producer see significant response time delays, the throughput drops to 2k per 
second
 * *Introducing a 2000 ms delay*: The producer delay increases further, the 
throughput drops to 300 messages per second
 * *Introducing a 5000 ms delay*: The Kafka clusters remove the slow brokers 
from the list of active replicas and the incoming messages for the remaining 
brokers increases. This is the expected behaviour imho.

What parameters would influence this behaviour? How can I make sure Kafka shows 
the behaviour like for 5 seconds even for smaller delays? We would like to make 
sure that we can guarantee around a certain throughput, even if one AZ is very 
slow.

I already tried to set "replica.lag.time.max.ms" to very small values, but I 
only observe that Kafka adds and remove the replicas on the slow nodes 
constantly from the set of ISR.

 

 

 


> Performance Degradation during introducing Network Delay
> --------------------------------------------------------
>
>                 Key: KAFKA-13367
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13367
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.5.1
>         Environment: We are running Kafka 2.5 on m4.xlarge VMs on AWS.
>            Reporter: Thomas Heinze
>            Priority: Major
>         Attachments: KafkaChaosTests.png, server.properties
>
>
> Hi Kafka community,
>  
> We are running a few chaos experiments to observe Kafka's behaviour during 
> data center issues. To simulate a slow network, we run the following command 
> on two of our six brokers (the brokers are spread across 3 AZs on AWS; both 
> affected brokers are in the same AZ):
> {code:bash}
> tc qdisc add dev eth0 root netem delay <x>ms
> {code}
>  
> At the same time, Kafka producers write roughly 4k messages per second to a 
> topic with 10 partitions, replication factor 3 and min.insync.replicas=2. We 
> observe the following:
>  * *Introducing a 1000 ms delay (07:10 to 07:20 in the attached picture)*: 
> The producers see significantly higher response times and the throughput 
> drops to 2k messages per second.
>  * *Introducing a 2000 ms delay (07:25 to 07:35 in the attached picture)*: 
> The producer response times increase further and the throughput drops to 300 
> messages per second.
>  * *Introducing a 5000 ms delay*: The Kafka cluster removes the slow brokers 
> from the list of active replicas and the incoming message rate on the 
> remaining brokers increases. This is the expected behaviour imho.
> Which parameters influence this behaviour? How can I make Kafka remove the 
> slow brokers for smaller delays as well, as it does with the 5000 ms delay? 
> We would like to guarantee roughly a certain throughput even if one AZ is 
> very slow.
> I already tried setting "replica.lag.time.max.ms" to very small values, but 
> then I only observe that Kafka constantly adds the replicas on the slow nodes 
> to the ISR and removes them again.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
