li xiangyuan created KAFKA-9211:
-----------------------------------

             Summary: kafka upgrade 2.3.0 cause produce speed decrease
                 Key: KAFKA-9211
                 URL: https://issues.apache.org/jira/browse/KAFKA-9211
             Project: Kafka
          Issue Type: Bug
          Components: controller, producer 
    Affects Versions: 2.3.0
            Reporter: li xiangyuan
         Attachments: broker-jstack.txt, producer-jstack.txt

Recently we tried to upgrade Kafka from 0.10.0.1 to 2.3.0.

We have 15 clusters in our production environment, each with 3 to 6 brokers.

We understand that a Kafka upgrade should go as follows:
      1. Replace the code with 2.3.0.jar and restart all brokers one by one.
      2. Unset inter.broker.protocol.version=0.10.0.1 and restart all brokers
         one by one.
      3. Unset log.message.format.version=0.10.0.1 and restart all brokers
         one by one.
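
As a rough sketch of the three steps above, the broker's server.properties
would look like this after each step. These are three separate snapshots of
the same file, not one file. The explicit "2.3" values are an assumption shown
only as an alternative to unsetting; we simply removed the lines so the broker
falls back to its default.

```properties
# --- snapshot after step 1: 2.3.0 code, old protocol and format still pinned
inter.broker.protocol.version=0.10.0.1
log.message.format.version=0.10.0.1

# --- snapshot after step 2: protocol pin removed (assumed equivalent: =2.3)
# inter.broker.protocol.version=2.3
log.message.format.version=0.10.0.1

# --- snapshot after step 3: message format pin removed too (assumed equivalent: =2.3)
# inter.broker.protocol.version=2.3
# log.message.format.version=2.3
```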
 
So far we have completed steps 1 and 2 in 12 clusters. But when we tried step 2
on the remaining clusters (which had already completed step 1), we found that
the produce rate of some topics dropped badly.
     We have investigated this issue for a long time, but since we cannot
experiment in the production environment and cannot reproduce it in the test
environment, we have not found the root cause.
For now I can only describe the situation in as much detail as I know, and hope
someone can help us.
 
1. Because of bug KAFKA-8653, I added the code below to the
handleJoinGroupRequest function in KafkaApis.scala:
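{code:java}
// Workaround for KAFKA-8653: if the request carries no usable rebalance
// timeout, fall back to the session timeout.
if (rebalanceTimeoutMs <= 0) {
  rebalanceTimeoutMs = joinGroupRequest.data.sessionTimeoutMs
}
{code}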

2. One of the clusters where the upgrade failed has 6 brokers (8 cores, 16 GB
each) and about 200 topics with 2 replicas. Every broker holds 3000+
partitions, 1500+ of them as leader, but most topics have a very low produce
rate, less than about 50 messages/sec. Only one topic, with 300 partitions,
receives more than 2500 messages/sec, and more than 20 consumer groups consume
from it.

So the whole cluster handles about 4K messages/sec, 11 MBytes/sec in and
240 MBytes/sec out, and more than 90% of the traffic comes from that one topic
with 2500 messages/sec.

When we unset inter.broker.protocol.version=0.10.0.1 on 5 or 6 of the servers
and restart them, the produce rate of this topic drops to about 200
messages/sec. I don't know whether the way we use Kafka could trigger a problem.
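
As a quick sanity check of the traffic figures in item 2, the small sketch
below derives the average message size and the read fan-out. It assumes that
"MBytes" means 10^6 bytes and that the in/out numbers cover the same time
window; the class and method names are only illustrative.

```java
// Sanity check of the reported traffic figures (assumes 1 MByte = 10^6 bytes).
final class TrafficCheck {
    /** Average payload per produced message, in bytes. */
    static double avgMessageBytes(double bytesInPerSec, double messagesPerSec) {
        return bytesInPerSec / messagesPerSec;
    }

    /** How many times each incoming byte is read back out of the cluster. */
    static double fanOut(double bytesOutPerSec, double bytesInPerSec) {
        return bytesOutPerSec / bytesInPerSec;
    }

    public static void main(String[] args) {
        double in = 11_000_000, out = 240_000_000, rate = 4_000;
        System.out.println("avg message size (bytes): " + avgMessageBytes(in, rate));
        System.out.println("fan-out: " + fanOut(out, in));
    }
}
```

This gives about 2750 bytes per message and a fan-out of roughly 22, which is
in line with the 20+ consumer groups reading the large topic.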

3. We use Kafka through spring-kafka and set KafkaTemplate's autoFlush=true,
so every producer.send call is immediately followed by producer.flush. I know
that flushing this way reduces produce performance dramatically, but at least
nothing seemed wrong before upgrade step 2. I now suspect it may have become a
problem after the upgrade.
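
To illustrate why send-then-flush makes the produce rate so sensitive to
broker response time, here is a back-of-the-envelope sketch. Since flush blocks
until the outstanding records are acknowledged, each sending thread can
complete at most one send per produce round trip. The thread count of 50 is
purely hypothetical (the real number is not part of this report), and the
model assumes every thread is always busy sending.

```java
// Rough bound for producers that flush after every send.
// The thread count used in main() is hypothetical, not a measured value.
final class FlushBound {
    /** Upper bound on messages/sec when every send waits for one round trip. */
    static double maxRate(int producerThreads, double roundTripMs) {
        return producerThreads * 1000.0 / roundTripMs;
    }

    /** Round-trip time in ms implied by an observed rate, all threads saturated. */
    static double impliedRoundTripMs(int producerThreads, double observedRate) {
        return producerThreads * 1000.0 / observedRate;
    }

    public static void main(String[] args) {
        int threads = 50; // hypothetical
        System.out.println(impliedRoundTripMs(threads, 2500)); // rate before step 2
        System.out.println(impliedRoundTripMs(threads, 200));  // rate after step 2
    }
}
```

Under these assumptions the drop from 2500 to 200 messages/sec corresponds to
the produce round trip growing from 20 ms to 250 ms, that is, 12.5 times
slower regardless of the assumed thread count. A pure latency increase like
this would also leave the CPU mostly idle on both sides.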

4. I noticed that while the produce rate is down, consumer groups with a large
message lag keep consuming without any change or decrease in consume rate, so
I guess only ProduceRequest handling slows down, not FetchRequest.

5. We have not set any throttle configuration, and all producers use acks=1
(so it is not caused by slow replica fetching on the brokers). When the problem
is triggered, CPU usage drops on both the servers and the producers, and the
servers' ioutil stays below 30%, so it should not be a hardware problem.

6. The problem is triggered almost every time (close to 100%) once most
brokers have completed upgrade step 2: after an automatic leader replica
election runs, we observe the produce rate drop. We then have to downgrade the
brokers (set inter.broker.protocol.version=0.10.0.1) and restart them one by
one before things return to normal. In some clusters we have to downgrade all
brokers, but in others we can leave 1 or 2 brokers without downgrading. I
noticed that the broker that does not need to be downgraded is the controller.

7. I have taken jstack dumps of a producer and of the brokers (attached).
Although they were not taken on the same cluster, the threads on both sides
appear to be really idle.

8. Both the 0.10.0.1 and the 2.3.0 kafka-client trigger this problem.

9. Apart from the largest topic, whose produce rate drops every time, other
topics drop at random. topicA may drop in the first upgrade attempt but not in
the next, while topicB may not drop in the first attempt but drop in another
one.

10. In fact our largest cluster has the same topic and consumer group usage
pattern as described above, but its largest topic receives 12,000
messages/sec, and there the upgrade already fails at step 1 (just running
2.3.0.jar).


Any help would be greatly appreciated, thanks. I'm very sad now...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
