showuon opened a new pull request #10871:
URL: https://github.com/apache/kafka/pull/10871


   There may still be other issues with this test, as described 
[here](https://issues.apache.org/jira/browse/KAFKA-8940?focusedCommentId=17214850&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17214850)
 by @ableegoldman, but I found the reason why it has been failing so 
frequently lately: we increased the session timeout to 45 sec in 
KIP-735.
   
   The Jenkins failure trend on the `trunk` branch can be seen 
[here](https://ci-builds.apache.org/job/Kafka/job/kafka/job/trunk/217/testReport/junit/org.apache.kafka.streams.integration/SmokeTestDriverIntegrationTest/history/):
   
![image](https://user-images.githubusercontent.com/43372967/121793776-0ee2a700-cc35-11eb-9648-f93fe3e34976.png)
   This test had not failed since build #168, then began failing again at 
build #206 and in later builds.
   
   Increasing the session timeout affects this test because the test keeps 
adding new stream clients and removing old ones, so that only 3 stream clients 
are alive at a time. The problem is that when an old stream client is closed, 
no rebalance is triggered immediately, because the stream clients are all 
static members as described in KIP-345. A group rebalance is triggered only 
when `session.timeout` expires. As a result, whenever an old client is closed, 
some tasks are idle for at least 45 sec.
   
   Also, this test has 2 timeout conditions that fail it before verification 
passes:
   1. a 6-minute overall timeout
   2. polling 30 times (5 seconds each) without getting any data (that is, 
5 * 30 = 150 sec without consuming any data)
   
   For (1), in my run with the 45 sec session timeout, the test created 8 
stream clients, which means 5 clients were closed. Each closed client needs 45 
sec to trigger a rebalance, so for 45 * 5 = 225 sec (~4 mins) some tasks are 
not working.
   For (2), rebalancing takes some time whenever a new client is created and 
an old client is closed. With the 45 sec session timeout, only ~100 sec of the 
150 sec remain. In a slow Jenkins environment, the test might reach the limit 
of 30 polls without getting any data.
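   The arithmetic for both conditions can be sketched roughly as below. This is 
only an illustration, not code from the test: the class and method names are 
hypothetical, and it assumes each closed static member stalls its tasks for one 
full session timeout with no overlap between stalls.

```java
public class SessionTimeoutBudget {
    // Seconds during which some tasks are idle, assuming (hypothetically) that each
    // closed static member stalls its tasks for one full session timeout and that
    // the stalls do not overlap.
    static int stalledSeconds(int sessionTimeoutSec, int closedClients) {
        return sessionTimeoutSec * closedClients;
    }

    // Seconds of the "no data" polling budget that remain after waiting out
    // one session timeout.
    static int remainingNoDataBudgetSeconds(int maxPolls, int pollSec, int sessionTimeoutSec) {
        return maxPolls * pollSec - sessionTimeoutSec;
    }

    public static void main(String[] args) {
        // Condition (1): 8 clients created, 3 kept alive, so 5 closed.
        System.out.println(stalledSeconds(45, 8 - 3));               // 225
        // Condition (2): 30 polls of 5 sec each, minus one 45 sec session timeout.
        System.out.println(remainingNoDataBudgetSeconds(30, 5, 45)); // 105
    }
}
```

   With these numbers, 225 sec of the 6-minute (360 sec) limit is spent waiting 
for session timeouts alone, which leaves little slack on a slow build machine.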
   
   Therefore, decreasing the session timeout makes this test complete faster 
and more reliably.
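   As a rough sketch of what a lower session timeout for a static member could 
look like, the snippet below builds plain `java.util.Properties` with the 
standard consumer config keys `group.instance.id` and `session.timeout.ms`. 
The class name, the instance ID, and the 10 sec value are assumptions made for 
illustration; they are not necessarily what this PR uses.

```java
import java.util.Properties;

public class StaticMemberTestConfig {
    // Builds client properties for a static member with an explicit session timeout.
    // The keys are standard consumer configs; the helper itself is hypothetical.
    static Properties clientProps(String instanceId, int sessionTimeoutMs) {
        Properties props = new Properties();
        // Static membership (KIP-345): a fixed instance ID means closing the client
        // does not trigger a rebalance until the session timeout expires.
        props.put("group.instance.id", instanceId);
        // A shorter timeout shortens the window in which a closed client's tasks are idle.
        props.put("session.timeout.ms", sessionTimeoutMs);
        return props;
    }

    public static void main(String[] args) {
        // Assumed values: 10 sec instead of the 45 sec default from KIP-735.
        Properties props = clientProps("smoke-test-client-1", 10_000);
        System.out.println(props.get("session.timeout.ms"));
    }
}
```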
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

