[GitHub] [kafka] gharris1727 commented on a diff in pull request #13036: KAFKA-14534: Reduce flakiness in TransactionsExpirationTest

GitBox Thu, 22 Dec 2022 09:41:09 -0800


gharris1727 commented on code in PR #13036:
URL: https://github.com/apache/kafka/pull/13036#discussion_r1055697654



##########
core/src/test/scala/integration/kafka/api/TransactionsExpirationTest.scala:
##########
@@ -205,9 +211,9 @@ class TransactionsExpirationTest extends 
KafkaServerTestHarness {
     serverProps.put(KafkaConfig.AutoLeaderRebalanceEnableProp, false.toString)
     serverProps.put(KafkaConfig.GroupInitialRebalanceDelayMsProp, "0")
     
serverProps.put(KafkaConfig.TransactionsAbortTimedOutTransactionCleanupIntervalMsProp,
 "200")
-    serverProps.put(KafkaConfig.TransactionalIdExpirationMsProp, "1000")
+    serverProps.put(KafkaConfig.TransactionalIdExpirationMsProp, "10000")

Review Comment:
   > was this much of an increase required? 
   
   The larger the timeouts, the more forgiving the test. For example the given 
1000ms timeout may have the following success rates:
   * 99% success rate on a developer laptop
   * 70% success rate in CI. (15 failures out of 50 runs in 25 builds)
   * 50% success rate when simulating a 0.3 CPU environment
   
   After changing the timeouts I was seeing >90% success rate in my 0.3 CPU 
environment (i didn't run it too much to confirm), which should hopefully be 
enough to bring up the CI success rate as well.
   
   Below 0.3 CPU the test flaked out much more severely, like the cluster 
wouldn't come all the way up, etc. If the test logic is as de-flaked as the 
infrastructure itself then I consider that good enough.
   
   > Next time I should write tests that use mock time
   
   Yes, if possible you should always prefer a mock time test. When I'm 
debugging flakey tests, they're _nearly always_ flakey because of time passing 
strangely. In a mocked time environment you are completely isolated from the 
speed of the JVM.
   
   One last off-topic comment: When writing a test, think about "if someone 
pressed pause on this part of the test for a long time, would the test still 
pass?" Because that's what the effect of a CPU limited environment is: threads 
are de-scheduled and stop executing, effectively pausing in between lines of 
code.
   
   In this case, the test paused between the producer calls and the 
producerState call for more than a second, and the expiration on the broker 
fired first. Now it's much less likely that it will be paused for more than 10 
seconds, but there's still always a chance. If this was a mocked time 
environment, and you could be certain that the producerState would be executed 
before the transaction expiration, and there wouldn't be any room for flakiness.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [kafka] gharris1727 commented on a diff in pull request #13036: KAFKA-14534: Reduce flakiness in TransactionsExpirationTest

Reply via email to