gharris1727 commented on code in PR #13036: URL: https://github.com/apache/kafka/pull/13036#discussion_r1055697654
########## core/src/test/scala/integration/kafka/api/TransactionsExpirationTest.scala: ########## @@ -205,9 +211,9 @@ class TransactionsExpirationTest extends KafkaServerTestHarness { serverProps.put(KafkaConfig.AutoLeaderRebalanceEnableProp, false.toString) serverProps.put(KafkaConfig.GroupInitialRebalanceDelayMsProp, "0") serverProps.put(KafkaConfig.TransactionsAbortTimedOutTransactionCleanupIntervalMsProp, "200") - serverProps.put(KafkaConfig.TransactionalIdExpirationMsProp, "1000") + serverProps.put(KafkaConfig.TransactionalIdExpirationMsProp, "10000") Review Comment: > was this much of an increase required? The larger the timeouts, the more forgiving the test. For example the given 1000ms timeout may have the following success rates: * 99% success rate on a developer laptop * 70% success rate in CI. (15 failures out of 50 runs in 25 builds) * 50% success rate when simulating a 0.3 CPU environment After changing the timeouts I was seeing >90% success rate in my 0.3 CPU environment (i didn't run it too much to confirm), which should hopefully be enough to bring up the CI success rate as well. Below 0.3 CPU the test flaked out much more severely, like the cluster wouldn't come all the way up, etc. If the test logic is as de-flaked as the infrastructure itself then I consider that good enough. > Next time I should write tests that use mock time Yes, if possible you should always prefer a mock time test. When I'm debugging flakey tests, they're _nearly always_ flakey because of time passing strangely. In a mocked time environment you are completely isolated from the speed of the JVM. One last off-topic comment: When writing a test, think about "if someone pressed pause on this part of the test for a long time, would the test still pass?" Because that's what the effect of a CPU limited environment is: threads are de-scheduled and stop executing, effectively pausing in between lines of code. In this case, the test paused between the producer calls and the producerState call for more than a second, and the expiration on the broker fired first. Now it's much less likely that it will be paused for more than 10 seconds, but there's still always a chance. If this was a mocked time environment, and you could be certain that the producerState would be executed before the transaction expiration, and there wouldn't be any room for flakiness. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org