[ https://issues.apache.org/jira/browse/KAFKA-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762361#comment-15762361 ]
Apurva Mehta commented on KAFKA-4526:
-------------------------------------

I had a look at the logs from one of the failures, and here is the problem:

# The test has two phases. The first is a bulk producer phase, which seeds the topic with enough data that we can actually test throttled reassignment. The second is the regular produce-consume-validate loop.
# We start the reassignment, and then run the produce-consume-validate loop to ensure that no new messages are lost during reassignment.
# Because the produce-consume-validate pattern uses structured (integer) data in phase two, we require that the consumer start from the end of the log and also start before the producer begins producing messages. If both hold, the consumer will read and validate all the messages sent by the producer. The test has a `wait_until` block, but that only checks for the existence of the consumer process.
# What is seen in the logs is that the producer starts and begins producing messages _before_ the consumer fetches the metadata for all the partitions. As a result, the consumer misses the initial messages, which is consistent across all test failures.
# This can be explained by the recent changes in ducktape: thanks to paramiko, running commands on worker machines is much faster since ssh connections are reused. Hence, the producer starts much faster than before, causing the initial set of messages to be missed by the consumer some of the time.
# The fix is to avoid using the PID of the consumer as a proxy for 'the consumer is ready'. Something like 'partitions assigned' would be a more reliable proxy for the consumer being ready. Note that the original PR of the test had a timeout between consumer and producer start, since there was no more robust method to ensure that the consumer was initialized before the producer started. But since the use of timeouts is (rightly!) discouraged, it was removed. Adding suitable metrics would be a step in the right direction.
# The next step is to leverage suitable metrics (like partitions assigned, if it exists), or to add them to the console consumer, so that we can ensure it is initialized before starting the producer.

> Transient failure in ThrottlingTest.test_throttled_reassignment
> ---------------------------------------------------------------
>
> Key: KAFKA-4526
> URL: https://issues.apache.org/jira/browse/KAFKA-4526
> Project: Kafka
> Issue Type: Bug
> Reporter: Ewen Cheslack-Postava
> Assignee: Apurva Mehta
> Labels: system-test-failure, system-tests
> Fix For: 0.10.2.0
>
>
> This test is seeing transient failures sometimes
> {quote}
> Module: kafkatest.tests.core.throttling_test
> Class: ThrottlingTest
> Method: test_throttled_reassignment
> Arguments:
> {
> "bounce_brokers": false
> }
> {quote}
> This happens with both bounce_brokers = true and false. Fails with
> {quote}
> AssertionError: 1646 acked message did not make it to the Consumer. They are:
> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19...plus
> 1626 more. Total Acked: 174799, Total Consumed: 173153. We validated that the
> first 1000 of these missing messages correctly made it into Kafka's data
> files. This suggests they were lost on their way to the consumer.
> {quote}
> See
> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-12-12--001.1481535295--apache--trunk--62e043a/report.html
> for an example.
> Note that there are a number of similar bug reports for different tests:
> https://issues.apache.org/jira/issues/?jql=text%20~%20%22acked%20message%20did%20not%20make%20it%20to%20the%20Consumer%22%20and%20project%20%3D%20Kafka
> I am wondering if we have a wrong ack setting somewhere that we should be
> specifying as acks=all but is only defaulting to 0?
> It also seems interesting that the missing messages in these recent failures
> seem to always start at 0...
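
The fix proposed in the comment (gate the producer on a consumer readiness signal instead of on the consumer's PID) could look roughly like the sketch below. It is not taken from the test itself. `FakeConsumer`, `partitions_assigned` and `start_producer_when_ready` are hypothetical names used only for illustration, and the local `wait_until` is a simplified helper written for this example in the spirit of the `wait_until` block the test already uses.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Poll `condition` until it returns True, or raise after `timeout_sec`."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


class FakeConsumer:
    """Hypothetical stand-in for the console consumer service."""

    def __init__(self, polls_until_assigned):
        # The process exists as soon as it is launched...
        self.pid = 4242
        # ...but partition assignment only completes some time later.
        self._polls_left = polls_until_assigned

    def alive(self):
        # The weak check: true long before the consumer can read anything.
        return self.pid is not None

    def partitions_assigned(self):
        # The stronger check: true only once assignment has completed.
        if self._polls_left > 0:
            self._polls_left -= 1
            return False
        return True


def start_producer_when_ready(consumer, start_producer, timeout_sec=5):
    """Start the producer only after the consumer reports assigned partitions."""
    wait_until(
        consumer.partitions_assigned,
        timeout_sec=timeout_sec,
        backoff_sec=0.01,
        err_msg="Consumer was not assigned partitions in time",
    )
    start_producer()


if __name__ == "__main__":
    consumer = FakeConsumer(polls_until_assigned=3)
    events = []
    start_producer_when_ready(consumer, lambda: events.append("producer started"))
    print(events)
```

With a PID-based check, `alive()` is true immediately, so the producer could start while `partitions_assigned()` is still false, which is the window in which the initial messages are missed. In the real test the readiness condition would have to come from an actual consumer metric or log line, as the last step of the comment notes.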