[ https://issues.apache.org/jira/browse/KAFKA-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949526#comment-15949526 ]
Jason Gustafson commented on KAFKA-4689:
----------------------------------------

I think I finally see what is happening here. I looked at this test case: http://confluent-systest.s3-website-us-west-2.amazonaws.com/confluent-kafka-0-10-2-system-test-results/?prefix=2017-03-30--001.1490888066--apache--0.10.2--2397269/. Here is the assertion error:
{code}
AssertionError: Current position 30897 greater than the total number of consumed records 30892
{code}
We are apparently missing 5 records. Looking in the test logs, I found this warning:
{code}
[WARNING - 2017-03-30 11:01:19,615 - verifiable_consumer - _update_global_position - lineno:222]: Expected next consumed offset of 20037 for partition TopicPartition(topic=u'test_topic', partition=0), but instead saw 20042
{code}
This suggests a gap in the consumed data. However, the event output clearly shows that these offsets were consumed:
{code}
{"timestamp":1490871669110,"count":5,"name":"records_consumed","partitions":[{"topic":"test_topic","partition":0,"count":5,"minOffset":20037,"maxOffset":20041}]}
{code}
And from the consumer log, we can see that the offset was committed, even though the commit never appears in the event output:
{code}
[2017-03-30 11:01:09,110] TRACE Sending OffsetCommit request with {test_topic-0=OffsetAndMetadata{offset=20042, metadata=''}} to coordinator worker9:9092 (id: 2147483646 rack: null) for group test_group_id (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
{code}
We can conclude that neither the "records_consumed" event nor the "offsets_committed" event was delivered to the test framework before the hard failure. In general, it is difficult to write reliable hard-failure test cases that depend on event output from the process experiencing the hard failure: there is always a delay between the occurrence of an event and its emission and delivery. We probably need to weaken the test assertions to allow for missed events, but the test may no longer be useful if we do so. For the time being, I think we should disable the hard failure scenarios.
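To make the failure mode concrete, here is a minimal sketch (not the actual verifiable_consumer code) simulating a consumer whose last "records_consumed" event is dropped by a hard kill. Everything except the event name is a simplified assumption for illustration: the test driver's consumed count then lags behind the consumer's true position, tripping exactly this assertion.
{code}
import json

class FakeVerifiableConsumer(object):
    def __init__(self):
        self.position = 0   # next offset the consumer will fetch
        self.emitted = []   # events that actually reached the test driver

    def poll(self, records, deliver_event=True):
        min_offset = self.position
        self.position += records
        event = {"name": "records_consumed", "count": records,
                 "minOffset": min_offset, "maxOffset": self.position - 1}
        if deliver_event:   # a hard kill can drop the event in flight
            self.emitted.append(json.dumps(event))

consumer = FakeVerifiableConsumer()
consumer.poll(5)                        # event delivered normally
consumer.poll(5, deliver_event=False)   # hard failure before emission

total_consumed = sum(json.loads(e)["count"] for e in consumer.emitted)
# position (10) > total_consumed (5): the spurious assertion failure
assert consumer.position > total_consumed
{code}
Note that any weakened assertion would have to tolerate a gap of at least one poll's worth of records per hard kill, which is why the check may lose most of its value.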
> OffsetValidationTest fails validation with "Current position greater than the total number of consumed records"
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4689
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4689
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer, system tests
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Jason Gustafson
>              Labels: system-test-failure
>
> {quote}
> ====================================================================================================
> test_id:    kafkatest.tests.client.consumer_test.OffsetValidationTest.test_consumer_bounce.clean_shutdown=False.bounce_mode=all
> status:     FAIL
> run time:   1 minute 49.834 seconds
>
>     Current position greater than the total number of consumed records
> Traceback (most recent call last):
>   File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", line 123, in run
>     data = self.run_test()
>   File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", line 176, in run_test
>     return self.test_context.function(self.test)
>   File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py", line 321, in wrapper
>     return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/client/consumer_test.py", line 157, in test_consumer_bounce
>     "Current position greater than the total number of consumed records"
> AssertionError: Current position greater than the total number of consumed records
> {quote}
> See also https://issues.apache.org/jira/browse/KAFKA-3513?focusedCommentId=15791790&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15791790 for another instance of this bug, which indicates the issue goes back at least as far as 1/17/2017. Note that I don't think we've seen this in 0.10.1 yet.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)