[ 
https://issues.apache.org/jira/browse/KAFKA-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949526#comment-15949526
 ] 

Jason Gustafson commented on KAFKA-4689:
----------------------------------------

I think I finally see what is happening here. I looked at this test case: 
http://confluent-systest.s3-website-us-west-2.amazonaws.com/confluent-kafka-0-10-2-system-test-results/?prefix=2017-03-30--001.1490888066--apache--0.10.2--2397269/.
 

Here is the assertion error:
{code}
AssertionError: Current position 30897 greater than the total number of 
consumed records 30892
{code}

We are missing 5 records apparently. Looking in the test logs, I found this 
warning:
{code}
[WARNING - 2017-03-30 11:01:19,615 - verifiable_consumer - 
_update_global_position - lineno:222]: Expected next consumed offset of 20037 
for partition TopicPartition(topic=u'test_topic', partition=0), but instead saw 
20042
{code}

This suggests a gap in the consumed data. However, in the event output, we see 
clearly that these offsets were consumed:
{code}
{"timestamp":1490871669110,"count":5,"name":"records_consumed","partitions":[{"topic":"test_topic","partition":0,"count":5,"minOffset":20037,"maxOffset":20041}]}
{code}

And from the consumer log, we can see the offset is committed, though we do not 
see the offset commit in the event output:
{code}
[2017-03-30 11:01:09,110] TRACE Sending OffsetCommit request with 
{test_topic-0=OffsetAndMetadata{offset=20042, metadata=''}} to coordinator 
worker9:9092 (id: 2147483646 rack: null) for group test_group_id 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
{code}

We can conclude that both the "records_consumed" event and the 
"offsets_committed" event were not delivered to the test framework prior to the 
hard failure. In general, it seems difficult to write reliable hard failure 
test cases which depends on event output from the process experiencing the hard 
failure. There is always a delay between the occurrence of the event and its 
emission and delivery. We probably need to weaken the test assertions to deal 
with the possibility of having missed events, but it's possible that the test 
is longer be useful if we do so. For the time being, I think we should disable 
the hard failure scenarios.

> OffsetValidationTest fails validation with "Current position greater than the 
> total number of consumed records"
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4689
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4689
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer, system tests
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Jason Gustafson
>              Labels: system-test-failure
>
> {quote}
> ====================================================================================================
> test_id:    
> kafkatest.tests.client.consumer_test.OffsetValidationTest.test_consumer_bounce.clean_shutdown=False.bounce_mode=all
> status:     FAIL
> run time:   1 minute 49.834 seconds
>     Current position greater than the total number of consumed records
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
>     data = self.run_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
>     return self.test_context.function(self.test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
>     return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/client/consumer_test.py",
>  line 157, in test_consumer_bounce
>     "Current position greater than the total number of consumed records"
> AssertionError: Current position greater than the total number of consumed 
> records
> {quote}
> See also 
> https://issues.apache.org/jira/browse/KAFKA-3513?focusedCommentId=15791790&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15791790
>  which is another instance of this bug, which indicates the issue goes back 
> at least as far as 1/17/2017. Note that I don't think we've seen this in 
> 0.10.1 yet.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to