[jira] [Commented] (KAFKA-3694) Transient system test failure ReplicationTest.test_replication_with_broker_failure.security_protocol

ASF GitHub Bot (JIRA) Tue, 10 May 2016 20:45:49 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279449#comment-15279449
 ]


ASF GitHub Bot commented on KAFKA-3694:
---------------------------------------

GitHub user hachikuji opened a pull request:

    https://github.com/apache/kafka/pull/1365

    KAFKA-3694: Ensure broker Zk deregistration prior to restart in 
ReplicationTest

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hachikuji/kafka KAFKA-3694

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/1365.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1365
    
----
commit 50630af74bcf78abe2b3d2a5c07b3347a914543e
Author: Jason Gustafson <ja...@confluent.io>
Date:   2016-05-11T00:16:42Z

    KAFKA-3694: Ensure broker Zk deregistration prior to restart in 
ReplicationTest

----


> Transient system test failure 
> ReplicationTest.test_replication_with_broker_failure.security_protocol
> ----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-3694
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3694
>             Project: Kafka
>          Issue Type: Bug
>          Components: system tests
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>
> We've seen this failure in several recent builds:
> {code}
> ====================================================================================================
> test_id:    
> 2016-05-10--001.kafkatest.tests.core.replication_test.ReplicationTest.test_replication_with_broker_failure.security_protocol=PLAINTEXT.failure_mode=hard_bounce.broker_type=leader
> status:     FAIL
> run time:   2 minutes 1.184 seconds
>     Kafka server didn't finish startup
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/tests/runner.py",
>  line 106, in run_all_tests
>     data = self.run_single_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/tests/runner.py",
>  line 162, in run_single_test
>     return self.current_test_context.function(self.current_test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 331, in wrapper
>     return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/core/replication_test.py",
>  line 157, in test_replication_with_broker_failure
>     self.run_produce_consume_validate(core_test_action=lambda: 
> failures[failure_mode](self, broker_type))
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 79, in run_produce_consume_validate
>     raise e
> TimeoutError: Kafka server didn't finish startup
> {code}
> After some investigation, the problem seems to be caused by an unexpected 
> partition leader change which is triggered proactively by the controller when 
> the preferred leader becomes alive again. The test currently assumes that it 
> is safe to restart the broker as soon as it observes a leadership change 
> since this is typically caused by a Zk session timeout. However, in this 
> case, the session hasn't actually expired when the leadership change occurs. 
> So after starting up, the broker sees its brokerId still registered and 
> immediately shuts down, which causes the test failure above. To fix the 
> problem, we should probably have a stronger check to ensure that the broker 
> has actually been deregesitered from Zk prior to restarting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-3694) Transient system test failure ReplicationTest.test_replication_with_broker_failure.security_protocol

Reply via email to