[ https://issues.apache.org/jira/browse/KAFKA-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279449#comment-15279449 ]
ASF GitHub Bot commented on KAFKA-3694:
---------------------------------------

GitHub user hachikuji opened a pull request:

    https://github.com/apache/kafka/pull/1365

    KAFKA-3694: Ensure broker Zk deregistration prior to restart in ReplicationTest

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hachikuji/kafka KAFKA-3694

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/1365.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1365

----
commit 50630af74bcf78abe2b3d2a5c07b3347a914543e
Author: Jason Gustafson <ja...@confluent.io>
Date:   2016-05-11T00:16:42Z

    KAFKA-3694: Ensure broker Zk deregistration prior to restart in ReplicationTest

----

> Transient system test failure
> ReplicationTest.test_replication_with_broker_failure.security_protocol
> ----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-3694
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3694
>             Project: Kafka
>          Issue Type: Bug
>          Components: system tests
>            Reporter: Jason Gustafson
>            Assignee: Jason Gustafson
>
> We've seen this failure in several recent builds:
> {code}
> ====================================================================================================
> test_id:    2016-05-10--001.kafkatest.tests.core.replication_test.ReplicationTest.test_replication_with_broker_failure.security_protocol=PLAINTEXT.failure_mode=hard_bounce.broker_type=leader
> status:     FAIL
> run time:   2 minutes 1.184 seconds
>     Kafka server didn't finish startup
>     Traceback (most recent call last):
>       File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/tests/runner.py", line 106, in run_all_tests
>         data = self.run_single_test()
>       File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/tests/runner.py", line 162, in run_single_test
>         return self.current_test_context.function(self.current_test)
>       File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/mark/_mark.py", line 331, in wrapper
>         return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>       File "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/core/replication_test.py", line 157, in test_replication_with_broker_failure
>         self.run_produce_consume_validate(core_test_action=lambda: failures[failure_mode](self, broker_type))
>       File "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py", line 79, in run_produce_consume_validate
>         raise e
>     TimeoutError: Kafka server didn't finish startup
> {code}
> After some investigation, the problem seems to be caused by an unexpected
> partition leader change which is triggered proactively by the controller when
> the preferred leader becomes alive again. The test currently assumes that it
> is safe to restart the broker as soon as it observes a leadership change
> since this is typically caused by a Zk session timeout. However, in this
> case, the session hasn't actually expired when the leadership change occurs.
> So after starting up, the broker sees its brokerId still registered and
> immediately shuts down, which causes the test failure above. To fix the
> problem, we should probably have a stronger check to ensure that the broker
> has actually been deregistered from Zk prior to restarting.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)