Jason Gustafson created KAFKA-3694:
--------------------------------------

             Summary: Transient system test failure ReplicationTest.test_replication_with_broker_failure.security_protocol
                 Key: KAFKA-3694
                 URL: https://issues.apache.org/jira/browse/KAFKA-3694
             Project: Kafka
          Issue Type: Bug
          Components: system tests
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson
We've seen this failure in several recent builds:

{code}
====================================================================================================
test_id:    2016-05-10--001.kafkatest.tests.core.replication_test.ReplicationTest.test_replication_with_broker_failure.security_protocol=PLAINTEXT.failure_mode=hard_bounce.broker_type=leader
status:     FAIL
run time:   2 minutes 1.184 seconds

    Kafka server didn't finish startup
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/tests/runner.py", line 106, in run_all_tests
    data = self.run_single_test()
  File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/tests/runner.py", line 162, in run_single_test
    return self.current_test_context.function(self.current_test)
  File "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.5.0-py2.7.egg/ducktape/mark/_mark.py", line 331, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/core/replication_test.py", line 157, in test_replication_with_broker_failure
    self.run_produce_consume_validate(core_test_action=lambda: failures[failure_mode](self, broker_type))
  File "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py", line 79, in run_produce_consume_validate
    raise e
TimeoutError: Kafka server didn't finish startup
{code}

After some investigation, the problem seems to be caused by an unexpected partition leader change, which the controller triggers proactively when the preferred leader becomes alive again. The test currently assumes that it is safe to restart the broker as soon as it observes a leadership change, since that change is typically caused by a ZooKeeper session timeout.
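To make the flawed assumption concrete, here is a minimal sketch (plain Python) of the restart logic the test effectively follows; {{kill}}, {{restart}}, and {{leader_of}} are hypothetical stand-ins for the real service calls in replication_test.py, not actual kafkatest APIs:

```python
import time

def hard_bounce_restart(kill, restart, leader_of, partition, old_leader,
                        timeout_sec=5, backoff_sec=0.01):
    """Kill the leader, then restart it as soon as ANY leadership
    change is observed. This is the unsafe assumption: the controller
    may move leadership (preferred leader election) before the old
    broker's ZooKeeper session has actually expired."""
    kill(old_leader)
    deadline = time.time() + timeout_sec
    while time.time() < deadline and leader_of(partition) == old_leader:
        time.sleep(backoff_sec)
    # Possibly too early: old_leader's ephemeral znode may still exist.
    restart(old_leader)
```

When the restart races the session expiry, the broker finds its own brokerId still registered and shuts itself down, producing the timeout above.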
However, in this case, the session has not actually expired when the leadership change occurs. So after starting up, the broker sees its brokerId still registered in ZooKeeper and immediately shuts down, which causes the test failure above. To fix the problem, we should probably add a stronger check to ensure that the broker has actually been deregistered from ZooKeeper before restarting it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
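The stronger check could look like the following sketch (plain Python); {{is_broker_registered}} is a hypothetical callback standing in for a ZooKeeper lookup of the broker's ephemeral znode under /brokers/ids:

```python
import time

def wait_for_deregistration(is_broker_registered, broker_id,
                            timeout_sec=30, backoff_sec=0.5):
    """Block until the broker's registration is gone, or time out.

    is_broker_registered(broker_id) -> bool is a stand-in for checking
    whether /brokers/ids/<broker_id> still exists in ZooKeeper.
    Returns True once the znode is gone, False on timeout.
    """
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        if not is_broker_registered(broker_id):
            return True
        time.sleep(backoff_sec)
    return False
```

Gating the restart on this condition, rather than on the leadership change alone, would make the test immune to proactive preferred-leader elections.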