[ https://issues.apache.org/jira/browse/KAFKA-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880684#comment-15880684 ]
Rajini Sivaram commented on KAFKA-4779:
---------------------------------------

I couldn't recreate the failure, but the phase_two security upgrade test, when run with different client_protocol and broker_protocol values, currently performs a disruptive upgrade that stops produce and consume while the upgrade is in progress. As a result, the consumer can time out if the upgrade takes slightly longer than expected.

A non-disruptive upgrade of a cluster to enable new security protocols is described in the docs (http://kafka.apache.org/documentation/#security_rolling_upgrade). The new protocols must first be enabled with an incremental bounce. The inter-broker protocol is then updated with another incremental bounce. Finally, the old protocol is removed.

When client_protocol (SASL_PLAINTEXT) and broker_protocol (SSL) are being upgraded to different protocols starting from PLAINTEXT, both SASL_PLAINTEXT and SSL must be enabled before the inter-broker protocol is changed to SSL. The test was enabling only SASL_PLAINTEXT. As a result, inter-broker communication was broken during the upgrade, causing produce and consume to fail until the cluster got back to a good state.

Since the purpose of the test is to verify non-disruptive upgrade, I have changed the test to enable both SASL_PLAINTEXT and SSL first, so that the upgrade is performed without disrupting producers or consumers.

> Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py
> ----------------------------------------------------------------------------
>
>                 Key: KAFKA-4779
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4779
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Apurva Mehta
>            Assignee: Rajini Sivaram
>
> This test failed on 01/29, on both trunk and 0.10.2, error message:
> {noformat}
> The consumer has terminated, or timed out, on node ubuntu@worker3.
> Traceback (most recent call last):
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", line 123, in run
>     data = self.run_test()
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", line 176, in run_test
>     return self.test_context.function(self.test)
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py", line 321, in wrapper
>     return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py", line 148, in test_rolling_upgrade_phase_two
>     self.run_produce_consume_validate(self.roll_in_secured_settings, client_protocol, broker_protocol)
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py", line 100, in run_produce_consume_validate
>     self.stop_producer_and_consumer()
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py", line 87, in stop_producer_and_consumer
>     self.check_alive()
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py", line 79, in check_alive
>     raise Exception(msg)
> Exception: The consumer has terminated, or timed out, on node ubuntu@worker3.
> {noformat}
> Looks like the console consumer times out:
> {noformat}
> [2017-01-30 04:56:00,972] ERROR Error processing message, terminating consumer process: (kafka.tools.ConsoleConsumer$)
> kafka.consumer.ConsumerTimeoutException
>     at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:90)
>     at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:120)
>     at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:75)
>     at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:50)
>     at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
> {noformat}
> A bunch of these security_rolling_upgrade tests failed, and in all cases, the producer produced ~15k messages, of which ~7k were acked, and the consumer only got around ~2600 before timing out.
> There are a lot of messages like the following for different request types on the producer and consumer:
> {noformat}
> [2017-01-30 05:13:35,954] WARN Received unknown topic or partition error in produce request on partition test_topic-0. The topic/partition may not exist or the user may not have Describe access to it (org.apache.kafka.clients.producer.internals.Sender)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
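
For reference, the three-step non-disruptive sequence described in the comment above (enable the new protocols, switch the inter-broker protocol, remove the old protocol) can be sketched as the broker settings applied at each incremental bounce. This is only an illustration and not the code in security_rolling_upgrade_test.py: the function names, the bind host, and the port numbers are hypothetical. `listeners` and `security.inter.broker.protocol` are the broker settings involved; the SSL keystore and SASL/JAAS settings a real cluster would also need are omitted.

```python
# Illustrative sketch only. Function names, host, and ports are hypothetical.
# SSL keystore/truststore and SASL/JAAS settings are intentionally left out.

PORTS = {"PLAINTEXT": 9092, "SSL": 9093, "SASL_PLAINTEXT": 9094, "SASL_SSL": 9095}


def _dedupe(protocols):
    """Drop repeated protocols while keeping their first-seen order."""
    seen = []
    for protocol in protocols:
        if protocol not in seen:
            seen.append(protocol)
    return seen


def _listeners(protocols, host="0.0.0.0"):
    """Build a value for the broker's `listeners` setting."""
    return ",".join("%s://%s:%d" % (p, host, PORTS[p]) for p in protocols)


def upgrade_phases(old_protocol, client_protocol, broker_protocol):
    """Return the broker settings for each incremental bounce, in order."""
    all_protocols = _dedupe([old_protocol, client_protocol, broker_protocol])
    final_protocols = _dedupe([client_protocol, broker_protocol])
    return [
        # Bounce 1: open every new listener, keep talking over the old protocol.
        {"listeners": _listeners(all_protocols),
         "security.inter.broker.protocol": old_protocol},
        # Bounce 2: switch inter-broker traffic to the new broker protocol.
        {"listeners": _listeners(all_protocols),
         "security.inter.broker.protocol": broker_protocol},
        # Bounce 3: remove the old listener.
        {"listeners": _listeners(final_protocols),
         "security.inter.broker.protocol": broker_protocol},
    ]


if __name__ == "__main__":
    phases = upgrade_phases("PLAINTEXT", "SASL_PLAINTEXT", "SSL")
    for number, settings in enumerate(phases, 1):
        print(number, settings)
```

With PLAINTEXT as the starting point, SASL_PLAINTEXT for clients and SSL between brokers, the first bounce opens all three listeners. Enabling only SASL_PLAINTEXT at that step would leave brokers without an SSL listener to connect to once the inter-broker protocol is switched, which is the disruption the comment describes.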