[
https://issues.apache.org/jira/browse/KAFKA-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880684#comment-15880684
]
Rajini Sivaram commented on KAFKA-4779:
---------------------------------------
I couldn't recreate the failure, but the code for phase_two of the security
upgrade with different client_protocol and broker_protocol currently performs a
disruptive upgrade that stops produce and consume while the upgrade is in
progress. As a result, the consumer can time out if the upgrade takes slightly
longer than expected.

A non-disruptive upgrade of a cluster to enable new security protocols is
described in the docs
(http://kafka.apache.org/documentation/#security_rolling_upgrade). The new
protocols must first be enabled with an incremental bounce. Then the
inter-broker protocol is updated with another incremental bounce. Finally, the
old protocol is removed. When client_protocol (SASL_PLAINTEXT) and
broker_protocol (SSL) are being updated to different protocols starting from
PLAINTEXT, both SASL_PLAINTEXT and SSL must be enabled before the inter-broker
protocol is changed to SSL. The test was enabling only SASL_PLAINTEXT. As a
result, inter-broker communication was broken during the upgrade, causing
produce and consume to fail until the cluster got back to a good state.

Since the purpose of the test is to verify non-disruptive upgrade, I have
changed the test to enable both SASL_PLAINTEXT and SSL first so that the
upgrade is performed without disrupting producers or consumers.
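
To illustrate the ordering described above, here is a minimal sketch in plain
Python. It is not the code from security_rolling_upgrade_test.py; the function
names (rolling_upgrade_phases, is_non_disruptive) and the dictionary keys are
hypothetical and only model which protocols are open in each phase.

```python
def rolling_upgrade_phases(client_protocol, broker_protocol,
                           old_protocol="PLAINTEXT"):
    """Model the three rolling bounces of a non-disruptive upgrade.

    Hypothetical helper: each dict lists the protocols the brokers listen on
    and the protocol used for inter-broker traffic during that phase.
    """
    final = sorted({client_protocol, broker_protocol})
    added = [p for p in final if p != old_protocol]
    return [
        # Bounce 1: open every new protocol, keep inter-broker on the old one.
        {"enabled_protocols": [old_protocol] + added,
         "inter_broker_protocol": old_protocol},
        # Bounce 2: switch inter-broker traffic to the new broker protocol.
        {"enabled_protocols": [old_protocol] + added,
         "inter_broker_protocol": broker_protocol},
        # Bounce 3: remove the old protocol.
        {"enabled_protocols": final,
         "inter_broker_protocol": broker_protocol},
    ]


def is_non_disruptive(phases):
    """Check that brokers can always reach each other mid-bounce.

    During a rolling bounce some brokers run the previous phase and some the
    next one, so each side's inter-broker protocol must be open on the other.
    """
    for prev, nxt in zip(phases, phases[1:]):
        if nxt["inter_broker_protocol"] not in prev["enabled_protocols"]:
            return False
        if prev["inter_broker_protocol"] not in nxt["enabled_protocols"]:
            return False
    return True


phases = rolling_upgrade_phases("SASL_PLAINTEXT", "SSL")

# The sequence the test used before the fix: SSL was not open on the brokers
# when the inter-broker protocol was switched to SSL.
broken = [
    {"enabled_protocols": ["PLAINTEXT", "SASL_PLAINTEXT"],
     "inter_broker_protocol": "PLAINTEXT"},
    {"enabled_protocols": ["PLAINTEXT", "SASL_PLAINTEXT", "SSL"],
     "inter_broker_protocol": "SSL"},
]
```

With these definitions, is_non_disruptive(phases) holds for the corrected
ordering and fails for the broken sequence, because SSL is missing from the
first phase.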
> Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py
> ----------------------------------------------------------------------------
>
> Key: KAFKA-4779
> URL: https://issues.apache.org/jira/browse/KAFKA-4779
> Project: Kafka
> Issue Type: Bug
> Reporter: Apurva Mehta
> Assignee: Rajini Sivaram
>
> This test failed on 01/29, on both trunk and 0.10.2, error message:
> {noformat}
> The consumer has terminated, or timed out, on node ubuntu@worker3.
> Traceback (most recent call last):
> File
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
> line 123, in run
> data = self.run_test()
> File
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
> line 176, in run_test
> return self.test_context.function(self.test)
> File
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
> line 321, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
> File
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py",
> line 148, in test_rolling_upgrade_phase_two
> self.run_produce_consume_validate(self.roll_in_secured_settings,
> client_protocol, broker_protocol)
> File
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
> line 100, in run_produce_consume_validate
> self.stop_producer_and_consumer()
> File
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
> line 87, in stop_producer_and_consumer
> self.check_alive()
> File
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
> line 79, in check_alive
> raise Exception(msg)
> Exception: The consumer has terminated, or timed out, on node ubuntu@worker3.
> {noformat}
> Looks like the console consumer times out:
> {noformat}
> [2017-01-30 04:56:00,972] ERROR Error processing message, terminating
> consumer process: (kafka.tools.ConsoleConsumer$)
> kafka.consumer.ConsumerTimeoutException
> at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:90)
> at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:120)
> at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:75)
> at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:50)
> at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
> {noformat}
> A bunch of these security_rolling_upgrade tests failed, and in all cases, the
> producer produced ~15k messages, of which ~7k were acked, and the consumer
> only got around ~2600 before timing out.
> There are a lot of messages like the following for different request types on
> the producer and consumer:
> {noformat}
> [2017-01-30 05:13:35,954] WARN Received unknown topic or partition error in
> produce request on partition test_topic-0. The topic/partition may not exist
> or the user may not have Describe access to it
> (org.apache.kafka.clients.producer.internals.Sender)
> {noformat}