[ https://issues.apache.org/jira/browse/KAFKA-4574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905778#comment-15905778 ]
Apurva Mehta commented on KAFKA-4574: ------------------------------------- It appears than different messages wound up at the same offset: {noformat} amehta-macbook-pro:50 apurva$ grep 4738 VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep producer_send_success {"topic":"test_topic","partition":2,"name":"producer_send_success","value":"4738","time_ms":1489036862544,"offset":1496,"key":null} amehta-macbook-pro:50 apurva$ grep 5614 VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep producer_send_success {"topic":"test_topic","partition":2,"name":"producer_send_success","value":"5614","time_ms":1489036872096,"offset":1496,"key":null} {noformat} Note that two unique messages were acked at offset 1496 in partition 2. In the log, only the second message (by timestamp) persists. So the first message was lost. What happened during the test is that the consumer read offset 1496 which had the first value at that time, and then moved on. So the second message was never read, and was reported missing. This may be a scenario where KIP-101 would solve the problem? The brokers are being continually bounced in the test, so it is possible that once in a while the wrong truncation happens and messages are lost. Another interesting finding: the producer received acks for 6425 messages. But there are 6427 unique messages in the log, and the consumer also read 6427 unique messages. So some unacked messages seem to have made it into the log. There were three messages which were not acked, but are in the log, and all of them have this error: {noformat} amehta-macbook-pro:50 apurva$ grep 4728 VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep producer_send_error {"topic":"test_topic","message":"This server is not the leader for that topic-partition.","exception":"class org.apache.kafka.common.errors.NotLeaderForPartitionException","name":"producer_send_error","value":"4728","time_ms":1489036862629,"key":null} amehta-macbook-pro:50 apurva$ grep 2844 VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep producer_send_error {"topic":"test_topic","message":"Messages are written to the log, but to fewer in-sync replicas than required.","exception":"class org.apache.kafka.common.errors.NotEnoughReplicasAfterAppendException","name":"producer_send_error","value":"2844","time_ms":1489036843590,"key":null} amehta-macbook-pro:50 apurva$ grep 5588 VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep producer_send_error {"topic":"test_topic","message":"Messages are written to the log, but to fewer in-sync replicas than required.","exception":"class org.apache.kafka.common.errors.NotEnoughReplicasAfterAppendException","name":"producer_send_error","value":"5588","time_ms":1489036871054,"key":null} {noformat} This also seems problematic, since the producer config shows acks=-1, and retries=0. So the producer considered this a failure, but the messages wound up in the log. This also seems like a problem, but it is not the cause of this test failure. > Transient failure in ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade > with security_protocol = SASL_PLAINTEXT, SSL > ----------------------------------------------------------------------------------------------------------------------- > > Key: KAFKA-4574 > URL: https://issues.apache.org/jira/browse/KAFKA-4574 > Project: Kafka > Issue Type: Test > Components: system tests > Reporter: Shikhar Bhushan > Assignee: Apurva Mehta > > http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-12-29--001.1483003056--apache--trunk--dc55025/report.html > {{ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade}} failed with these > {{security_protocol}} parameters > {noformat} > ==================================================================================================== > test_id: > kafkatest.tests.core.zookeeper_security_upgrade_test.ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade.security_protocol=SASL_PLAINTEXT > status: FAIL > run time: 3 minutes 44.094 seconds > 1 acked message did not make it to the Consumer. They are: [5076]. We > validated that the first 1 of these missing messages correctly made it into > Kafka's data files. This suggests they were lost on their way to the consumer. > Traceback (most recent call last): > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 123, in run > data = self.run_test() > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 176, in run_test > return self.test_context.function(self.test) > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py", > line 321, in wrapper > return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/core/zookeeper_security_upgrade_test.py", > line 117, in test_zk_security_upgrade > self.run_produce_consume_validate(self.run_zk_migration) > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 101, in run_produce_consume_validate > self.validate() > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 163, in validate > assert success, msg > AssertionError: 1 acked message did not make it to the Consumer. They are: > [5076]. We validated that the first 1 of these missing messages correctly > made it into Kafka's data files. This suggests they were lost on their way to > the consumer. > {noformat} > {noformat} > ==================================================================================================== > test_id: > kafkatest.tests.core.zookeeper_security_upgrade_test.ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade.security_protocol=SSL > status: FAIL > run time: 3 minutes 50.578 seconds > 1 acked message did not make it to the Consumer. They are: [3559]. We > validated that the first 1 of these missing messages correctly made it into > Kafka's data files. This suggests they were lost on their way to the consumer. > Traceback (most recent call last): > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 123, in run > data = self.run_test() > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py", > line 176, in run_test > return self.test_context.function(self.test) > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py", > line 321, in wrapper > return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/core/zookeeper_security_upgrade_test.py", > line 117, in test_zk_security_upgrade > self.run_produce_consume_validate(self.run_zk_migration) > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 101, in run_produce_consume_validate > self.validate() > File > "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py", > line 163, in validate > assert success, msg > AssertionError: 1 acked message did not make it to the Consumer. They are: > [3559]. We validated that the first 1 of these missing messages correctly > made it into Kafka's data files. This suggests they were lost on their way to > the consumer. > {noformat} > Previously: KAFKA-3985 -- This message was sent by Atlassian JIRA (v6.3.15#6346)