[ 
https://issues.apache.org/jira/browse/KAFKA-4574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905778#comment-15905778
 ] 

Apurva Mehta commented on KAFKA-4574:
-------------------------------------

It appears than different messages wound up at the same offset: 

{noformat}
amehta-macbook-pro:50 apurva$ grep 4738 
VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep 
producer_send_success
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"4738","time_ms":1489036862544,"offset":1496,"key":null}
amehta-macbook-pro:50 apurva$ grep 5614 
VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep 
producer_send_success
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"5614","time_ms":1489036872096,"offset":1496,"key":null}
{noformat}

Note that two unique messages were acked at offset 1496 in partition 2. In the 
log, only the second message (by timestamp) persists. So the first message was 
lost. What happened during the test is that the consumer read offset 1496 which 
had the first value at that time, and then moved on. So the second message was 
never read, and was reported missing. 

This may be a scenario where KIP-101 would solve the problem? The brokers are 
being continually bounced in the test, so it is possible that once in a while 
the wrong truncation happens and messages are lost. 

Another interesting finding: the producer received acks for 6425 messages. But 
there are 6427 unique messages in the log, and the consumer also read 6427 
unique messages. So some unacked messages seem to have made it into the log.

There were three messages which were not acked, but are in the log, and all of 
them have this error: 

{noformat}
amehta-macbook-pro:50 apurva$ grep 4728 
VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep 
producer_send_error
{"topic":"test_topic","message":"This server is not the leader for that 
topic-partition.","exception":"class 
org.apache.kafka.common.errors.NotLeaderForPartitionException","name":"producer_send_error","value":"4728","time_ms":1489036862629,"key":null}
amehta-macbook-pro:50 apurva$ grep 2844 
VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep 
producer_send_error
{"topic":"test_topic","message":"Messages are written to the log, but to fewer 
in-sync replicas than required.","exception":"class 
org.apache.kafka.common.errors.NotEnoughReplicasAfterAppendException","name":"producer_send_error","value":"2844","time_ms":1489036843590,"key":null}
amehta-macbook-pro:50 apurva$ grep 5588 
VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep 
producer_send_error
{"topic":"test_topic","message":"Messages are written to the log, but to fewer 
in-sync replicas than required.","exception":"class 
org.apache.kafka.common.errors.NotEnoughReplicasAfterAppendException","name":"producer_send_error","value":"5588","time_ms":1489036871054,"key":null}
{noformat}

This also seems problematic, since the producer config shows acks=-1, and 
retries=0. So the producer considered this a failure, but the messages wound up 
in the log. This also seems like a problem, but it is not the cause of this 
test failure.

> Transient failure in ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade 
> with security_protocol = SASL_PLAINTEXT, SSL
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4574
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4574
>             Project: Kafka
>          Issue Type: Test
>          Components: system tests
>            Reporter: Shikhar Bhushan
>            Assignee: Apurva Mehta
>
> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-12-29--001.1483003056--apache--trunk--dc55025/report.html
> {{ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade}} failed with these 
> {{security_protocol}} parameters 
> {noformat}
> ====================================================================================================
> test_id:    
> kafkatest.tests.core.zookeeper_security_upgrade_test.ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade.security_protocol=SASL_PLAINTEXT
> status:     FAIL
> run time:   3 minutes 44.094 seconds
>     1 acked message did not make it to the Consumer. They are: [5076]. We 
> validated that the first 1 of these missing messages correctly made it into 
> Kafka's data files. This suggests they were lost on their way to the consumer.
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
>     data = self.run_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
>     return self.test_context.function(self.test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
>     return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/core/zookeeper_security_upgrade_test.py",
>  line 117, in test_zk_security_upgrade
>     self.run_produce_consume_validate(self.run_zk_migration)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 101, in run_produce_consume_validate
>     self.validate()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 163, in validate
>     assert success, msg
> AssertionError: 1 acked message did not make it to the Consumer. They are: 
> [5076]. We validated that the first 1 of these missing messages correctly 
> made it into Kafka's data files. This suggests they were lost on their way to 
> the consumer.
> {noformat}
> {noformat}
> ====================================================================================================
> test_id:    
> kafkatest.tests.core.zookeeper_security_upgrade_test.ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade.security_protocol=SSL
> status:     FAIL
> run time:   3 minutes 50.578 seconds
>     1 acked message did not make it to the Consumer. They are: [3559]. We 
> validated that the first 1 of these missing messages correctly made it into 
> Kafka's data files. This suggests they were lost on their way to the consumer.
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
>     data = self.run_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
>     return self.test_context.function(self.test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
>     return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/core/zookeeper_security_upgrade_test.py",
>  line 117, in test_zk_security_upgrade
>     self.run_produce_consume_validate(self.run_zk_migration)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 101, in run_produce_consume_validate
>     self.validate()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 163, in validate
>     assert success, msg
> AssertionError: 1 acked message did not make it to the Consumer. They are: 
> [3559]. We validated that the first 1 of these missing messages correctly 
> made it into Kafka's data files. This suggests they were lost on their way to 
> the consumer.
> {noformat}
> Previously: KAFKA-3985



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to