[ https://issues.apache.org/jira/browse/KAFKA-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890661#comment-15890661 ]
Ben Stopford edited comment on KAFKA-4825 at 3/2/17 2:55 PM:
-------------------------------------------------------------

This could be a result of KIP-101 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-101+-+Alter+Replication+Protocol+to+use+Leader+Epoch+rather+than+High+Watermark+for+Truncation), although the use of clean shutdown in the test makes it less likely.

was (Author: benstopford):
This could be a result of KIP-101 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-101+-+Alter+Replication+Protocol+to+use+Leader+Epoch+rather+than+High+Watermark+for+Truncation)

> Likely Data Loss in ReassignPartitionsTest System Test
> ------------------------------------------------------
>
>                 Key: KAFKA-4825
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4825
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Ben Stopford
>         Attachments: problem.zip
>
>
> A failure in the below test may imply a genuinely missing message.
> kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.security_protocol=PLAINTEXT
> The test - which reassigns partitions whilst bouncing cluster members -
> reconciles messages ack'd with messages received in the consumer.
> The interesting part is that we received two acks for the same offset, with
> different messages:
> {"topic":"test_topic","partition":11,"name":"producer_send_success","value":"7447","time_ms":1488349980718,"offset":372,"key":null}
> {"topic":"test_topic","partition":11,"name":"producer_send_success","value":"7487","time_ms":1488349981780,"offset":372,"key":null}
> When searching the log files, via kafka.tools.DumpLogSegments, only the later
> message is found.
> The missing message lies midway through the test and appears to occur after a
> leader moves (after 7447 is sent there is a ~1s pause, then 7487 is sent,
> along with a backlog of messages for partitions 11, 16, 6).
> The overall implication is that a message appears to be acknowledged but is
> later lost.
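For anyone wanting to scan the producer output for other cases like the one above, the check for "two acks for the same offset, with different messages" could be sketched roughly as follows. This is only an illustration, not code from the system test: find_conflicting_acks is a hypothetical helper, and it assumes the input has already been reduced to one JSON event per line in the format quoted above.

```python
import json
from collections import defaultdict


def find_conflicting_acks(lines):
    """Return {(topic, partition, offset): [values]} for every offset that
    was acknowledged with more than one distinct value.

    Hypothetical helper: assumes each non-empty line is a single JSON object
    shaped like the producer_send_success events quoted in this issue.
    """
    acked = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Only successful sends count as acknowledgements.
        if event.get("name") != "producer_send_success":
            continue
        key = (event["topic"], event["partition"], event["offset"])
        if event["value"] not in acked[key]:
            acked[key].append(event["value"])
    # Keep only offsets that were acked for two or more different values.
    return {key: values for key, values in acked.items() if len(values) > 1}


# The two events quoted in the issue description.
sample = [
    '{"topic":"test_topic","partition":11,"name":"producer_send_success","value":"7447","time_ms":1488349980718,"offset":372,"key":null}',
    '{"topic":"test_topic","partition":11,"name":"producer_send_success","value":"7487","time_ms":1488349981780,"offset":372,"key":null}',
]
conflicts = find_conflicting_acks(sample)
print(conflicts)  # {('test_topic', 11, 372): ['7447', '7487']}
```

Any key it returns is an offset where at least one acknowledged message cannot be the one stored in the log, so those are the offsets worth cross-checking against the DumpLogSegments output.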
> Looking at the test itself, it seems valid. The producer is initialised with
> acks = -1. The test checks for an exception in the onCompletion callback
> and uses this to track acknowledgement.
> https://jenkins.confluent.io/job/system-test-kafka/521/console
> http://testing.confluent.io/confluent-kafka-system-test-results/?prefix=2017-03-01--001.1488363091--apache--trunk--c9872cb/ReassignPartitionsTest/test_reassign_partitions/bounce_brokers=True.security_protocol=PLAINTEXT/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)