[jira] [Commented] (KAFKA-3518) Transient failure in ReplicationTest.test_replication_with_broker_failure with ConsumerTimeoutException

Jun Rao (JIRA) Thu, 04 Aug 2016 18:09:07 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408726#comment-15408726
 ]


Jun Rao commented on KAFKA-3518:
--------------------------------

Took a look at the log. The following are my findings.

>From report.txt, it shows the following messages are missing: 33867, 33870, 
>33873, 33876, 33879, 33882, 33885

Grepping those message from the producer output show the following. All the 
missing messages are consecutive in the log of test_topic partition 2.
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"33867","class":"class
 
org.apache.kafka.tools.VerifiableProducer","time_ms":1459946788801,"offset":11289,"key":null}
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"33870","class":"class
 
org.apache.kafka.tools.VerifiableProducer","time_ms":1459946788801,"offset":11290,"key":null}
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"33873","class":"class
 
org.apache.kafka.tools.VerifiableProducer","time_ms":1459946788801,"offset":11291,"key":null}
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"33876","class":"class
 
org.apache.kafka.tools.VerifiableProducer","time_ms":1459946788801,"offset":11292,"key":null}
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"33879","class":"class
 
org.apache.kafka.tools.VerifiableProducer","time_ms":1459946788801,"offset":11293,"key":null}
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"33882","class":"class
 
org.apache.kafka.tools.VerifiableProducer","time_ms":1459946788801,"offset":11294,"key":null}
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"33885","class":"class
 
org.apache.kafka.tools.VerifiableProducer","time_ms":1459946788801,"offset":11295,"key":null}

The broker logs show the following.
worker9:
[2016-04-06 12:46:28,812] INFO Truncating log test_topic-2 to offset 11296. 
(kafka.log.Log)
[2016-04-06 12:46:36,867] INFO Truncating log test_topic-2 to offset 11296. 
(kafka.log.Log)
[2016-04-06 12:46:39,032] INFO Truncating log test_topic-2 to offset 11296. 
(kafka.log.Log)
[2016-04-06 12:46:39,319] INFO Truncating log test_topic-2 to offset 11289. 
(kafka.log.Log)
[2016-04-06 12:48:33,940] INFO Truncating log test_topic-2 to offset 19965. 
(kafka.log.Log)


worker10:
[2016-04-06 12:45:47,104] INFO Truncating log test_topic-2 to offset 0. 
(kafka.log.Log)
[2016-04-06 12:46:16,088] INFO Truncating log test_topic-2 to offset 3952. 
(kafka.log.Log)
[2016-04-06 12:46:54,141] INFO Truncating log test_topic-2 to offset 11289. 
(kafka.log.Log)
[2016-04-06 12:46:56,494] INFO Truncating log test_topic-2 to offset 16968. 
(kafka.log.Log)


worker 1:
[2016-04-06 12:45:47,108] INFO Truncating log test_topic-2 to offset 0. 
(kafka.log.Log)
[2016-04-06 12:46:26,867] INFO Truncating log test_topic-2 to offset 7790. 
(kafka.log.Log)
[2016-04-06 12:46:28,849] INFO Truncating log test_topic-2 to offset 11289. 
(kafka.log.Log)
[2016-04-06 12:46:56,489] INFO Truncating log test_topic-2 to offset 16972. 
(kafka.log.Log)
[2016-04-06 12:47:03,900] INFO Truncating log test_topic-2 to offset 16972. 
(kafka.log.Log)
[2016-04-06 12:48:33,954] INFO Truncating log test_topic-2 to offset 19965. 
(kafka.log.Log)

Offset 11289 is truncated a few times. Currently, there is a known issue 
(KAFKA-1211) that if the follower becomes the leader very quickly after 
truncation, some previously committed messages could be lost. This can happen 
if the broker are restarted very quickly, which doesn't seem to be the case 
from the test_log.debug.

[DEBUG - 2016-04-06 12:46:05,764 - remoteaccount - _ssh_quiet - lineno:213]: 
Trying to run remote command: ssh ubuntu@worker10 -o 'HostName 
ec2-54-187-119-87.us-west-2.compute.amazonaws.com' -o 'Port 22' -o 
'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o 
'PasswordAuthentication no' -o 'IdentityFile /var/lib/jenkins/muckrake.pem' -o 
'IdentitiesOnly yes' -o 'LogLevel FATAL'  'kill -15 28525'
[DEBUG - 2016-04-06 12:46:17,411 - remoteaccount - _ssh_quiet - lineno:213]: 
Trying to run remote command: ssh ubuntu@worker1 -o 'HostName 
ec2-54-213-143-217.us-west-2.compute.amazonaws.com' -o 'Port 22' -o 
'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o 
'PasswordAuthentication no' -o 'IdentityFile /var/lib/jenkins/muckrake.pem' -o 
'IdentitiesOnly yes' -o 'LogLevel FATAL'  'kill -15 7855'
[DEBUG - 2016-04-06 12:46:28,122 - remoteaccount - _ssh_quiet - lineno:213]: 
Trying to run remote command: ssh ubuntu@worker9 -o 'HostName 
ec2-54-200-156-141.us-west-2.compute.amazonaws.com' -o 'Port 22' -o 
'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o 
'PasswordAuthentication no' -o 'IdentityFile /var/lib/jenkins/muckrake.pem' -o 
'IdentitiesOnly yes' -o 'LogLevel FATAL'  'kill -15 8372'
[DEBUG - 2016-04-06 12:46:38,291 - remoteaccount - _ssh_quiet - lineno:213]: 
Trying to run remote command: ssh ubuntu@worker10 -o 'HostName 
ec2-54-187-119-87.us-west-2.compute.amazonaws.com' -o 'Port 22' -o 
'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o 
'PasswordAuthentication no' -o 'IdentityFile /var/lib/jenkins/muckrake.pem' -o 
'IdentitiesOnly yes' -o 'LogLevel FATAL'  'kill -15 30050'
[DEBUG - 2016-04-06 12:46:55,532 - remoteaccount - _ssh_quiet - lineno:213]: 
Trying to run remote command: ssh ubuntu@worker1 -o 'HostName 
ec2-54-213-143-217.us-west-2.compute.amazonaws.com' -o 'Port 22' -o 
'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o 
'PasswordAuthentication no' -o 'IdentityFile /var/lib/jenkins/muckrake.pem' -o 
'IdentitiesOnly yes' -o 'LogLevel FATAL'  'kill -15 9259'

Another possibility is due to KAFKA-3670 during preferred leader election. I 
wasn't able to fully verify if this is indeed the case since we didn't enable 
TRACE level logging in the controller and state-change.log where we log leader 
change information. Filed KAFKA-4021 to track that. 

We fixed KAFKA-3670 on May 8. I checked the system test history after May 8. 
This test only failed 2 more times. However, the failure is not due to message 
loss. In one case 
(http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-05-10--001.1462898929--apache--trunk--6f18732/report.html),
 the broker can't be started and in the other 
(http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-05-09--001.1462811506--apache--trunk--f5b98b8/report.html),
 the controller can't be identified. So, it seems this particular issue hasn't 
shown up again and is likely fixed by KAFKA-3670. I think we can close this 
jira for now.

The more complete fix will happen in KAFKA-1211, but it may take some time. 

> Transient failure in ReplicationTest.test_replication_with_broker_failure 
> with ConsumerTimeoutException
> -------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-3518
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3518
>             Project: Kafka
>          Issue Type: Bug
>          Components: system tests
>            Reporter: Ewen Cheslack-Postava
>
> From 
> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-04-06--001.1459956800--apache--trunk--99d2329/report.html
>  this test failed on a recent run:
> Module: kafkatest.tests.core.replication_test
> Class:  ReplicationTest
> Method: test_replication_with_broker_failure
> Arguments:
> {
>   "broker_type": "leader",
>   "failure_mode": "clean_bounce",
>   "security_protocol": "SASL_PLAINTEXT"
> }
> It is missing 7 messages on the consumer trying to validate the messages. The 
> consumer hit its timeout, which should be sufficiently long to capture any 
> successfully produced messages. I didn't notice anything obviously wrong in 
> the logs. Could be some sort of consumer stall?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-3518) Transient failure in ReplicationTest.test_replication_with_broker_failure with ConsumerTimeoutException

Reply via email to