[jira] [Commented] (KAFKA-3039) Temporary loss of leader resulted in log being completely truncated

Hanlin Liu (JIRA) Fri, 23 Jun 2017 03:43:05 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060711#comment-16060711
 ]


Hanlin Liu commented on KAFKA-3039:
-----------------------------------

We are seeing similar issue when restarting KAFKA 0.10.2.0

```
[2017-06-23 05:50:35,082] WARN Found a corrupted index file due to requirement 
failed: Corrupt index found, index file 
(/var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.index) has 
non-zero size but the last offset is 74220 which is no larger than the base 
offset 74220.}. deleting 
/var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.timeindex, 
/var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.index and 
rebuilding index... (kafka.log.Log)
[2017-06-23 05:50:35,091] INFO Recovering unflushed segment 74220 in log 
__consumer_offsets-22. (kafka.log.Log)
[2017-06-23 05:50:35,095] INFO Completed load of log __consumer_offsets-22 with 
2 log segments and log end offset 75825 in 35 ms (kafka.log.Log)
[2017-06-23 06:14:57,133] TRACE Controller 1 epoch 1 sending UpdateMetadata 
request (Leader:2,ISR:2,3,1,LeaderEpoch:0,ControllerEpoch:1) to brokers Set(1, 
2, 3) for partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,482] TRACE Broker 1 handling LeaderAndIsr request 
correlationId 7 from controller 1 epoch 1 starting the become-follower 
transition for partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,587] INFO Partition [__consumer_offsets,22] on broker 1: 
No checkpointed highwatermark is found for partition __consumer_offsets-22 
(kafka.cluster.Partition)
[2017-06-23 06:14:57,606] INFO [ReplicaFetcherManager on broker 1] Removed 
fetcher for partitions 
__consumer_offsets-28,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-14,__consumer_offsets-40,__consumer_offsets-37,__consumer_offsets-22,__consumer_offsets-41,__consumer_offsets-4,__consumer_offsets-23,__consumer_offsets-26,__consumer_offsets-8,__consumer_offsets-49,__consumer_offsets-31,__consumer_offsets-13,__consumer_offsets-35,__consumer_offsets-17,__consumer_offsets-43,__consumer_offsets-25,__consumer_offsets-44,__consumer_offsets-47,__consumer_offsets-7,__consumer_offsets-29,__consumer_offsets-11,__consumer_offsets-34,__consumer_offsets-19,__consumer_offsets-16,__consumer_offsets-38,__consumer_offsets-1,__consumer_offsets-20,__consumer_offsets-5,__consumer_offsets-46,__consumer_offsets-2
 (kafka.server.ReplicaFetcherManager)
[2017-06-23 06:14:57,606] TRACE Broker 1 stopped fetchers as part of 
become-follower request from controller 1 epoch 1 with correlation id 7 for 
partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,608] INFO The cleaning for partition __consumer_offsets-22 
is aborted and paused (kafka.log.LogCleaner)
[2017-06-23 06:14:57,608] INFO Truncating log __consumer_offsets-22 to offset 
0. (kafka.log.Log)
[2017-06-23 06:14:57,609] INFO Scheduling log segment 42456 for log 
__consumer_offsets-22 for deletion. (kafka.log.Log)
[2017-06-23 06:14:57,609] INFO Scheduling log segment 74220 for log 
__consumer_offsets-22 for deletion. (kafka.log.Log)
[2017-06-23 06:14:57,611] INFO Compaction for partition __consumer_offsets-22 
is resumed (kafka.log.LogCleaner)
[2017-06-23 06:14:57,904] TRACE Broker 1 truncated logs and checkpointed 
recovery boundaries for partition __consumer_offsets-22 as part of 
become-follower request with correlation id 7 from controller 1 epoch 1 
(state.change.logger)
[2017-06-23 06:14:57,910] INFO [ReplicaFetcherManager on broker 1] Added 
fetcher for partitions List([__consumer_offsets-22, initOffset 0 to broker 
BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-8, initOffset 0 to 
broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-4, 
initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , 
[__consumer_offsets-7, initOffset 0 to broker 
BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-46, initOffset 0 to 
broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-25, 
initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , 
[__consumer_offsets-35, initOffset 0 to broker 
BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-41, initOffset 0 to 
broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-23, 
initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , 
[__consumer_offsets-49, initOffset 0 to broker 
BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-47, initOffset 0 to 
broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-16, 
initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , 
[__consumer_offsets-28, initOffset 0 to broker 
BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-31, initOffset 0 to 
broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-37, 
initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , 
[__consumer_offsets-38, initOffset 0 to broker 
BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-17, initOffset 0 to 
broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-19, 
initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , 
[__consumer_offsets-11, initOffset 0 to broker 
BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-13, initOffset 0 to 
broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-2, 
initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , 
[__consumer_offsets-43, initOffset 0 to broker 
BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-14, initOffset 0 to 
broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-20, 
initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , 
[__consumer_offsets-44, initOffset 0 to broker 
BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-1, initOffset 0 to 
broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-5, 
initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , 
[__consumer_offsets-26, initOffset 0 to broker 
BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-29, initOffset 0 to 
broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-34, 
initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , 
[__consumer_offsets-10, initOffset 0 to broker 
BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-32, initOffset 0 to 
broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-40, 
initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] ) 
(kafka.server.ReplicaFetcherManager)
[2017-06-23 06:14:57,910] TRACE Broker 1 started fetcher to new leader as part 
of become-follower request from controller 1 epoch 1 with correlation id 7 for 
partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,911] TRACE Broker 1 completed LeaderAndIsr request 
correlationId 7 from controller 1 epoch 1 for the become-follower transition 
for partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,922] TRACE Broker 1 cached leader info 
(LeaderAndIsrInfo:(Leader:2,ISR:2,3,1,LeaderEpoch:0,ControllerEpoch:1),ReplicationFactor:3),AllReplicas:1,2,3)
 for partition __consumer_offsets-22 in response to UpdateMetadata request sent 
by controller 1 epoch 1 with correlation id 8 (state.change.logger)
[2017-06-23 06:15:57,610] INFO Deleting segment 42456 from log 
__consumer_offsets-22. (kafka.log.Log)
[2017-06-23 06:15:57,610] INFO Deleting segment 74220 from log 
__consumer_offsets-22. (kafka.log.Log)
[2017-06-23 06:15:57,612] INFO Deleting index 
/var/lib/kafka/data/__consumer_offsets-22/00000000000000042456.index.deleted 
(kafka.log.OffsetIndex)
[2017-06-23 06:15:57,613] INFO Deleting index 
/var/lib/kafka/data/__consumer_offsets-22/00000000000000042456.timeindex.deleted
 (kafka.log.TimeIndex)
[2017-06-23 06:15:58,311] INFO Deleting index 
/var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.index.deleted 
(kafka.log.OffsetIndex)
[2017-06-23 06:15:58,912] INFO Deleting index 
/var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.timeindex.deleted
 (kafka.log.TimeIndex)



```
INFO KafkaConfig values: 
        advertised.host.name = null
        advertised.listeners = PLAINTEXT://192.168.1.214:9092
        advertised.port = null
        authorizer.class.name = 
        auto.create.topics.enable = true
        auto.leader.rebalance.enable = true
        background.threads = 10
        broker.id = 1
        broker.id.generation.enable = true
        broker.rack = null
        compression.type = producer
        connections.max.idle.ms = 600000
        controlled.shutdown.enable = true
        controlled.shutdown.max.retries = 3
        controlled.shutdown.retry.backoff.ms = 5000
        controller.socket.timeout.ms = 30000
        create.topic.policy.class.name = null
        default.replication.factor = 3
        delete.topic.enable = false
        fetch.purgatory.purge.interval.requests = 1000
        group.max.session.timeout.ms = 300000
        group.min.session.timeout.ms = 6000
        host.name = 
        inter.broker.listener.name = null
        inter.broker.protocol.version = 0.10.2-IV0
        leader.imbalance.check.interval.seconds = 300
        leader.imbalance.per.broker.percentage = 10
        listener.security.protocol.map = 
SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,TRACE:TRACE,SASL_SSL:SASL_SSL,PLAINTEXT:PLAINTEXT
        listeners = PLAINTEXT://0.0.0.0:9092
        log.cleaner.backoff.ms = 15000
        log.cleaner.dedupe.buffer.size = 134217728
        log.cleaner.delete.retention.ms = 86400000
        log.cleaner.enable = true
        log.cleaner.io.buffer.load.factor = 0.9
        log.cleaner.io.buffer.size = 524288
        log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308
        log.cleaner.min.cleanable.ratio = 0.5
        log.cleaner.min.compaction.lag.ms = 0
        log.cleaner.threads = 1
        log.cleanup.policy = [delete]
        log.dir = /tmp/kafka-logs
        log.dirs = /var/lib/kafka/data
        log.flush.interval.messages = 9223372036854775807
        log.flush.interval.ms = null
        log.flush.offset.checkpoint.interval.ms = 60000
        log.flush.scheduler.interval.ms = 9223372036854775807
        log.index.interval.bytes = 4096
        log.index.size.max.bytes = 10485760
        log.message.format.version = 0.10.2-IV0
        log.message.timestamp.difference.max.ms = 9223372036854775807
        log.message.timestamp.type = CreateTime
        log.preallocate = false
        log.retention.bytes = 10000000000
        log.retention.check.interval.ms = 300000
        log.retention.hours = 168
        log.retention.minutes = null
        log.retention.ms = null
        log.roll.hours = 168
        log.roll.jitter.hours = 0
        log.roll.jitter.ms = null
        log.roll.ms = null
        log.segment.bytes = 1073741824
        log.segment.delete.delay.ms = 60000
        max.connections.per.ip = 2147483647
        max.connections.per.ip.overrides = 
        message.max.bytes = 2000000
        metric.reporters = []
        metrics.num.samples = 2
        metrics.recording.level = INFO
        metrics.sample.window.ms = 30000
        min.insync.replicas = 1
        num.io.threads = 8
        num.network.threads = 3
        num.partitions = 1
        num.recovery.threads.per.data.dir = 1
        num.replica.fetchers = 1
        offset.metadata.max.bytes = 4096
        offsets.commit.required.acks = -1
        offsets.commit.timeout.ms = 5000
        offsets.load.buffer.size = 5242880
        offsets.retention.check.interval.ms = 600000
        offsets.retention.minutes = 1440
        offsets.topic.compression.codec = 0
        offsets.topic.num.partitions = 50
        offsets.topic.replication.factor = 3
        offsets.topic.segment.bytes = 104857600
        port = 9092
        principal.builder.class = class 
org.apache.kafka.common.security.auth.DefaultPrincipalBuilder
        producer.purgatory.purge.interval.requests = 1000
        queued.max.requests = 500
        quota.consumer.default = 9223372036854775807
        quota.producer.default = 9223372036854775807
        quota.window.num = 11
        quota.window.size.seconds = 1
        replica.fetch.backoff.ms = 1000
        replica.fetch.max.bytes = 3000000
        replica.fetch.min.bytes = 1
        replica.fetch.response.max.bytes = 10485760
        replica.fetch.wait.max.ms = 500
        replica.high.watermark.checkpoint.interval.ms = 5000
        replica.lag.time.max.ms = 10000
        replica.socket.receive.buffer.bytes = 65536
        replica.socket.timeout.ms = 30000
        replication.quota.window.num = 11
        replication.quota.window.size.seconds = 1
        request.timeout.ms = 30000
        reserved.broker.max.id = 1000
        sasl.enabled.mechanisms = [GSSAPI]
        sasl.kerberos.kinit.cmd = /usr/bin/kinit
        sasl.kerberos.min.time.before.relogin = 60000
        sasl.kerberos.principal.to.local.rules = [DEFAULT]
        sasl.kerberos.service.name = null
        sasl.kerberos.ticket.renew.jitter = 0.05
        sasl.kerberos.ticket.renew.window.factor = 0.8
        sasl.mechanism.inter.broker.protocol = GSSAPI
        security.inter.broker.protocol = PLAINTEXT
        socket.receive.buffer.bytes = 102400
        socket.request.max.bytes = 104857600
        socket.send.buffer.bytes = 102400
        ssl.cipher.suites = null
        ssl.client.auth = none
        ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
        ssl.endpoint.identification.algorithm = null
        ssl.key.password = null
        ssl.keymanager.algorithm = SunX509
        ssl.keystore.location = null
        ssl.keystore.password = null
        ssl.keystore.type = JKS
        ssl.protocol = TLS
        ssl.provider = null
        ssl.secure.random.implementation = null
        ssl.trustmanager.algorithm = PKIX
        ssl.truststore.location = null
        ssl.truststore.password = null
        ssl.truststore.type = JKS
        unclean.leader.election.enable = true
        zookeeper.connect = zk1:2181,zk2:2181,zk3:2181
        zookeeper.connection.timeout.ms = null
        zookeeper.session.timeout.ms = 6000
        zookeeper.set.acl = false
        zookeeper.sync.time.ms = 2000

> Temporary loss of leader resulted in log being completely truncated
> -------------------------------------------------------------------
>
>                 Key: KAFKA-3039
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3039
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.9.0.0
>         Environment: Debian 3.2.54-2 x86_64 GNU/Linux
>            Reporter: Imran Patel
>            Priority: Critical
>              Labels: reliability
>
> We had an event recently where the temporarily loss of a leader for a 
> partition (during a manual restart), resulted in the leader coming back with 
> no high watermark state and truncating its log to zero. Logs (attached below) 
> indicate that it did have the data but not the commit state. How is this 
> possible?
> Leader (broker 3)
> [2015-12-18 21:19:44,666] INFO Completed load of log messages-14 with log end 
> offset 14175963374 (kafka.log.Log)
> [2015-12-18 21:19:45,170] INFO Partition [messages,14] on broker 3: No 
> checkpointed highwatermark is found for partition [messages,14] 
> (kafka.cluster.Partition)
> [2015-12-18 21:19:45,238] INFO Truncating log messages-14 to offset 0. 
> (kafka.log.Log)
> [2015-12-18 21:20:34,066] INFO Partition [messages,14] on broker 3: Expanding 
> ISR for partition [messages,14] from 3 to 3,10 (kafka.cluster.Partition)
> Replica (broker 10)
> [2015-12-18 21:19:19,525] INFO Partition [messages,14] on broker 10: 
> Shrinking ISR for partition [messages,14] from 3,10,4 to 10,4 
> (kafka.cluster.Partition)
> [2015-12-18 21:20:34,049] ERROR [ReplicaFetcherThread-0-3], Current offset 
> 14175984203 for partition [messages,14] out of range; reset offset to 35977 
> (kafka.server.ReplicaFetcherThread)
> [2015-12-18 21:20:34,033] WARN [ReplicaFetcherThread-0-3], Replica 10 for 
> partition [messages,14] reset its fetch offset from 14175984203 to current 
> leader 3's latest offset 35977 (kafka.server.ReplicaFetcherThread)
> Some relevant config parameters:
>         offsets.topic.replication.factor = 3
>         offsets.commit.required.acks = -1
>         replica.high.watermark.checkpoint.interval.ms = 5000
>         unclean.leader.election.enable = false



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (KAFKA-3039) Temporary loss of leader resulted in log being completely truncated

Reply via email to