Hello, Kafka Expert: I met some strange problem recently. Some topic may delete the whole partition data in one second, and recover later by replication. The the example of topic "ubt.mobilemonitor". It has 30 partitions, with replication 2. The retention.bytes is 157374182400(150G). The default segment is *1073741824(1G). *
*I want to know the reason, why the broker was scheduled to remove all segments files. It happened two or three times a week. The deleted partitions were random. No regular pattern found.* My Kafka Cluster has 16 broker nodes(8Core24G3T/per node), with 5 zookeeper nodes cluster. Have already created 18 topics, with the following broker setting: server.properties *############################# Server Basics #############################* *broker.id <http://broker.id>=10811396* *port=9092* *host.name <http://host.name>=10.8.113.96* *log.dirs=/data01/kafka,/data02/kafka,/data03/kafka,/data04/kafka,/data05/kafka,/data06/kafka* *# Replication configurations* *num.replica.fetchers=4* *replica.fetch.max.bytes=1048576* *replica.fetch.wait.max.ms <http://replica.fetch.wait.max.ms>=500* *replica.high.watermark.checkpoint.interval.ms <http://replica.high.watermark.checkpoint.interval.ms>=5000* *replica.socket.timeout.ms <http://replica.socket.timeout.ms>=30000* *replica.socket.receive.buffer.bytes=65536* *replica.lag.time.max.ms <http://replica.lag.time.max.ms>=10000* *replica.lag.max.messages=4000* *controller.socket.timeout.ms <http://controller.socket.timeout.ms>=30000* *controller.message.queue.size=10* *# Log configuration* *num.partitions=8* *message.max.bytes=1000000* *auto.create.topics.enable=false* *delete.topic.enable=true* *log.index.interval.bytes=4096* *log.index.size.max.bytes=10485760* *log.retention.hours=168* *log.retention.bytes=10737418240* *log.flush.interval.ms <http://log.flush.interval.ms>=10000* *log.flush.interval.messages=20000* *log.flush.scheduler.interval.ms <http://log.flush.scheduler.interval.ms>=2000* *log.roll.hours=168* *log.retention.check.interval.ms <http://log.retention.check.interval.ms>=300000* *log.segment.bytes=1073741824* *# ZK configuration* *zookeeper.connect=10.8.113.113:2181 <http://10.8.113.113:2181>,10.8.113.114:2181 <http://10.8.113.114:2181>,10.8.113.115:2181 <http://10.8.113.115:2181>,10.8.113.116:2181 <http://10.8.113.116:2181>,10.8.113.117:2181 <http://10.8.113.117:2181>* *zookeeper.connection.timeout.ms <http://zookeeper.connection.timeout.ms>=6000* *zookeeper.sync.time.ms <http://zookeeper.sync.time.ms>=2000* *# Socket server configuration* *num.io.threads=8* *num.network.threads=8* *socket.request.max.bytes=104857600* *socket.receive.buffer.bytes=1048576* *socket.send.buffer.bytes=1048576* *queued.max.requests=16* *fetch.purgatory.purge.interval.requests=100* *producer.purgatory.purge.interval.requests=100* The following is the server.log. *[From 10:31:20,136 to 10:31:20,158, the low showed 'Scheduling log segment' for all segments file under partition 2]* *[2015-09-16 10:31:20,136] INFO Scheduling log segment 323671365 for log ubt.mobilemonitor-2 for deletion. (kafka.log.Log)* *[2015-09-16 10:31:20,143] INFO Scheduling log segment 324603130 for log ubt.mobilemonitor-2 for deletion. (kafka.log.Log)* *[2015-09-16 10:31:20,143] INFO Scheduling log segment 325534373 for log ubt.mobilemonitor-2 for deletion. (kafka.log.Log)* *.....* *[2015-09-16 10:31:20,158] INFO Scheduling log segment 452156179 for log ubt.mobilemonitor-2 for deletion. (kafka.log.Log)* *[2015-09-16 10:31:20,158] INFO Scheduling log segment 453167436 for log ubt.mobilemonitor-2 for deletion. (kafka.log.Log)* *[Broker found the offset is not in sync with master anymore, it shrinked the ISR]* *[2015-09-16 10:31:20,192] WARN [ReplicaFetcherThread-1-1088254], Replica 10811396 for partition [ubt.mobilemonitor,2] reset its fetch offset from 323671365 to current leader 1088254's start offset 323671365 (kafka.server.ReplicaFetcherThread)* *[2015-09-16 10:31:20,192] ERROR [ReplicaFetcherThread-1-1088254], Current offset 453840204 for partition [ubt.mobilemonitor,2] out of range; reset offset to 323671365 (kafka.server.ReplicaFetcherThread)* *[2015-09-16 10:31:22,367] INFO Partition [ubt.mobilemonitor,2] on broker 1088254: Shrinking ISR for partition [ubt.mobilemonitor,2] from 1088254,10811396 to 1088254 (kafka.cluster.Partition)* *[Producer was still writing data, and segments were deleted at the same time]* *[2015-09-16 10:31:52,502] INFO Rolled new log segment for 'ubt.mobilemonitor-2' in 1 ms. (kafka.log.Log)* *[2015-09-16 10:32:20,143] INFO Deleting segment 323671365 from log ubt.mobilemonitor-2. (kafka.log.Log)* *[2015-09-16 10:32:20,143] INFO Deleting segment 324603130 from log ubt.mobilemonitor-2. (kafka.log.Log)* *[2015-09-16 10:32:20,144] INFO Deleting segment 326466801 from log ubt.mobilemonitor-2. (kafka.log.Log)* *[2015-09-16 10:32:20,144] INFO Deleting segment 329263345 from log ubt.mobilemonitor-2. (kafka.log.Log)* *[2015-09-16 10:32:20,144] INFO Deleting segment 327398756 from log ubt.mobilemonitor-2. (kafka.log.Log)* *[2015-09-16 10:32:20,144] INFO Deleting segment 330194386 from log ubt.mobilemonitor-2. (kafka.log.Log)* *[2015-09-16 10:32:20,144] INFO Deleting segment 325534373 from log ubt.mobilemonitor-2. (kafka.log.Log)* *[2015-09-16 10:32:20,144] INFO Deleting segment 328330912 from log ubt.mobilemonitor-2. (kafka.log.Log)* *[2015-09-16 10:32:20,144] INFO Deleting segment 331126657 from log ubt.mobilemonitor-2. (kafka.log.Log) * *.....* *[index files were deleted]* *[2015-09-16 10:32:20,360] INFO Deleting index /data05/kafka/ubt.mobilemonitor-2/00000000000324603130.index.deleted (kafka.log.OffsetIndex)* *[2015-09-16 10:32:20,448] INFO Deleting index /data05/kafka/ubt.mobilemonitor-2/00000000000325534373.index.deleted (kafka.log.OffsetIndex)* *[2015-09-16 10:32:20,478] INFO Deleting index /data05/kafka/ubt.mobilemonitor-2/00000000000326466801.index.deleted (kafka.log.OffsetIndex)* *.....* *[FetchThread started to fetch replicas]* *[2015-09-16 10:33:02,021] INFO Rolled new log segment for 'ubt.mobilemonitor-2' in 1 ms. (kafka.log.Log)* *[2015-09-16 10:33:34,441] INFO Rolled new log segment for 'ubt.mobilemonitor-2' in 1 ms. (kafka.log.Log)* *[2015-09-16 10:34:05,636] INFO Rolled new log segment for 'ubt.mobilemonitor-2' in 1 ms. (kafka.log.Log)* *[2015-09-16 10:34:36,266] INFO Rolled new log segment for 'ubt.mobilemonitor-2' in 1 ms. (kafka.log.Log)* *[2015-09-16 10:35:07,829] INFO Rolled new log segment for 'ubt.mobilemonitor-2' in 1 ms. (kafka.log.Log)* *[From 10:33:02 to 11:46:03, all segments were recovered, and ISR expended]* *[2015-09-16 11:46:03,679] INFO Partition [ubt.mobilemonitor,2] on broker 1088254: Expanding ISR for partition [ubt.mobilemonitor,2] from 1088254 to 1088254,10811396 (kafka.cluster.Partition) * -- With Warm regards Yiming Liu (刘一鸣)