Hi all,

We are running Apache Kafka 2.1 on JVM 1.8_163 on Ubuntu 18 LTS, on AWS r5a.2xl instances. The log directory resides on an LVM volume built from 6 gp2 EBS volumes.
After upgrading from 1.1 to 2.1 we started to see strange behavior on consumers: a consumer fails because the last offset it reads is lower than the previous one. Looking at the logs, we found the following entries, both on the same broker, in server.log:

[2019-04-18 07:03:41,016] INFO [Log partition=adunit_events-103, dir=/var/lib/kafka/data] Rolled new log segment at offset 3605813651 in 1 ms. (kafka.log.Log)
...
[2019-04-18 07:09:32,580] INFO [Log partition=adunit_events-103, dir=/var/lib/kafka/data] Incrementing log start offset to 3560783759 (kafka.log.Log)

As you can see, 6 minutes after a new log segment was rolled, the offset went back from 3605813651 to 3560783759. This happened during a cluster rolling restart, but not immediately after the broker restart.

JVM config:

java -Xms12g -Xmx12g -XX:+ExplicitGCInvokesConcurrent -XX:MetaspaceSize=96m \
  -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80 \
  -XX:NewSize=9G -XX:MaxNewSize=9G -XX:InitiatingHeapOccupancyPercent=3 \
  -XX:G1MixedGCCountTarget=1 -XX:G1HeapWastePercent=1 \
  -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation \
  -XX:NumberOfGCLogFiles=3 -XX:GCLogFileSize=100M

Server config:

log.dirs=/var/lib/kafka/data
num.io.threads=3
num.network.threads=32
socket.send.buffer.bytes=33554432
socket.receive.buffer.bytes=33554432
socket.request.max.bytes=104857600
num.partitions=2
log.retention.hours=24
log.segment.bytes=536870912
log.retention.check.interval.ms=60000
zookeeper.connection.timeout.ms=1000000
controlled.shutdown.enable=true
auto.leader.rebalance.enable=true
log.cleaner.enable=true
log.cleaner.min.cleanable.ratio=0.1
log.cleaner.threads=1
log.cleanup.policy=delete
log.cleaner.delete.retention.ms=86400000
log.cleaner.io.max.bytes.per.second=1.7976931348623157E308
log.message.format.version=2.1
inter.broker.protocol.version=2.1
num.recovery.threads.per.data.dir=1
log.flush.interval.messages=9223372036854775807
message.max.bytes=10000000
replica.fetch.max.bytes=10000000
default.replication.factor=2
delete.topic.enable=true
unclean.leader.election.enable=false
compression.type=snappy

Any ideas? Thanks in advance!

Seva Feldman
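P.S. To show what we checked: the earliest and latest offsets for the affected partition can be compared with the GetOffsetShell tool that ships with Kafka (the broker address below is a placeholder):

```shell
# Earliest available offset (log start offset) for partition 103; --time -2 = earliest
bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list localhost:9092 \
  --topic adunit_events --partitions 103 --time -2

# Latest offset (log end offset) for the same partition; --time -1 = latest
bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list localhost:9092 \
  --topic adunit_events --partitions 103 --time -1
```

This requires a live broker, so run it against the cluster where the issue appears.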