Never mind, I found the answer. I had an unexpected cron job firing up every 5 minutes and blasting the cluster with connections from 2k+ additional servers.
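For anyone who hits the same symptoms: the damage comes from the synchronized connection churn, not from the cron itself. A minimal sketch of the pattern, assuming each cron run spun up a short-lived Kafka client (the broker address, topic, and payload below are made up for illustration, not the actual job):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CronRunSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");                 // hypothetical broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // A brand-new producer per cron run: every invocation opens fresh TCP
        // connections to the brokers (metadata fetch plus data connections) and
        // tears them down again. With 2k+ hosts all on the same */5 schedule,
        // that becomes a synchronized connection storm every 5 minutes.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("metrics-topic", "host-1", "payload")); // hypothetical topic/key/value
        }
        // A long-lived producer shared across runs would reuse its connections instead.
    }
}

Reusing one long-lived client per host, or at least staggering the schedule so thousands of hosts don't reconnect in the same instant, avoids the synchronized spike.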
> On Apr 8, 2020, at 10:46, Jacek Szewczyk <jacek7...@gmail.com> wrote:
>
> Hi All,
>
> I am seeing strange behavior with Kafka 2.0.0.3.1.4. My cluster contains 9 brokers + 3 dedicated ZooKeepers, and for an unknown reason there is a CPU spike every 5 minutes which causes timeouts between producers, consumers and brokers. Basically, every 5 minutes CPU spikes to 90+% and at the same time network utilization drops to almost 0 (it should be in the 100 MBps range).
> Each broker has 64G of memory, heap is set to 9G, and there are 8 cores and a 4G uplink. I have 500 partitions (replication=2) and around 1000 producers sending data at 1-minute intervals. The total input rate is around 600k/s.
> Here is my config:
>
> auto.create.topics.enable=false
> auto.leader.rebalance.enable=true
> compression.type=producer
> controlled.shutdown.enable=true
> controlled.shutdown.max.retries=3
> controlled.shutdown.retry.backoff.ms=5000
> controller.message.queue.size=10
> controller.socket.timeout.ms=30000
> default.replication.factor=2
> delete.topic.enable=true
> leader.imbalance.check.interval.seconds=300
> leader.imbalance.per.broker.percentage=10
> listeners=PLAINTEXT://localhost:9092
> log.cleanup.interval.mins=10
> log.dirs=/diskc/kafka-logs,/diskd/kafka-logs,/diske/kafka-logs,/diskf/kafka-logs,/diskg/kafka-logs,/diskh/kafka-logs,/diskj/kafka-logs,/diskk/kafka-logs
> log.index.interval.bytes=4096
> log.index.size.max.bytes=10485760
> log.retention.bytes=-1
> log.retention.check.interval.ms=600000
> log.retention.hours=24
> log.roll.hours=24
> log.segment.bytes=1073741824
> message.max.bytes=1000000
> min.insync.replicas=1
> num.io.threads=8
> num.network.threads=3000
> num.partitions=100
> num.recovery.threads.per.data.dir=4
> num.replica.fetchers=4
> offset.metadata.max.bytes=4096
> offsets.commit.required.acks=-1
> offsets.commit.timeout.ms=5000
> offsets.load.buffer.size=5242880
> offsets.retention.check.interval.ms=600000
> offsets.retention.minutes=86400000
> offsets.topic.compression.codec=0
> offsets.topic.num.partitions=50
> offsets.topic.replication.factor=3
> offsets.topic.segment.bytes=104857600
> producer.metrics.enable=false
> producer.purgatory.purge.interval.requests=10000
> queued.max.requests=500
> replica.fetch.max.bytes=1048576
> replica.fetch.min.bytes=1
> replica.fetch.wait.max.ms=500
> replica.high.watermark.checkpoint.interval.ms=5000
> replica.lag.max.messages=4000
> replica.lag.time.max.ms=10000
> replica.socket.receive.buffer.bytes=65536
> replica.socket.timeout.ms=30000
> sasl.enabled.mechanisms=GSSAPI
> sasl.mechanism.inter.broker.protocol=GSSAPI
> security.inter.broker.protocol=PLAINTEXT
> socket.receive.buffer.bytes=102400
> socket.request.max.bytes=104857600
> socket.send.buffer.bytes=102400
> zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
> zookeeper.connection.timeout.ms=25000
> zookeeper.session.timeout.ms=30000
> zookeeper.sync.time.ms=2000
>
> In the logs, each spike starts with ISR shrinking and continues with timeouts like this:
> INFO [Partition partition-220 broker=1010] Shrinking ISR from 1010,1006 to 1010 (kafka.cluster.Partition)
> And a ton of messages like:
> WARN Attempting to send response via channel for which there is no open connection, connection IP:9092-IP:45520-1 (kafka.network.Processor)
> WARN [ReplicaFetcher replicaId=1010, leaderId=1006, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1010, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={topic-312=(offset=984223132, logStartOffset=916099079, maxBytes=1048576)}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1467131318, epoch=5031)) (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 1006 was disconnected before the response was read
>     at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
>     at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
>     at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:240)
>     at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:43)
>     at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:149)
>     at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:114)
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
>
> I've also tried topics with no replication, even for __consumer_offsets, but the result was the same. The only setting that makes a difference is a lower number of partitions: going from 500 to 200 is more stable, but the 5-minute spike still exists.
> I've played around with multiple settings and the issue persists no matter what.
>
> I would be grateful if anyone could comment on the CPU spikes and shed some light on how to fix/improve this.
>
> Thanks,
> Jacek