Never mind, I found the answer. An unexpected cron job was firing every 5 minutes 
and blasting the cluster with connections from 2k+ additional servers.
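
In case it helps anyone else: one way to confirm this kind of periodic connection 
storm is to sample the number of established connections on the broker port and 
watch for the 5-minute cadence. A minimal sketch, assuming psutil is installed on 
the broker host and the listener port is 9092 as in my config (run with enough 
privileges to see all sockets):

#!/usr/bin/env python3
# Count ESTABLISHED TCP connections to the broker port every 30 seconds;
# a connection storm from a */5 cron shows up as a sharp jump roughly
# every 5 minutes.
import time

import psutil  # assumption: psutil is available on the broker host

BROKER_PORT = 9092  # assumption: matches the PLAINTEXT listener port

while True:
    established = sum(
        1
        for conn in psutil.net_connections(kind="tcp")
        if conn.status == psutil.CONN_ESTABLISHED
        and conn.laddr
        and conn.laddr.port == BROKER_PORT
    )
    print(time.strftime("%H:%M:%S"), "established connections:", established)
    time.sleep(30)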


> On Apr 8, 2020, at 10:46, Jacek Szewczyk <jacek7...@gmail.com> wrote:
> 
> Hi All,
>  
> I am seeing strange behavior with Kafka 2.0.0.3.1.4. My cluster contains 9 
> brokers + 3 dedicated ZooKeepers, and for an unknown reason there is a CPU 
> spike every 5 minutes which causes timeouts between producers, consumers, and 
> brokers. Basically, every 5 minutes CPU spikes to 90+% and at the same time 
> network utilization drops to almost 0 (it should be in the 100 MBps range).
> Each broker has 64G of memory, heap set to 9G, 8 cores, and a 4G uplink. I 
> have 500 partitions (replication=2) and there are around 1000 producers 
> sending data at 1-minute intervals. The aggregate input rate is around 
> 600k messages/s.
> Here is my config:
>  
> auto.create.topics.enable=false
> auto.leader.rebalance.enable=true
> compression.type=producer
> controlled.shutdown.enable=true
> controlled.shutdown.max.retries=3
> controlled.shutdown.retry.backoff.ms=5000
> controller.message.queue.size=10
> controller.socket.timeout.ms=30000
> default.replication.factor=2
> delete.topic.enable=true
> leader.imbalance.check.interval.seconds=300
> leader.imbalance.per.broker.percentage=10
> listeners=PLAINTEXT://localhost:9092
> log.cleanup.interval.mins=10
> log.dirs=/diskc/kafka-logs,/diskd/kafka-logs,/diske/kafka-logs,/diskf/kafka-logs,/diskg/kafka-logs,/diskh/kafka-logs,/diskj/kafka-logs,/diskk/kafka-logs
> log.index.interval.bytes=4096
> log.index.size.max.bytes=10485760
> log.retention.bytes=-1
> log.retention.check.interval.ms=600000
> log.retention.hours=24
> log.roll.hours=24
> log.segment.bytes=1073741824
> message.max.bytes=1000000
> min.insync.replicas=1
> num.io.threads=8
> num.network.threads=3000
> num.partitions=100
> num.recovery.threads.per.data.dir=4
> num.replica.fetchers=4
> offset.metadata.max.bytes=4096
> offsets.commit.required.acks=-1
> offsets.commit.timeout.ms=5000
> offsets.load.buffer.size=5242880
> offsets.retention.check.interval.ms=600000
> offsets.retention.minutes=86400000
> offsets.topic.compression.codec=0
> offsets.topic.num.partitions=50
> offsets.topic.replication.factor=3
> offsets.topic.segment.bytes=104857600
> producer.metrics.enable=false
> producer.purgatory.purge.interval.requests=10000
> queued.max.requests=500
> replica.fetch.max.bytes=1048576
> replica.fetch.min.bytes=1
> replica.fetch.wait.max.ms=500
> replica.high.watermark.checkpoint.interval.ms=5000
> replica.lag.max.messages=4000
> replica.lag.time.max.ms=10000
> replica.socket.receive.buffer.bytes=65536
> replica.socket.timeout.ms=30000
> sasl.enabled.mechanisms=GSSAPI
> sasl.mechanism.inter.broker.protocol=GSSAPI
> security.inter.broker.protocol=PLAINTEXT
> socket.receive.buffer.bytes=102400
> socket.request.max.bytes=104857600
> socket.send.buffer.bytes=102400
> zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
> zookeeper.connection.timeout.ms=25000
> zookeeper.session.timeout.ms=30000
> zookeeper.sync.time.ms=2000
>  
> In the logs, every spike starts with ISR shrinking and continues with 
> timeouts like this:
> INFO [Partition partition-220 broker=1010] Shrinking ISR from 1010,1006 to 
> 1010 (kafka.cluster.Partition)
> And a ton of messages like:
> WARN Attempting to send response via channel for which there is no open 
> connection, connection IP:9092-IP:45520-1 (kafka.network.Processor)
> WARN [ReplicaFetcher replicaId=1010, leaderId=1006, fetcherId=0] Error in 
> response for fetch request (type=FetchRequest, replicaId=1010, maxWait=500, 
> minBytes=1, maxBytes=10485760, fetchData={topic-312=(offset=984223132, 
> logStartOffset=916099079, maxBytes=1048576)}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1467131318, 
> epoch=5031)) (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 1006 was disconnected before the response 
> was read
>         at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
>         at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:240)
>         at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:43)
>         at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:149)
>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:114)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
>  
>  
> I’ve also tried topics with no replication, even for __consumer_offsets, but 
> the result was the same. The only setting that makes a difference is a lower 
> number of partitions: going from 500 to 200 makes things more stable, but the 
> 5-minute spike still exists.
> I’ve played around with multiple settings and the issue persists no matter what.
>  
> I would be grateful if anyone could comment on the CPU spikes and shed some 
> light on how to fix/improve this.
>  
> Thanks,
> Jacek
