Hi Ernar, I don’t think anyone has responded yet, so here’s my two cents’
worth (I’m not a Kafka ops expert, but I did ask our Kafka techops people;
the following are suggestions, however, not professional advice, which we do
also offer 😉):

Looks like there is more traffic at night and the cluster struggles to keep
up with replication (maybe; can’t tell without metrics).
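
To confirm that, it would be worth watching the UnderReplicatedPartitions
and ReplicaFetcherManager MaxLag JMX metrics overnight to see whether the
followers really do fall behind under load. You can also list the affected
partitions while an alert is firing, something like (a hypothetical
invocation; adjust the bootstrap server for your cluster):

bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe \
  --under-replicated-partitions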

Increasing replica.lag.time.max.ms is probably not going to solve the
underlying problem; it just increases the time before partitions are
reported as under-replicated.
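
For context, the Kafka default for replica.lag.time.max.ms is 30000 (30 s),
and your config sets it to 1000, so a follower only needs to fall one second
behind the leader before it is dropped from the ISR. If you do decide to
relax it, a minimal sketch (the value is illustrative, not a recommendation):

# how long a follower may lag before it is removed from the ISR
# (illustrative value; the Kafka default is 30000)
replica.lag.time.max.ms=10000

That would make the night-time ISR shrink/expand cycles less visible, but as
above, it hides the lag rather than fixing it.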

You could try increasing num.replica.fetchers, if there are enough resources
on the brokers.
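
A minimal sketch of that change (the value is illustrative; the default is
1, and each increment adds a fetcher thread per source broker, so it costs
some CPU and connections):

# parallel replica fetcher threads per source broker
# (illustrative value, assuming spare CPU/network headroom on the brokers)
num.replica.fetchers=4

More fetcher threads let a follower pull from its leaders in parallel, which
can help if a single fetcher thread can’t keep up during the traffic peak.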

Good luck!

Regards, Paul Brebner

From: Ernar Ratbek <ernar.rat...@gmail.com>
Date: Friday, 7 February 2025 at 3:46 pm
To: users@kafka.apache.org <users@kafka.apache.org>
Subject: Under replicated partition

Good day! I have 9 broker nodes and 3 KRaft controller nodes. It is at night
that I receive alerts that there are 2 under-replicated partitions, and they
resolve after about 30 seconds. From the logs you can see:
[2025-02-04 02:13:16,421] INFO [Partition
colvir.deposit.getclientdeposits.in-10 broker=11] Shrinking ISR from 11,9,8
to 11,8. Leader: (highWatermark: 77947, endOffset: 77948). Out of sync
replicas: (brokerId: 9, endOffset: 77947, lastCaughtUpTimeMs:
1738613593403). (kafka.cluster.Partition)
[2025-02-04 02:21:35,421] INFO [Partition
communication.notificationmanager.getnotificationstats.in-9 broker=11]
Shrinking ISR from 11,7,9 to 11,7. Leader: (highWatermark: 189276,
endOffset: 189277). Out of sync replicas: (brokerId: 9, endOffset: 189276,
lastCaughtUpTimeMs: 1738614094101). (kafka.cluster.Partition)
My server.properties:

############################# Other Settings #############################
group.initial.rebalance.delay.ms=0
log.retention.check.interval.ms=30000
log.retention.hours=24
log.roll.hours=1
log.segment.bytes=1073741824
num.io.threads=16
num.network.threads=8
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=3
socket.receive.buffer.bytes=1024000
socket.request.max.bytes=104857600
socket.send.buffer.bytes=1024000
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=3
zookeeper.connection.timeout.ms=10000
delete.topic.enable=True
replica.fetch.max.bytes=5242880
max.message.bytes=5242880
message.max.bytes=5242880
default.replication.factor=3
min.insync.replicas=2
replica.fetch.wait.max.ms=200
replica.lag.time.max.ms=1000

Should I increase replica.lag.time.max.ms?
