And after a 2nd (and 3rd) opinion, it does look like replica.lag.time.max.ms is set below the default value, so maybe try increasing it as a first step.

Paul

From: Brebner, Paul <paul.breb...@netapp.com.INVALID>
Date: Thursday, 13 February 2025 at 2:33 pm
To: users@kafka.apache.org <users@kafka.apache.org>
Subject: Re: Under replicated partition

Hi Ernar,

I don’t think anyone has responded yet, so here’s my 2 cents worth (I’m not a Kafka ops expert, but I did ask our Kafka techops people; the following are suggestions, however, not professional advice, which we do also offer 😉):

It looks like there is more traffic at night and the cluster struggles to replicate (maybe; can’t tell without metrics).
Increasing replica.lag.time.max.ms is probably not going to solve the problem; it just increases the time until partitions are reported as under-replicated.
You could try increasing num.replica.fetchers, if there are enough resources.

Good luck!

Regards, Paul Brebner

From: Ernar Ratbek <ernar.rat...@gmail.com>
Date: Friday, 7 February 2025 at 3:46 pm
To: users@kafka.apache.org <users@kafka.apache.org>
Subject: Under replicated partition

Good day!

I have 9 broker nodes and 3 KRaft controller nodes. At night I receive alerts that there are 2 under-replicated partitions, and they resolve after about 30 seconds. From the logs you can see:

[2025-02-04 02:13:16,421] INFO [Partition colvir.deposit.getclientdeposits.in-10 broker=11] Shrinking ISR from 11,9,8 to 11,8. Leader: (highWatermark: 77947, endOffset: 77948). Out of sync replicas: (brokerId: 9, endOffset: 77947, lastCaughtUpTimeMs: 1738613593403). (kafka.cluster.Partition)
[2025-02-04 02:21:35,421] INFO [Partition communication.notificationmanager.getnotificationstats.in-9 broker=11] Shrinking ISR from 11,7,9 to 11,7. Leader: (highWatermark: 189276, endOffset: 189277). Out of sync replicas: (brokerId: 9, endOffset: 189276, lastCaughtUpTimeMs: 1738614094101). (kafka.cluster.Partition)

My server properties:

############################# Other Settings #############################
group.initial.rebalance.delay.ms=0
log.retention.check.interval.ms=30000
log.retention.hours=24
log.roll.hours=1
log.segment.bytes=1073741824
num.io.threads=16
num.network.threads=8
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=3
socket.receive.buffer.bytes=1024000
socket.request.max.bytes=104857600
socket.send.buffer.bytes=1024000
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=3
zookeeper.connection.timeout.ms=10000
delete.topic.enable=True
replica.fetch.max.bytes=5242880
max.message.bytes=5242880
message.max.bytes=5242880
default.replication.factor=3
min.insync.replicas=2
replica.fetch.wait.max.ms=200
replica.lag.time.max.ms=1000

Should I increase replica.lag.time.max.ms?
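
For anyone who wants to check how far behind broker 9 actually was when the ISR shrank, here is a rough Python sketch that compares each log line's timestamp with its lastCaughtUpTimeMs. It is only an illustration, not part of any Kafka tooling: the function name follower_lag_ms and the regular expression are made up for this example, and it assumes the broker log timestamps are local time at UTC+6 (inferred from the roughly six-hour gap between the log timestamps and the epoch values, not stated anywhere in the thread).

```python
import re
from datetime import datetime, timedelta, timezone

# ASSUMPTION: broker logs are written in local time at UTC+6.
# This is inferred from the posted values, not confirmed; adjust if wrong.
ASSUMED_LOG_TZ = timezone(timedelta(hours=6))

# Value taken from the posted server properties.
REPLICA_LAG_TIME_MAX_MS = 1000

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical pattern for the "Shrinking ISR" lines quoted in the thread.
LINE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\].*?"
    r"brokerId: (?P<broker>\d+), endOffset: \d+, "
    r"lastCaughtUpTimeMs: (?P<caught_up>\d+)"
)


def follower_lag_ms(log_line, tz=ASSUMED_LOG_TZ):
    """Return (broker_id, ms between last catch-up and the ISR shrink)."""
    m = LINE_RE.search(log_line)
    if m is None:
        raise ValueError("not an ISR-shrink line in the expected format")
    logged = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f").replace(tzinfo=tz)
    # Integer division of timedeltas avoids float rounding on the epoch value.
    logged_ms = (logged - EPOCH) // timedelta(milliseconds=1)
    return int(m["broker"]), logged_ms - int(m["caught_up"])


# The two log lines from the original message.
LINES = [
    "[2025-02-04 02:13:16,421] INFO [Partition "
    "colvir.deposit.getclientdeposits.in-10 broker=11] Shrinking ISR from "
    "11,9,8 to 11,8. Leader: (highWatermark: 77947, endOffset: 77948). "
    "Out of sync replicas: (brokerId: 9, endOffset: 77947, "
    "lastCaughtUpTimeMs: 1738613593403). (kafka.cluster.Partition)",
    "[2025-02-04 02:21:35,421] INFO [Partition "
    "communication.notificationmanager.getnotificationstats.in-9 broker=11] "
    "Shrinking ISR from 11,7,9 to 11,7. Leader: (highWatermark: 189276, "
    "endOffset: 189277). Out of sync replicas: (brokerId: 9, endOffset: "
    "189276, lastCaughtUpTimeMs: 1738614094101). (kafka.cluster.Partition)",
]

for line in LINES:
    broker, lag = follower_lag_ms(line)
    over = lag > REPLICA_LAG_TIME_MAX_MS
    print(f"broker {broker}: lag {lag} ms, exceeds configured limit: {over}")
```

Under the UTC+6 assumption this gives lags of 3018 ms and 1320 ms for broker 9. Both are above the configured replica.lag.time.max.ms=1000 but far below the 30000 ms default used by Kafka 2.5 and later, which fits the suggestion to raise the setting as a first step.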