[ https://issues.apache.org/jira/browse/KAFKA-9395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brian Byrne resolved KAFKA-9395.
--------------------------------
    Assignee: Rajini Sivaram  (was: Brian Byrne)
  Resolution: Done

> Improve Kafka scheduler's periodic maybeShrinkIsr()
> ---------------------------------------------------
>
>                 Key: KAFKA-9395
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9395
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Brian Byrne
>            Assignee: Rajini Sivaram
>            Priority: Major
>
> The ReplicaManager schedules a periodic call to maybeShrinkIsr() with the
> KafkaScheduler, with a period of replica.lag.time.max.ms / 2. While
> replica.lag.time.max.ms defaults to 30s, my setup used 45s, which means
> maybeShrinkIsr() was being called every 22.5 seconds. Normally this is not
> a problem.
> Fetch/produce requests hold a partition's leaderIsrUpdateLock in reader
> mode while they are running. When a partition is asked to check whether it
> should shrink its ISR, it acquires the write lock. So there is potential
> for contention here, and if the fetch/produce requests are long-running,
> they may block maybeShrinkIsr() for hundreds of milliseconds (see the
> first sketch below).
> This becomes a problem due to the way the scheduler runnable is set up: it
> calls maybeShrinkIsr() for every partition in a single scheduler
> invocation. If there are a lot of partitions, this can take many seconds,
> even minutes. However, the runnable is scheduled via
> ScheduledThreadPoolExecutor#scheduleAtFixedRate, which means that if it
> exceeds its period, it is immediately scheduled to run again. So it backs
> up to the point that the scheduler is always executing this function.
> This may cause partitions to check their ISR far less frequently than
> intended. It is also a huge source of contention in cases where the
> produce/fetch requests are long-running.
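>
> To make the locking pattern concrete, here is a minimal sketch using
> java.util.concurrent directly. This is not Kafka's actual Partition code;
> the class name, handleFetchOrProduce(), and the simulated request duration
> are illustrative stand-ins.
> {code:java}
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.locks.ReentrantReadWriteLock;
>
> public class IsrLockContention {
>     private final ReentrantReadWriteLock leaderIsrUpdateLock =
>             new ReentrantReadWriteLock();
>
>     // Fetch/produce path: holds the lock in reader mode for the duration
>     // of the request. Many readers may run concurrently.
>     void handleFetchOrProduce() throws InterruptedException {
>         leaderIsrUpdateLock.readLock().lock();
>         try {
>             TimeUnit.MILLISECONDS.sleep(500); // simulate a long-running request
>         } finally {
>             leaderIsrUpdateLock.readLock().unlock();
>         }
>     }
>
>     // Periodic ISR check: needs the write lock, so it must wait for every
>     // in-flight reader to drain before it can proceed.
>     void maybeShrinkIsr() {
>         leaderIsrUpdateLock.writeLock().lock();
>         try {
>             // inspect replica lag and shrink the ISR if needed
>         } finally {
>             leaderIsrUpdateLock.writeLock().unlock();
>         }
>     }
> }
> {code}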
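>
> Similarly, the backlog behavior of scheduleAtFixedRate can be demonstrated
> in isolation. The period and task duration below are arbitrary illustrative
> values, not the broker's: when a run takes longer than the period, overdue
> executions are queued and successive runs start back to back, so the worker
> thread never goes idle.
> {code:java}
> import java.util.concurrent.ScheduledThreadPoolExecutor;
> import java.util.concurrent.TimeUnit;
>
> public class FixedRateBacklog {
>     public static void main(String[] args) throws InterruptedException {
>         ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
>         long start = System.nanoTime();
>
>         // Period of 100 ms, but each run takes ~300 ms. Runs are printed
>         // at roughly 0 ms, 300 ms, 600 ms, ...: back to back, no idle gap.
>         executor.scheduleAtFixedRate(() -> {
>             long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
>             System.out.println("run started at " + elapsedMs + " ms");
>             try {
>                 TimeUnit.MILLISECONDS.sleep(300); // simulate iterating all partitions
>             } catch (InterruptedException e) {
>                 Thread.currentThread().interrupt();
>             }
>         }, 0, 100, TimeUnit.MILLISECONDS);
>
>         TimeUnit.SECONDS.sleep(2); // observe a few runs
>         executor.shutdownNow();
>     }
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)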