[
https://issues.apache.org/jira/browse/KAFKA-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Maytee Chinavanichkit updated KAFKA-6051:
-----------------------------------------
Description:
The ReplicaFetcherBlockingSend works as designed and blocks until it is
able to get data. This becomes a problem when we are gracefully shutting down a
broker. The controller will attempt to shut down the fetchers and elect new
leaders. When the last partition of a fetcher is removed as part of the
{{replicaManager.becomeLeaderOrFollower}} call, the call proceeds to shut down
any idle ReplicaFetcherThread. This shutdown can block until the last fetch
request completes. The delay is a big problem because the
{{replicaStateChangeLock}} and the {{mapLock}} in {{AbstractFetcherManager}} are
still held, causing latency spikes on multiple brokers.
At this point the last response is no longer needed, as the fetcher is
shutting down. We should close the leaderEndpoint early, during
{{initiateShutdown()}}, instead of after {{super.shutdown()}}.
For example, the log below shows the shutdown blocking the broker from
processing further replica changes for ~500 ms:
{code}
[2017-09-01 18:11:42,879] INFO [ReplicaFetcherThread-0-2], Shutting down
(kafka.server.ReplicaFetcherThread)
[2017-09-01 18:11:43,314] INFO [ReplicaFetcherThread-0-2], Stopped
(kafka.server.ReplicaFetcherThread)
[2017-09-01 18:11:43,314] INFO [ReplicaFetcherThread-0-2], Shutdown completed
(kafka.server.ReplicaFetcherThread)
{code}
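The effect of the proposed early close can be sketched in plain Java. This is a minimal simulation, not Kafka's actual code: the {{BlockingSend}}, {{sendRequest()}}, and {{EarlyCloseSketch}} names are illustrative. A "fetcher" thread blocks on a socket read against a leader that never responds; closing the endpoint (as {{initiateShutdown()}} would, under this proposal) wakes the blocked thread immediately instead of waiting for the in-flight request:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the early-close idea; names are illustrative,
// not Kafka's actual API.
public class EarlyCloseSketch {
    static class BlockingSend {
        private final Socket socket;

        BlockingSend(InetSocketAddress addr) throws IOException {
            socket = new Socket(addr.getAddress(), addr.getPort());
        }

        // Blocks until the peer responds or the socket is closed.
        int sendRequest() throws IOException {
            return socket.getInputStream().read();
        }

        // Closing the socket wakes up any thread blocked in sendRequest().
        void close() throws IOException {
            socket.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // A "leader" that accepts connections but never replies,
        // simulating a slow fetch response.
        ServerSocket leader = new ServerSocket(0);
        Thread acceptor = new Thread(() -> {
            try { leader.accept(); } catch (IOException ignored) { }
        });
        acceptor.start();

        BlockingSend endpoint = new BlockingSend(
            new InetSocketAddress("127.0.0.1", leader.getLocalPort()));

        Thread fetcher = new Thread(() -> {
            try {
                endpoint.sendRequest(); // blocks: leader never replies
            } catch (IOException expected) {
                // Socket closed during shutdown; fetcher exits promptly.
            }
        });
        fetcher.start();
        Thread.sleep(100); // let the fetcher block in sendRequest()

        long start = System.nanoTime();
        endpoint.close();   // the proposed early close in initiateShutdown()
        fetcher.join(5000); // completes without waiting for a fetch response
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        System.out.println("fetcher alive: " + fetcher.isAlive());
        System.out.println("shutdown took ~" + elapsedMs + " ms");
        leader.close();
    }
}
```

Without the early close, the join would stall until the fetch request returned (or timed out), which is the ~500 ms gap seen in the log above.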
was:
The ReplicaFetcherBlockingSend works as designed and blocks until it is
able to get data. This becomes a problem when we are gracefully shutting down a
broker. The controller will attempt to shut down the fetchers and elect new
leaders. When the last partition of a fetcher is removed as part of the
{replicaManager.becomeLeaderOrFollower} call, the call proceeds to shut down
any idle ReplicaFetcherThread. This shutdown can block until the last fetch
request completes. The delay is a big problem because the
{replicaStateChangeLock} and the {mapLock} in {AbstractFetcherManager} are
still held, causing latency spikes on multiple brokers.
At this point the last response is no longer needed, as the fetcher is
shutting down. We should close the leaderEndpoint early, during
{initiateShutdown()}, instead of after {super.shutdown()}.
For example, the log below shows the shutdown blocking the broker from
processing further replica changes for ~500 ms:
{code}
[2017-09-01 18:11:42,879] INFO [ReplicaFetcherThread-0-2], Shutting down
(kafka.server.ReplicaFetcherThread)
[2017-09-01 18:11:43,314] INFO [ReplicaFetcherThread-0-2], Stopped
(kafka.server.ReplicaFetcherThread)
[2017-09-01 18:11:43,314] INFO [ReplicaFetcherThread-0-2], Shutdown completed
(kafka.server.ReplicaFetcherThread)
{code}
> ReplicaFetcherThread should close the ReplicaFetcherBlockingSend earlier on
> shutdown
> ------------------------------------------------------------------------------------
>
> Key: KAFKA-6051
> URL: https://issues.apache.org/jira/browse/KAFKA-6051
> Project: Kafka
> Issue Type: Bug
> Reporter: Maytee Chinavanichkit
>
> The ReplicaFetcherBlockingSend works as designed and blocks until it is
> able to get data. This becomes a problem when we are gracefully shutting down
> a broker. The controller will attempt to shut down the fetchers and elect new
> leaders. When the last partition of a fetcher is removed as part of the
> {{replicaManager.becomeLeaderOrFollower}} call, the call proceeds to shut down
> any idle ReplicaFetcherThread. This shutdown can block until the last fetch
> request completes. The delay is a big problem because the
> {{replicaStateChangeLock}} and the {{mapLock}} in {{AbstractFetcherManager}}
> are still held, causing latency spikes on multiple brokers.
> At this point the last response is no longer needed, as the fetcher is
> shutting down. We should close the leaderEndpoint early, during
> {{initiateShutdown()}}, instead of after {{super.shutdown()}}.
> For example, the log below shows the shutdown blocking the broker from
> processing further replica changes for ~500 ms:
> {code}
> [2017-09-01 18:11:42,879] INFO [ReplicaFetcherThread-0-2], Shutting down
> (kafka.server.ReplicaFetcherThread)
> [2017-09-01 18:11:43,314] INFO [ReplicaFetcherThread-0-2], Stopped
> (kafka.server.ReplicaFetcherThread)
> [2017-09-01 18:11:43,314] INFO [ReplicaFetcherThread-0-2], Shutdown completed
> (kafka.server.ReplicaFetcherThread)
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)