[ 
https://issues.apache.org/jira/browse/KAFKA-15766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cameron updated KAFKA-15766:
----------------------------
    Description: 
Hello,

This is my first dive into the Kafka source code, so I may be misinterpreting 
some of the logic outlined below. Please let me know if my understanding is not 
correct.

--- 

After upgrading a relatively large Kafka cluster from IBP 2.5 to 3.5 (Kafka 
version v3.5.0), we've had numerous problems with controller failover and 
subsequent elections not being properly acknowledged by a subset of the 
cluster's brokers. These affected brokers will temporarily become unresponsive 
due to complete request handler thread starvation via failed 
ALLOCATE_PRODUCER_IDS and ALTER_PARTITION requests to the old, stopped 
controller. In the case of a disconnected response, as seen here, the 
broker-to-controller channel will place these failed requests at the 
*beginning* of the queue, effectively deadlocking the request handler thread 
pool. Depending on how quickly the configured queued.max.requests request queue 
fills, the new controller's UPDATE_METADATA request to notify the broker of the 
controller election result may not even be able to reach the affected broker 
to update the new controller ID. 
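
To make the failure mode concrete, below is a minimal sketch of the requeue 
behavior as I understand it from reading the broker-to-controller channel code 
(BrokerToControllerChannelManager / BrokerToControllerRequestThread, which show 
up in the logs further down). The class and member names here are simplified 
and illustrative only, not the actual Kafka source:
{code:java}
import java.util.concurrent.LinkedBlockingDeque;

// Illustrative sketch only -- simplified from my reading of the
// broker-to-controller request thread; names are NOT the real ones.
class ControllerChannelSketch {

    interface QueueItem { void onComplete(); }

    // Pending controller-bound requests (ALLOCATE_PRODUCER_IDS, ALTER_PARTITION, ...)
    private final LinkedBlockingDeque<QueueItem> requestQueue = new LinkedBlockingDeque<>();

    void handleResponse(QueueItem item, boolean wasDisconnected) {
        if (wasDisconnected) {
            // The cached controller is forgotten and the SAME request is retried
            // first, ahead of anything else already waiting in the queue.
            forgetControllerAddress();
            requestQueue.addFirst(item);   // <-- the "front of the queue" behavior
        } else {
            item.onComplete();
        }
    }

    private void forgetControllerAddress() {
        // placeholder: clears the cached controller node until the broker
        // learns (via UPDATE_METADATA in ZK mode) who the new controller is
    }
}
{code}
As I read it, while the controller address is stale or unknown the sender 
thread keeps retrying the item at the head of the deque on a short backoff, so 
requests queued behind it make no progress until the broker learns the new 
controller's address.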

For context, the relevant configurations of my cluster are:
 * num.network.threads = 16
 * num.io.threads = 8
 * queued.max.requests = 500
 * inter.broker.protocol.version = 3.5
 * security.inter.broker.protocol = SASL_SSL
 * sasl.enabled.mechanisms = SCRAM-SHA-256

And the cluster itself is set up as:
 * 9 brokers
 * ~4000 partitions / broker [aggregate leader + replica]
 * ~20 MB/s data in across cluster

 

I am not able to consistently reproduce the issue, but it happens frequently 
enough to cause major pain in our setup. The general order of events is:
 # A controller election takes place (due to intentional failover or otherwise).
 # The new controller successfully registers its znode with ZooKeeper and 
initiates an UPDATE_METADATA request to notify the other brokers in the cluster.
 # The problematic broker, noticing it has lost its connection to the old 
controller but not yet having received and/or processed the incoming 
UPDATE_METADATA request, retries (every ~50 ms, as hardcoded) 
ALLOCATE_PRODUCER_IDS and ALTER_PARTITION requests against the old controller, 
receives disconnect responses, and places each failed request back at the front 
of the request queue. A sketch of how this ends up starving the request handler 
pool follows this list.
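
My reading of why this starves the request handler pool: on IBP 3.0+, a request 
handler thread serving an InitProducerId call blocks waiting for the next 
producer ID block to come back from the controller, roughly like the sketch 
below. The names are approximate and illustrative, not the actual Kafka source 
(the real logic lives in kafka.coordinator.transaction.RPCProducerIdManager, 
which appears in the WARN log further down):
{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only -- NOT the actual Kafka source; names are approximate.
class ProducerIdWaitSketch {

    static class ProducerIdsBlock { /* firstId + size, omitted */ }

    // Filled in asynchronously when the ALLOCATE_PRODUCER_IDS response arrives.
    private final BlockingQueue<ProducerIdsBlock> nextBlock = new ArrayBlockingQueue<>(1);

    // Called from a KafkaRequestHandler (num.io.threads pool) thread.
    long generateProducerId(long maxWaitMs) throws InterruptedException {
        maybeRequestNextBlockFromController();              // enqueues ALLOCATE_PRODUCER_IDS
        ProducerIdsBlock block = nextBlock.poll(maxWaitMs, TimeUnit.MILLISECONDS);
        if (block == null) {
            // Surfaces as "Timed out waiting for next producer ID block";
            // meanwhile this handler thread was unavailable for other requests.
            throw new IllegalStateException("Timed out waiting for next producer ID block");
        }
        return nextIdFrom(block);
    }

    private void maybeRequestNextBlockFromController() { /* sends via the controller channel */ }

    private long nextIdFrom(ProducerIdsBlock block) { return 0L; /* placeholder */ }
}
{code}
If the controller-bound ALLOCATE_PRODUCER_IDS request is stuck at the head of 
the channel queue as described above, every handler thread that hits this path 
sits in the timed wait and eventually fails, which matches the "Timed out 
waiting for next producer ID block" WARN in the logs below.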

In this scenario, the needed UPDATE_METADATA request takes much longer to 
process. In our specific cluster this sometimes auto-resolves within 3-5 
minutes, whereas other times it requires manual intervention to bounce the 
broker and allow it to establish a connection with the new controller. During 
this time, I see the corresponding Kafka server logs:

+Affected Broker Logs+
{code:java}
[2023-10-24 01:30:00,104] INFO
[zk-broker-4-to-controller-forwarding-channel-manager]: Recorded new 
controller, from now on will use node kafka3-app.prod.net:9192 (id: 3 rack: ) 
(kafka.server.BrokerToControllerRequestThread)

[2023-10-24 01:30:00,104] WARN [BrokerToControllerChannelManager id=4 
name=forwarding] Connection to node 3 (kafka3-app.prod.net:9192) could not be 
established. Broker may not be available. 
(org.apache.kafka.clients.NetworkClient)

[2023-10-24 01:30:00,104] INFO [BrokerToControllerChannelManager id=4 
name=forwarding] Node 3 disconnected. (org.apache.kafka.clients.NetworkClient) 

...{code}
{code:java}
[2023-10-26 02:42:35,166] WARN [RPC ProducerId Manager 103]: Timed out when 
requesting AllocateProducerIds from the controller. 
(kafka.coordinator.transaction.RPCProducerIdManager)  
org.apache.kafka.common.errors.CoordinatorLoadInProgressException: Timed out 
waiting for next producer ID block

...{code}
 

+New Controller Logs+
{code:java}
[2023-10-24 01:33:20,569] WARN [RequestSendThread controllerId=11] Controller 
11's connection to broker kafka4-app.prod.net:9192 (id: 4 rack: ) was 
unsuccessful (kafka.controller.RequestSendThread)

java.net.SocketTimeoutException: Failed to connect within 30000 ms
at 
kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:297)
at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:249)
at 
org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:127)
 

...{code}
 

This issue seems very similar to some previous tickets I found:
 * https://issues.apache.org/jira/browse/KAFKA-7697 
 * https://issues.apache.org/jira/browse/KAFKA-14694 

Unfortunately I have not been able to grab a thread dump because of the 
sporadic nature of the issue, but I will update the ticket if I manage to 
capture one.

 

This may also affect other versions of Kafka, but we've only experienced it 
with v3.5.0 + IBP 3.5. Curiously enough, I haven't encountered this issue on 
numerous smaller-throughput clusters, so there seems to be some sort of 
load-induced trigger here. I have also not seen this issue with v3.5.0 + IBP 
2.5, presumably because the ALTER_PARTITION and ALLOCATE_PRODUCER_IDS features 
are not enabled until IBP 2.7 and 3.0, respectively.
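
On the IBP point, my understanding is that the broker only wires in the 
RPC-based producer ID manager (and therefore the ALLOCATE_PRODUCER_IDS path to 
the controller) when the inter-broker protocol version is at least 3.0; 
conceptually something like the sketch below, where every name is illustrative 
rather than the actual startup code. This would explain why IBP 2.5 never 
exercises this failure path:
{code:java}
// Illustrative sketch only -- names are approximate, not actual Kafka source.
interface ProducerIdManager { long generateProducerId() throws Exception; }

class ProducerIdManagerSelection {

    // Conceptually what happens at broker startup in ZK mode:
    static ProducerIdManager forIbp(String interBrokerProtocolVersion) {
        if (isAtLeast(interBrokerProtocolVersion, "3.0")) {
            // IBP >= 3.0: producer ID blocks are requested from the controller
            // over the broker-to-controller channel (ALLOCATE_PRODUCER_IDS).
            return newRpcProducerIdManager();
        } else {
            // Older IBP: producer ID blocks are (as I understand it) still
            // allocated directly through ZooKeeper, so a controller failover
            // does not block this path.
            return newZkProducerIdManager();
        }
    }

    private static boolean isAtLeast(String ibp, String min) {
        return ibp.compareTo(min) >= 0;   // simplified version comparison
    }

    private static ProducerIdManager newRpcProducerIdManager() { return () -> 0L; }
    private static ProducerIdManager newZkProducerIdManager() { return () -> 0L; }
}
{code}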

 

Is there any guidance on whether this failure path is possible as described and 
what steps can be taken to mitigate the issue?

Thank you!

 


> Possible request handler thread exhaustion in controller election
> -----------------------------------------------------------------
>
>                 Key: KAFKA-15766
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15766
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.5.0
>            Reporter: Cameron
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
