[ 
https://issues.apache.org/jira/browse/KAFKA-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774661#comment-16774661
 ] 

ASF GitHub Bot commented on KAFKA-7974:
---------------------------------------

nickbp commented on pull request #6305: Fix for KAFKA-7974: Avoid calling 
disconnect() when not yet connecting
URL: https://github.com/apache/kafka/pull/6305
 
 
   When attempting to get topic list via KafkaAdminClient against a server that 
isn't resolvable, the worker thread can get killed as follows, leading to a 
zombie KafkaAdminClient:
   
   ```
   ERROR [kafka-admin-client-thread | adminclient-1] 2019-02-18 01:00:45,597 KafkaThread.java:51 - Uncaught exception in thread 'kafka-admin-client-thread | adminclient-1':
   java.lang.IllegalStateException: No entry found for connection 0
       at org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:330)
       at org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:134)
       at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:921)
       at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:287)
       at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.sendEligibleCalls(KafkaAdminClient.java:898)
       at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1113)
       at java.lang.Thread.run(Thread.java:748)
   ```
   
   It looks like the cause is a bug in state handling between `NetworkClient` and 
`ClusterConnectionStates`:
   - `NetworkClient.ready()` invokes `this.initiateConnect()` as seen in the 
above stacktrace
   - `NetworkClient.initiateConnect()` invokes 
`ClusterConnectionStates.connecting()`, which internally invokes 
`ClientUtils.resolve()` to resolve the host when creating an entry for the 
connection.
   - If this host lookup fails, an `UnknownHostException` can be thrown back to 
`NetworkClient.initiateConnect()` and the connection entry is not created in 
`ClusterConnectionStates`. This exception doesn't currently get logged, so this 
is a guess on my part.
   - `NetworkClient.initiateConnect()` catches the exception and attempts to 
call `ClusterConnectionStates.disconnected()`, which throws an 
`IllegalStateException` because no entry had yet been created due to the lookup 
failure.
   - This `IllegalStateException` ends up killing the worker thread and 
`KafkaAdminClient` gets stuck, never returning from `listTopics()`.
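   
   The sequence above can be sketched with a simplified model. This is not 
Kafka's code: `ConnectionStatesSketch`, its `Resolver` interface, and both 
`initiateConnect*` helpers are hypothetical names invented for illustration, 
and the guard shown is only one way to avoid the failure; the actual change in 
this PR may differ.
   
   ```java
   import java.net.UnknownHostException;
   import java.util.HashMap;
   import java.util.Map;

   /** Hypothetical stand-in for the connection-state bookkeeping; NOT Kafka's actual classes. */
   class ConnectionStatesSketch {
       /** Assumed hook so the DNS failure can be simulated deterministically. */
       interface Resolver { void resolve(String host) throws UnknownHostException; }

       enum State { CONNECTING, DISCONNECTED }

       private final Map<String, State> states = new HashMap<>();
       private final Resolver resolver;

       ConnectionStatesSketch(Resolver resolver) { this.resolver = resolver; }

       /** Resolves the host first, so no entry exists if the lookup throws. */
       void connecting(String id, String host) throws UnknownHostException {
           resolver.resolve(host);
           states.put(id, State.CONNECTING);
       }

       /** Fails for unknown ids, like the "No entry found" error in the stacktrace. */
       void disconnected(String id) {
           if (!states.containsKey(id))
               throw new IllegalStateException("No entry found for connection " + id);
           states.put(id, State.DISCONNECTED);
       }

       boolean hasEntry(String id) { return states.containsKey(id); }

       /** Models the failing flow: cleanup assumes connecting() created an entry. */
       static void initiateConnectUnguarded(ConnectionStatesSketch s, String id, String host) {
           try {
               s.connecting(id, host);
           } catch (UnknownHostException e) {
               s.disconnected(id); // throws IllegalStateException after a failed lookup
           }
       }

       /** One possible guard: skip disconnected() when no entry was ever created. */
       static boolean initiateConnectGuarded(ConnectionStatesSketch s, String id, String host) {
           try {
               s.connecting(id, host);
               return true;
           } catch (UnknownHostException e) {
               if (s.hasEntry(id)) s.disconnected(id);
               return false;
           }
       }

       public static void main(String[] args) {
           Resolver failing = host -> { throw new UnknownHostException(host); };
           ConnectionStatesSketch s = new ConnectionStatesSketch(failing);
           try {
               initiateConnectUnguarded(s, "0", "unresolvable.invalid");
           } catch (IllegalStateException e) {
               System.out.println("unguarded: " + e.getMessage());
           }
           System.out.println("guarded connected: "
                   + initiateConnectGuarded(s, "0", "unresolvable.invalid"));
       }
   }
   ```
   
   In this model the unguarded path escapes with an unchecked 
`IllegalStateException`, which is the kind of exception that would terminate a 
worker thread if nothing upstream catches it. The guarded path reports the 
failed connection attempt and leaves the caller free to retry.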
   
   This PR includes a unit test which reproduces the original issue (matching 
stacktrace) and verifies the fix.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> KafkaAdminClient loses worker thread/enters zombie state when initial DNS 
> lookup fails
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7974
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7974
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Nicholas Parker
>            Priority: Major
>
> Version: kafka-clients-2.1.0
> I have some code that creates a KafkaAdminClient instance and then 
> invokes listTopics(). I was seeing the following stacktrace in the logs, 
> after which the KafkaAdminClient instance became unresponsive:
> {code:java}
> ERROR [kafka-admin-client-thread | adminclient-1] 2019-02-18 01:00:45,597 KafkaThread.java:51 - Uncaught exception in thread 'kafka-admin-client-thread | adminclient-1':
> java.lang.IllegalStateException: No entry found for connection 0
>     at org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:330)
>     at org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:134)
>     at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:921)
>     at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:287)
>     at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.sendEligibleCalls(KafkaAdminClient.java:898)
>     at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1113)
>     at java.lang.Thread.run(Thread.java:748){code}
> From looking at the code I was able to trace down a possible cause:
>  * NetworkClient.ready() invokes this.initiateConnect() as seen in the above 
> stacktrace
>  * NetworkClient.initiateConnect() invokes 
> ClusterConnectionStates.connecting(), which internally invokes 
> ClientUtils.resolve() to resolve the host when creating an entry for the 
> connection.
>  * If this host lookup fails, an UnknownHostException can be thrown back to 
> NetworkClient.initiateConnect() and the connection entry is not created in 
> ClusterConnectionStates. This exception doesn't get logged, so this is a guess 
> on my part.
>  * NetworkClient.initiateConnect() catches the exception and attempts to call 
> ClusterConnectionStates.disconnected(), which throws an IllegalStateException 
> because no entry had yet been created due to the lookup failure.
>  * This IllegalStateException ends up killing the worker thread and 
> KafkaAdminClient gets stuck, never returning from listTopics().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
