[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230715#comment-14230715
 ] 

Ewen Cheslack-Postava commented on KAFKA-1642:
----------------------------------------------

Attached a new patch that fixes all the timeout issues I'm aware of. Here's how 
it addresses each of the situations I listed earlier:

1. lastNoNodeAvailableMs is updated, which forces metadata timeout for each 
poll to be  use a backoff period

2. Added another backoff value based on metadataFetchInProgress. Since the 
request actually made it out, this can be arbitrarily large -- we just need to 
see some sort of response or failure for the request.

3. Requires some response to arrive to clear out space, so we can wait 
arbitrarily long. Updating lastNoNodeAvailableMs works even though it may wake 
up sooner than necessary. But making all cases that didn't send the data use a 
single approach keeps the code simpler.

4a. This can happen if, e.g., the network interface has been taken down 
entirely. After fixing the ordering of marking the node as connecting and 
issuing the request, this cleans up after that error cleanly. 
lastNoNodeAvailableMs is updated since there *weren't* any nodes available. 
This triggers a backoff period where the connection won't be retried.

4b. This can be handled in the same way - we set lastNoNodeAvailableMs whether 
or not we immediately saw an error from the connection request. This causes it 
to sleep while waiting for the connection request. This may wake up before we 
get connected. However, if the node is still in the connecting state, it'll be 
ignored during the next round and we'll either start trying to connect to 
another node or we'll end up in state 1 with no nodes available. Either way, we 
still only wake up periodically based on the this timeout.

5. Looking more carefully at how leastLoadedNode works, this case isn't 
actually possible.

One additional note -- apparently you can't use Long.MAX_VALUE as a timeout, it 
throws an exception. That's why Integer.MAX_VALUE is there instead. We could 
also detect the large value and convert it to a negative value instead, which 
the underlying API treats as having no timeout.

[~Bmis13] can you test this out for the failure modes you found?

> [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
> connection is lost
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1642
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1642
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer 
>    Affects Versions: 0.8.2
>            Reporter: Bhavesh Mistry
>            Assignee: Ewen Cheslack-Postava
>            Priority: Blocker
>             Fix For: 0.8.2
>
>         Attachments: 
> 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
> KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
> KAFKA-1642_2014-10-23_16:19:41.patch
>
>
> I see my CPU spike to 100% when network connection is lost for while.  It 
> seems network  IO thread are very busy logging following error message.  Is 
> this expected behavior ?
> 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
> org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
> producer I/O thread: 
> java.lang.IllegalStateException: No entry found for node -2
> at 
> org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
> at 
> org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
> at 
> org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
> at 
> org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
> at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
> at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
> at java.lang.Thread.run(Thread.java:744)
> Thanks,
> Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to