[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140868#comment-14140868
 ] 

Jay Kreps commented on KAFKA-1642:
----------------------------------

The intended behavior is that the client will periodically attempt to reconnect 
and update metadata until either it can reconnect or it discovers that a new 
node has taken over leadership for the given partition.

There are two things that could be going on here: (1) our default backoffs 
could be too low or (2) the network selector could be busy waiting. The 
backoffs are controlled by reconnect.backoff.ms and retry.backoff.ms. 
reconnect.backoff.ms controls the amount of time to wait after the last 
connection attempt (whether successful or unsuccessful) before trying to make 
another connection attempt--this avoids trying to connect over and over again. 
This seems to default to only 10ms. The retry.backoff.ms controls the amount of 
time we wait before attempting to update the metadata. This defaults to 100ms.
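
To make the reconnect backoff concrete, here is a minimal sketch of the kind of check it implies. This is not the actual NetworkClient code; the class name ReconnectBackoffSketch and its methods are hypothetical, and it assumes the caller supplies the current time in milliseconds.

```java
// Hypothetical illustration only; not taken from the Kafka code base.
public class ReconnectBackoffSketch {
    private final long reconnectBackoffMs;
    private long lastAttemptMs;
    private boolean attempted;

    public ReconnectBackoffSketch(long reconnectBackoffMs) {
        this.reconnectBackoffMs = reconnectBackoffMs;
    }

    // Record a connection attempt, whether it succeeded or failed.
    public void recordAttempt(long nowMs) {
        lastAttemptMs = nowMs;
        attempted = true;
    }

    // Time left before another attempt is allowed; 0 means "connect now".
    public long remainingDelayMs(long nowMs) {
        if (!attempted) {
            return 0L;
        }
        long elapsed = nowMs - lastAttemptMs;
        return Math.max(0L, reconnectBackoffMs - elapsed);
    }

    public boolean canConnect(long nowMs) {
        return remainingDelayMs(nowMs) == 0L;
    }

    public static void main(String[] args) {
        ReconnectBackoffSketch backoff = new ReconnectBackoffSketch(10L);
        backoff.recordAttempt(100L);
        System.out.println(backoff.canConnect(105L));       // false
        System.out.println(backoff.remainingDelayMs(105L)); // 5
        System.out.println(backoff.canConnect(110L));       // true
    }
}
```

In a loop built this way, the remaining delay is what the I/O thread would block for. If the loop instead polled with a zero timeout while no connection was allowed, it would spin, which is the busy-waiting case described in (2).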

Alternatively, [~guozhang] previously found and fixed a bug in the network 
selector that led to busy waiting. Maybe there is another bug like that.

Would you be willing to try setting the two backoffs to something high and see 
if you can still reproduce the problem? Ideally we would have a short piece of 
code that reproduces this and that we could use for testing.
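
For reference, a sketch of what raising the two backoffs could look like on the producer side. The 1000 ms values and the class name HighBackoffConfig are arbitrary choices for illustration, and the broker address is a placeholder. The sketch only builds the Properties object; passing it to the producer constructor, along with whatever other settings your setup needs, is left out.

```java
import java.util.Properties;

// Hypothetical helper for an experiment; values are arbitrary.
public class HighBackoffConfig {

    public static Properties highBackoffProps(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Wait longer between connection attempts to the same node.
        props.setProperty("reconnect.backoff.ms", "1000");
        // Wait longer before retrying, including metadata updates.
        props.setProperty("retry.backoff.ms", "1000");
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" is a placeholder broker address.
        Properties props = highBackoffProps("localhost:9092");
        props.list(System.out);
    }
}
```

If CPU usage still goes to 100% with the backoffs this high, that would point toward the busy-waiting explanation rather than the defaults being too low.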

> [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
> connection is lost
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1642
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1642
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer 
>    Affects Versions: 0.8.2
>            Reporter: Bhavesh Mistry
>            Assignee: Jun Rao
>
> I see my CPU spike to 100% when the network connection is lost for a while.  
> It seems the network IO thread is very busy logging the following error 
> message.  Is this expected behavior?
> 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
> org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
> producer I/O thread: 
> java.lang.IllegalStateException: No entry found for node -2
> at 
> org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
> at 
> org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
> at 
> org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
> at 
> org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
> at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
> at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
> at java.lang.Thread.run(Thread.java:744)
> Thanks,
> Bhavesh



