[ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232407#comment-14232407 ]

Ewen Cheslack-Postava commented on KAFKA-1642:
----------------------------------------------

[~Bmis13], responses to each item:

1. I'm specifically trying to address the CPU usage here. I realize that from 
your perspective they are closely related, since both can be triggered by a 
loss of network connectivity, but internally they're really separate issues -- 
the CPU usage has to do with incorrect timeouts, and the join() issue is due to 
the lack of timeouts on produce operations. That's why I pointed you toward 
KAFKA-1788. If a timeout is added for data in the producer, that would resolve 
the close issue as well, since any data waiting in the producer would 
eventually time out and the IO thread could exit. I think that's the cleanest 
solution since it solves both problems with a single setting (the amount of 
time you're willing to wait before discarding data). If you think a separate 
timeout specifically for Producer.close() is worthwhile, I'd suggest filing a 
separate JIRA for that.
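
In the meantime, if you need a bounded shutdown today, a purely 
application-side workaround is to run close() on its own thread and give up 
after a deadline. This is just a sketch, not part of the producer API; it 
assumes the 0.8.2 new producer (where KafkaProducer.close() takes no timeout) 
and that "producer" and "log" exist in the surrounding code:

{code:java}
// Application-level bounded close (sketch). The 30s deadline is arbitrary.
Thread closer = new Thread(new Runnable() {
    public void run() {
        producer.close();   // may block while unsent records remain
    }
}, "producer-closer");
closer.start();
try {
    closer.join(30000);     // wait at most 30s for the IO thread to drain
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
if (closer.isAlive()) {
    // records are still stuck; log and move on rather than hanging shutdown
    log.warn("Producer did not close within the deadline; abandoning close()");
}
{code}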

2. I think the measure() implementation you have is incorrect; it looks like it 
reports the amount of time since the last metadata update. In any case, I think 
you'd want a sensor there so it would take care of computing the rate for you. 
Is the use case for this just detecting the CPU spike? I suppose providing the 
average iteration rate can at least tell you the source of CPU usage compared 
to just monitoring CPU at the system level. But (as described below) it's not 
necessarily bad for the rate to be quite high, and an average also doesn't tell 
you much except maybe in the extreme case that we're fixing with this patch. 
Before just adding another metric, maybe we should think through exactly what 
you're trying to measure/monitor and how best to reveal that information?
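
To make the question concrete, here is a hand-rolled sketch of what an 
"iteration rate" would actually measure. This is deliberately not the Kafka 
Sensor/Rate API, just an illustration; runOnce() and running are stand-ins for 
one pass of the Sender loop and its shutdown flag:

{code:java}
// Hand-rolled iteration-rate bookkeeping, purely illustrative. A Sensor with a
// Rate stat would do the equivalent windowed accounting for you.
long windowStartNs = System.nanoTime();
long iterations = 0;
while (running) {
    runOnce();                                   // one pass of the IO loop (stand-in)
    iterations++;
    long nowNs = System.nanoTime();
    if (nowNs - windowStartNs >= java.util.concurrent.TimeUnit.SECONDS.toNanos(30)) {
        double perSec = iterations / ((nowNs - windowStartNs) / 1e9);
        // A high rate here is normal when the producer is saturated with work;
        // only a high rate with nothing to send points at the timeout bug.
        System.out.printf("io loop rate: %.1f iterations/sec%n", perSec);
        iterations = 0;
        windowStartNs = nowNs;
    }
}
{code}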

3. I probably should have given a better explanation earlier for why that 
approach is problematic. In a lot of cases, you *want* run() to use small 
timeouts. If you're pumping a lot of data through, saturating your network 
connection, then you're likely going to need to poll very fast and end up 
with consistently small timeouts (e.g. if you produce to many topics and 
consistently have data flowing through them, you'll end up using the linger 
period and your timeout will generally be a fraction of that period). In other 
words, if we've actually given the producer a lot of work to do, then we should 
expect that it will use short timeouts and eat up CPU. You found that you could 
work around your particular issue by backing off, but that breaks another very 
important use case -- high throughput to the producer. I bet that if you ran 
performance tests with it enabled you would find diminished performance. 
Fundamentally, the bug wasn't that we were consistently computing small 
timeouts; it was that we were computing *incorrect* timeouts. If we get the 
timeouts computed correctly, then we'll properly support both cases.
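
To make the "small vs. incorrect" distinction concrete: the poll timeout is 
essentially the time until the next batch becomes sendable, and that is 
legitimately tiny under load. A simplified sketch follows; BatchInfo and 
pendingBatches are stand-ins rather than the real RecordAccumulator types, and 
the real logic also accounts for retries, backoff, and full batches:

{code:java}
import java.util.List;

class PollTimeoutSketch {
    static class BatchInfo { long createdMs; }

    static long pollTimeout(List<BatchInfo> pendingBatches, long lingerMs, long nowMs) {
        long timeout = Long.MAX_VALUE;                 // nothing to do => sleep a long time
        for (BatchInfo batch : pendingBatches) {
            long sendableAt = batch.createdMs + lingerMs;
            timeout = Math.min(timeout, Math.max(0, sendableAt - nowMs));
        }
        // Under heavy traffic some batch is almost always about to hit its linger
        // deadline, so this is a small fraction of linger.ms -- expected, not a bug.
        // The bug was computing *wrong* timeouts, e.g. near-zero values when there
        // was no work at all, which spins the loop.
        return timeout;                                // value handed to NetworkClient.poll()
    }
}
{code}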

4. See explanation in 3. The rate of this loop is controlled by correct 
computation of timeouts. I think the only case this code could be an issue is 
if we compute correct timeouts, but then consistently see exceptions that are 
only caught by this block. That shouldn't be happening consistently, and if you 
find an example where it is then we should catch that exception deeper in the 
call stack and handle it more gracefully.

5. The thread still has to be available so that when send() is called again it 
can initiate a connection. What should happen now (after my patch) is that if 
there is no work to do, we still run the loop but with a very large timeout, so 
it basically won't be using any CPU. If we get a request to send data, a 
wakeup() call wakes the thread back up before the timeout expires so it can 
start sending data immediately.
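
The select/wakeup interaction looks roughly like this. This sketch uses a 
plain java.nio.channels.Selector for illustration; the producer actually wraps 
one inside org.apache.kafka.common.network.Selector:

{code:java}
import java.io.IOException;
import java.nio.channels.Selector;

class WakeupSketch {
    private final Selector selector;

    WakeupSketch() throws IOException {
        this.selector = Selector.open();
    }

    // IO thread: with nothing queued the computed timeout is huge, so the thread
    // parks in select() and uses essentially no CPU.
    void ioLoopIteration() throws IOException {
        selector.select(60 * 60 * 1000L);   // illustrative "very large" timeout
        // ...process any ready keys / newly queued sends here...
    }

    // Caller thread: send() queues a record and then wakes the selector so the
    // blocked select() returns immediately and sending starts right away.
    void onSend() {
        selector.wakeup();
    }
}
{code}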

6. It should be handled gracefully, and should have been fixed by the original 
patch (although maybe the fix to the ordering in initiateConnect in the second 
patch was also necessary). The logic in Sender.run() adjusts timeouts when it 
finds nodes that aren't ready to send data (i.e. are disconnected) and makes 
sure we backoff when connection attempts fail. This is a fixed backoff 
(reconnect.backoff.ms, default 10ms), but is enough to avoid high CPU usage.
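
If the default cadence of connection attempts is noisier than you'd like while 
brokers are unreachable, that backoff is configurable. A sketch of the 0.8.2 
new producer properties; the host names and the 250ms value are illustrative:

{code:java}
import java.util.Properties;

class ProducerBackoffConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("reconnect.backoff.ms", "250");   // default is 10ms
        return props;
    }
}
{code}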


> [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
> connection is lost
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1642
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1642
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer 
>    Affects Versions: 0.8.2
>            Reporter: Bhavesh Mistry
>            Assignee: Ewen Cheslack-Postava
>            Priority: Blocker
>             Fix For: 0.8.2
>
>         Attachments: 
> 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
> KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
> KAFKA-1642_2014-10-23_16:19:41.patch
>
>
> I see my CPU spike to 100% when the network connection is lost for a while. 
> It seems the network IO threads are very busy logging the following error 
> message. Is this expected behavior?
> 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
> org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
> producer I/O thread: 
> java.lang.IllegalStateException: No entry found for node -2
> at 
> org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
> at 
> org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
> at 
> org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
> at 
> org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
> at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
> at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
> at java.lang.Thread.run(Thread.java:744)
> Thanks,
> Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
