[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531337#comment-13531337 ]

Neha Narkhede commented on KAFKA-664:
-------------------------------------

Thanks for patch v3, Joel! It works correctly. Here are a few review 
suggestions (a sketch of all three follows below) -

1. Can we bring back the exact count of delayed requests sitting in the 
purgatory? Your patch removed that and added a metric for the total size of 
the purgatory, which is good to have as well, but its inaccuracy grows the 
longer it has been since the last cleanup.
2. It seems like the cleanup is happening on the request handler thread. Can it 
happen on the background expiry reaper thread instead? I understand the 
latency impact is minimal, but it still seems like a good idea to put the extra 
processing on a background thread, which we already have.
3. CleanupThreshold can be hard coded in the future. Until then, it will be 
very convenient to have it configurable for performance tuning :)
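
To make the suggestions concrete, here is a rough sketch of the shape I have in 
mind. All names (DelayedPurgatory, purgeThreshold, numDelayed, 
maybePurgeSatisfied) are made up for illustration and are not the classes or 
metrics in KAFKA-664-v3.patch:

import java.util.concurrent.{ConcurrentHashMap, ConcurrentLinkedQueue}
import java.util.concurrent.atomic.AtomicInteger

// Rough sketch only -- hypothetical names, not the code in the patch.
class DelayedPurgatory[T](purgeThreshold: Int /* suggestion 3: read from config */) {

  private val watchers = new ConcurrentHashMap[Any, ConcurrentLinkedQueue[T]]()

  // Suggestion 1: keep an exact counter of requests sitting in the watcher
  // queues, so a gauge can report the count precisely instead of approximating
  // it from the total purgatory size between cleanups. A complete version would
  // also decrement this wherever a request is removed on completion or expiration.
  private val delayedCount = new AtomicInteger(0)

  def watch(key: Any, op: T): Unit = {
    var queue = watchers.get(key)
    if (queue == null) {
      watchers.putIfAbsent(key, new ConcurrentLinkedQueue[T]())
      queue = watchers.get(key)
    }
    queue.add(op)
    delayedCount.incrementAndGet()
  }

  // Exposed as the "exact count of delayed requests" metric.
  def numDelayed: Int = delayedCount.get

  // Suggestion 2: call this from the existing background expiry reaper loop,
  // not from the request handler threads, so the full sweep over the watcher
  // queues stays off the request latency path.
  def maybePurgeSatisfied(isSatisfied: T => Boolean): Unit = {
    if (delayedCount.get > purgeThreshold) {
      val queueIter = watchers.values().iterator()
      while (queueIter.hasNext) {
        val opIter = queueIter.next().iterator()
        while (opIter.hasNext) {
          if (isSatisfied(opIter.next())) {
            opIter.remove()
            delayedCount.decrementAndGet()
          }
        }
      }
    }
  }
}

Since the expiry reaper already wakes up periodically to expire requests, it 
could simply call maybePurgeSatisfied() at the end of each pass, with 
purgeThreshold coming from the server config until we settle on a value to 
hard code.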


                
> Kafka server threads die due to OOME during long running test
> -------------------------------------------------------------
>
>                 Key: KAFKA-664
>                 URL: https://issues.apache.org/jira/browse/KAFKA-664
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Jay Kreps
>            Priority: Blocker
>              Labels: bugs
>             Fix For: 0.8
>
>         Attachments: kafka-664-draft-2.patch, kafka-664-draft.patch, 
> KAFKA-664-v3.patch, Screen Shot 2012-12-09 at 11.22.50 AM.png, Screen Shot 
> 2012-12-09 at 11.23.09 AM.png, Screen Shot 2012-12-09 at 11.31.29 AM.png, 
> thread-dump.log, watchersForKey.png
>
>
> I set up a Kafka cluster with 5 brokers (JVM memory 512M) and set up a long 
> running producer process that sends data to 100s of partitions continuously 
> for ~15 hours. After ~4 hours of operation, a few server threads (acceptor and 
> processor) exited due to OOME -
> [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-acceptor': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-processor-9092-1': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, 
> session 0x13afd0753870103 has expired, closing socket connection 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) 
> (org.I0Itec.zkclient.ZkClient)
> [2012-12-07 08:24:46,344] INFO Initiating client connection, 
> connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913
>  sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 
> (org.apache.zookeeper.ZooKeeper)
> [2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread 
> 'kafka-request-handler-0': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:08,739] INFO Opening socket connection to server 
> eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:14,221] INFO Socket connection established to 
> eat1-app311.corp/172.20.72.75:12913, initiating session 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from 
> server in 3722ms for sessionid 0x0, closing socket connection and attempting 
> reconnect (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> It seems like it runs out of memory while trying to read the producer 
> request, but it's unclear so far. 
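
For context on the "ERROR OOME with size ..." lines: the receive path reads a 
4-byte size prefix off the socket and then allocates a buffer of that size, so 
whatever has filled the heap, the failure surfaces at that allocation. Below is 
a minimal sketch of that pattern, not the actual BoundedByteBufferReceive code; 
the maxRequestSize guard and all names are illustrative assumptions:

import java.nio.ByteBuffer
import java.nio.channels.ReadableByteChannel

object SizePrefixedReceive {
  // Illustrative sketch of a size-prefixed read; not Kafka's implementation.
  def readSizePrefixed(channel: ReadableByteChannel, maxRequestSize: Int): ByteBuffer = {
    // Read the 4-byte size prefix.
    val sizeBuffer = ByteBuffer.allocate(4)
    while (sizeBuffer.hasRemaining && channel.read(sizeBuffer) >= 0) {}
    sizeBuffer.flip()
    val size = sizeBuffer.getInt()

    // Bound the allocation: a request of ~1.6 GB (size 1700161893) can never
    // fit in a 512 MB heap, whether the size field is garbage or the heap is
    // already exhausted by something else (e.g. a growing purgatory).
    if (size <= 0 || size > maxRequestSize)
      throw new IllegalArgumentException("Invalid request size: " + size)

    val contentBuffer = ByteBuffer.allocate(size) // the allocate() that throws OOME
    while (contentBuffer.hasRemaining && channel.read(contentBuffer) >= 0) {}
    contentBuffer.flip()
    contentBuffer
  }
}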

