[jira] [Comment Edited] (KAFKA-664) Kafka server threads die due to OOME during long running test

Neha Narkhede (JIRA) Fri, 07 Dec 2012 17:57:23 -0800

    [ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526973#comment-13526973
 ]


Neha Narkhede edited comment on KAFKA-664 at 12/8/12 1:56 AM:
--------------------------------------------------------------

The problem was ever increasing requests in the watchersForKey map. Please look 
at the graph attached. In merely 40 minutes of running the broker, the number 
of requests in the purgatory map shot upto 4 million.
This can happen for very low volume topics since the replica fetcher requests 
keep entering this map, and since there are no more produce requests coming for 
those topics/partitions, no one ever removes those requests from the map. 

With Joel's help, hacked RequestPurgatory to force the cleanup of 
expired/satisfied requests by the expiry thread inside purgeSatisfied. Of 
course, a better solution is re-designing the purgatory data structure to point 
from the queue to the map, but that is a bigger change. I just want to get 
around this issue and continue performance testing.

                
      was (Author: nehanarkhede):
    The problem was ever increasing requests in the watchersForKey map. Please 
look at the graph attached.
This can happen for very low volume topics since the replica fetcher requests 
keep entering this map, and since there are no more produce requests coming for 
those topics/partitions, no one ever removes those requests from the map. 

With Joel's help, hacked RequestPurgatory to force the cleanup of 
expired/satisfied requests by the expiry thread inside purgeSatisfied. Of 
course, a better solution is re-designing the purgatory data structure to point 
from the queue to the map, but that is a bigger change. I just want to get 
around this issue and continue performance testing.

                  
> Kafka server threads die due to OOME during long running test
> -------------------------------------------------------------
>
>                 Key: KAFKA-664
>                 URL: https://issues.apache.org/jira/browse/KAFKA-664
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Jay Kreps
>            Priority: Blocker
>              Labels: bugs
>             Fix For: 0.8
>
>         Attachments: kafka-664-draft.patch, thread-dump.log, 
> watchersForKey.png
>
>
> I set up a Kafka cluster with 5 brokers (JVM memory 512M) and set up a long 
> running producer process that sends data to 100s of partitions continuously 
> for ~15 hours. After ~4 hours of operation, few server threads (acceptor and 
> processor) exited due to OOME -
> [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-acceptor': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-processor-9092-1': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, 
> session 0x13afd0753870103 has expired, closing socket connection 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) 
> (org.I0Itec.zkclient.ZkClient)
> [2012-12-07 08:24:46,344] INFO Initiating client connection, 
> connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913
>  sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 
> (org.apache.zookeeper.ZooKeeper)
> [2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread 
> 'kafka-request-handler-0': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:08,739] INFO Opening socket connection to server 
> eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:14,221] INFO Socket connection established to 
> eat1-app311.corp/172.20.72.75:12913, initiating session 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from 
> server in 3722ms for sessionid 0x0, closing socket connection and attempting 
> reconnect (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> It seems like it runs out of memory while trying to read the producer 
> request, but its unclear so far. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Comment Edited] (KAFKA-664) Kafka server threads die due to OOME during long running test

Reply via email to