[jira] [Updated] (KAFKA-581) provides windows batch script for starting Kafka/Zookeeper
[ https://issues.apache.org/jira/browse/KAFKA-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

antoine vianey updated KAFKA-581:
---------------------------------
Attachment: zookeeper-server-stop.bat
            kafka-server-stop.bat

Can you add these scripts to the 0.8 branch as well? This is useful when starting ZooKeeper and Kafka from the maven-antrun plugin (or another plugin that creates a separate process): ZooKeeper and Storm can be started and stopped in the pre- and post-integration-test phases.

Regards

> provides windows batch script for starting Kafka/Zookeeper
> ----------------------------------------------------------
>
> Key: KAFKA-581
> URL: https://issues.apache.org/jira/browse/KAFKA-581
> Project: Kafka
> Issue Type: Improvement
> Components: config
> Affects Versions: 0.8
> Environment: Windows
> Reporter: antoine vianey
> Priority: Trivial
> Labels: features, run, windows
> Fix For: 0.8
>
> Attachments: kafka-console-consumer.bat, kafka-console-producer.bat,
> kafka-run-class.bat, kafka-server-start.bat, kafka-server-stop.bat, sbt.bat,
> zookeeper-server-start.bat, zookeeper-server-stop.bat
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Provide a port for quickstarting Kafka dev on Windows:
> - kafka-run-class.bat
> - kafka-server-start.bat
> - zookeeper-server-start.bat
> This will help the Kafka community grow.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (KAFKA-581) provides windows batch script for starting Kafka/Zookeeper
[ https://issues.apache.org/jira/browse/KAFKA-581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526440#comment-13526440 ]

antoine vianey edited comment on KAFKA-581 at 12/7/12 3:11 PM:
---------------------------------------------------------------
*Added the stop scripts*

Can you add these scripts to the 0.8 branch as well? This is useful when starting ZooKeeper and Kafka from the maven-antrun plugin (or another plugin that creates a separate process): ZooKeeper and Storm can be started and stopped in the pre- and post-integration-test phases.

Regards

was (Author: avianey):
Can you add these scripts to the 0.8 branch as well? This is useful when starting ZooKeeper and Kafka from the maven-antrun plugin (or another plugin that creates a separate process): ZooKeeper and Storm can be started and stopped in the pre- and post-integration-test phases.

Regards
[jira] [Comment Edited] (KAFKA-581) provides windows batch script for starting Kafka/Zookeeper
[ https://issues.apache.org/jira/browse/KAFKA-581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526440#comment-13526440 ]

antoine vianey edited comment on KAFKA-581 at 12/7/12 3:11 PM:
---------------------------------------------------------------
*Added the server-stop scripts*

Can you add these scripts to the 0.8 branch as well? This is useful when starting ZooKeeper and Kafka from the maven-antrun plugin (or another plugin that creates a separate process): ZooKeeper and Storm can be started and stopped in the pre- and post-integration-test phases.

Regards

was (Author: avianey):
*Added the stop scripts*

Can you add these scripts to the 0.8 branch as well? This is useful when starting ZooKeeper and Kafka from the maven-antrun plugin (or another plugin that creates a separate process): ZooKeeper and Storm can be started and stopped in the pre- and post-integration-test phases.

Regards
0.8/HEAD Console consumer breakage?
So I was testing my own code, and using the console consumer against my seemingly-working-producer code. Since the last update, the console consumer crashes. I am going to try to track it down in the debugger and will come back with a patch if found.
Re: 0.8/HEAD Console consumer breakage?
Dah. Misfire. Please ignore if this makes it to an inbox ;)

-b

On Fri, Dec 7, 2012 at 4:13 PM, ben fleis wrote:
> So I was testing my own code, and using the console consumer against my
> seemingly-working-producer code. Since the last update, the console
> consumer crashes. I am going to try to track it down in the debugger and
> will come back with a patch if found.
Re: 0.8 Protocol Status
I was testing my own code, and using the console consumer against my seemingly-working-producer code. Since the last update, the console consumer crashes. I am going to try to track it down in the debugger and will come back with a patch if found.

Command line:

KAFKA_OPTS="-Xmx512M -server -Dlog4j.configuration=file:$PWD/config/log4j.properties -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=4244" \
bin/kafka-console-consumer.sh config/consumer.properties --zookeeper localhost:2181 --topic types

Stacktrace:

[2012-12-07 16:11:34,421] ERROR Error processing message, stopping consumer: (kafka.consumer.ConsoleConsumer$)
java.lang.IllegalArgumentException
        at java.nio.Buffer.limit(Buffer.java:247)
        at kafka.message.Message.sliceDelimited(Message.scala:225)
        at kafka.message.Message.payload(Message.scala:207)
        at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:110)
        at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:33)
        at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:61)
        at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:53)
        at scala.collection.Iterator$class.foreach(Iterator.scala:631)
        at kafka.utils.IteratorTemplate.foreach(IteratorTemplate.scala:32)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:79)
        at kafka.consumer.KafkaStream.foreach(KafkaStream.scala:25)
        at kafka.consumer.ConsoleConsumer$.main(ConsoleConsumer.scala:189)
        at kafka.consumer.ConsoleConsumer.main(ConsoleConsumer.scala)

All advice gladly accepted, including "you blew it, you blind fool!" ;)

-b
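[Editor's note] A minimal standalone sketch (not the Kafka source) of the failure mode in the trace above: Message.sliceDelimited reads a 4-byte size field and then narrows the backing buffer via Buffer.limit, which throws a bare IllegalArgumentException when the size field is larger than the buffer's capacity, consistent with a corrupt or mis-framed message. The class name and buffer contents here are illustrative.

```java
import java.nio.ByteBuffer;

// Sketch of why Buffer.limit(Buffer.java:247) throws a bare
// IllegalArgumentException: a bogus payload-size field makes the
// requested limit exceed the buffer's capacity.
public class SliceDelimitedSketch {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putInt(1000);            // corrupt size field: claims 1000 payload bytes
        buf.flip();
        int size = buf.getInt();     // reads 1000
        try {
            buf.limit(buf.position() + size);  // 4 + 1000 > capacity 16
        } catch (IllegalArgumentException e) {
            System.out.println("IllegalArgumentException from Buffer.limit");
        }
    }
}
```

This points at the payload size field on disk or on the wire being wrong, which is why checking the broker's log with a dump tool (as suggested downthread) is a sensible next step.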
[jira] [Created] (KAFKA-662) Create testcases for unclean shut down
John Fung created KAFKA-662:
----------------------------
Summary: Create testcases for unclean shut down
Key: KAFKA-662
URL: https://issues.apache.org/jira/browse/KAFKA-662
Project: Kafka
Issue Type: Bug
Reporter: John Fung
Assignee: John Fung
[jira] [Created] (KAFKA-663) Add "deploy" feature to System Test
John Fung created KAFKA-663:
----------------------------
Summary: Add "deploy" feature to System Test
Key: KAFKA-663
URL: https://issues.apache.org/jira/browse/KAFKA-663
Project: Kafka
Issue Type: Task
Reporter: John Fung
Assignee: John Fung
[jira] [Updated] (KAFKA-662) Create testcases for unclean shut down
[ https://issues.apache.org/jira/browse/KAFKA-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Fung updated KAFKA-662:
----------------------------
Issue Type: Task (was: Bug)

> Create testcases for unclean shut down
> --------------------------------------
>
> Key: KAFKA-662
> URL: https://issues.apache.org/jira/browse/KAFKA-662
> Project: Kafka
> Issue Type: Task
> Reporter: John Fung
> Assignee: John Fung
Re: 0.8 Protocol Status
Ben,

I would try running DumpLogSegments to check whether the server's data was corrupted by a bug in the producer.

Thanks,
Neha

On Fri, Dec 7, 2012 at 7:17 AM, ben fleis wrote:
> I was testing my own code, and using the console consumer against my
> seemingly-working-producer code. Since the last update, the console
> consumer crashes. I am going to try to track it down in the debugger and
> will come back with a patch if found.
>
> Command line:
> KAFKA_OPTS="-Xmx512M -server
> -Dlog4j.configuration=file:$PWD/config/log4j.properties -Xdebug
> -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=4244" \
> bin/kafka-console-consumer.sh config/consumer.properties --zookeeper
> localhost:2181 --topic types
>
> Stacktrace:
> [2012-12-07 16:11:34,421] ERROR Error processing message, stopping
> consumer: (kafka.consumer.ConsoleConsumer$)
> java.lang.IllegalArgumentException
>         at java.nio.Buffer.limit(Buffer.java:247)
>         at kafka.message.Message.sliceDelimited(Message.scala:225)
>         at kafka.message.Message.payload(Message.scala:207)
>         at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:110)
>         at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:33)
>         at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:61)
>         at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:53)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:631)
>         at kafka.utils.IteratorTemplate.foreach(IteratorTemplate.scala:32)
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:79)
>         at kafka.consumer.KafkaStream.foreach(KafkaStream.scala:25)
>         at kafka.consumer.ConsoleConsumer$.main(ConsoleConsumer.scala:189)
>         at kafka.consumer.ConsoleConsumer.main(ConsoleConsumer.scala)
>
> All advice gladly accepted, including "you blew it, you blind fool!" ;)
>
> -b
[jira] [Created] (KAFKA-664) Kafka server threads die due to OOME during long running test
Neha Narkhede created KAFKA-664:
--------------------------------
Summary: Kafka server threads die due to OOME during long running test
Key: KAFKA-664
URL: https://issues.apache.org/jira/browse/KAFKA-664
Project: Kafka
Issue Type: Bug
Affects Versions: 0.8
Reporter: Neha Narkhede
Priority: Blocker
Fix For: 0.8

I set up a Kafka cluster with 5 brokers (JVM memory 512M) and a long-running producer process that sends data to 100s of partitions continuously for ~15 hours. After ~4 hours of operation, a few server threads (acceptor and processor) exited due to OOME:

[2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 (kafka.network.BoundedByteBufferReceive)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 'kafka-acceptor': (kafka.utils.Utils$)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 'kafka-processor-9092-1': (kafka.utils.Utils$)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, session 0x13afd0753870103 has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
[2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient)
[2012-12-07 08:24:46,344] INFO Initiating client connection, connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913 sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 (org.apache.zookeeper.ZooKeeper)
[2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 (kafka.network.BoundedByteBufferReceive)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:25:01,192] ERROR Uncaught exception in thread 'kafka-request-handler-0': (kafka.utils.Utils$)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:25:08,739] INFO Opening socket connection to server eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn)
[2012-12-07 08:25:14,221] INFO Socket connection established to eat1-app311.corp/172.20.72.75:12913, initiating session (org.apache.zookeeper.ClientCnxn)
[2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from server in 3722ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 (kafka.network.BoundedByteBufferReceive)
java.lang.OutOfMemoryError: Java heap space

It seems like it runs out of memory while trying to read the producer request, but it's unclear so far.
[jira] [Updated] (KAFKA-664) Kafka server threads die due to OOME during long running test
[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede updated KAFKA-664:
--------------------------------
Attachment: thread-dump.log

Attaching a thread dump that shows:
1. 4 processor threads and the acceptor thread are dead.
2. The rest of the processor threads have a full request queue and are waiting to add to it.
[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test
[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526590#comment-13526590 ]

Neha Narkhede commented on KAFKA-664:
-------------------------------------
Another observation: the server is probably GCing quite a lot, since I see the following in the server logs:

[2012-12-07 09:32:14,742] INFO Client session timed out, have not heard from server in 1204905ms for sessionid 0x23afd074d6600ea, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)

The ZooKeeper session timeout is pretty high (15 secs), and it is in the same DC as the Kafka cluster and the producer.
[jira] [Updated] (KAFKA-651) Create testcases on auto create topics
[ https://issues.apache.org/jira/browse/KAFKA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Fung updated KAFKA-651:
----------------------------
Status: Patch Available (was: Open)

> Create testcases on auto create topics
> --------------------------------------
>
> Key: KAFKA-651
> URL: https://issues.apache.org/jira/browse/KAFKA-651
> Project: Kafka
> Issue Type: Task
> Reporter: John Fung
> Labels: replication-testing
> Attachments: kafka-651-v1.patch
[jira] [Updated] (KAFKA-651) Create testcases on auto create topics
[ https://issues.apache.org/jira/browse/KAFKA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Fung updated KAFKA-651:
----------------------------
Attachment: kafka-651-v1.patch

Uploaded kafka-651-v1.patch with 1 testcase to cover each functional group:
testcase_0011, testcase_0024, testcase_0119, testcase_0128, testcase_0134, testcase_0159, testcase_0209, testcase_0259, testcase_0309
[jira] [Updated] (KAFKA-374) Move to java CRC32 implementation
[ https://issues.apache.org/jira/browse/KAFKA-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Arthur updated KAFKA-374:
-------------------------------
Attachment: KAFKA-374.patch

Pure-Java CRC32 taken from Hadoop.

> Move to java CRC32 implementation
> ---------------------------------
>
> Key: KAFKA-374
> URL: https://issues.apache.org/jira/browse/KAFKA-374
> Project: Kafka
> Issue Type: New Feature
> Components: core
> Affects Versions: 0.8
> Reporter: Jay Kreps
> Priority: Minor
> Labels: newbie
> Attachments: KAFKA-374-draft.patch, KAFKA-374.patch
>
> We keep a per-record crc32. This is a fairly cheap algorithm, but the Java implementation uses JNI and seems to be a bit expensive for small records. I have seen this before in Kafka profiles, and I noticed it on another application I was working on. Basically, with small records the native implementation can only checksum < 100 MB/sec. Hadoop has done some analysis of this and replaced it with a Java implementation that is 2x faster for large values and 5-10x faster for small values; details are in HADOOP-6148. We should do a quick read/write benchmark on log and message set iteration and see if this improves things.
[jira] [Commented] (KAFKA-374) Move to java CRC32 implementation
[ https://issues.apache.org/jira/browse/KAFKA-374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526624#comment-13526624 ]

David Arthur commented on KAFKA-374:
------------------------------------
Not sure how you guys feel about having Java in the source tree, but I attached a patch with the pure Java implementation (and the other stuff from [~jkreps]'s original patch).
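[Editor's note] Whatever implementation lands, a pure-Java replacement has to match java.util.zip.CRC32 bit-for-bit. A small sanity check (class name is illustrative) against the well-known CRC-32 check value for the ASCII string "123456789":

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Baseline: the JDK's CRC32 (the JNI-backed implementation discussed in the
// ticket). Any pure-Java drop-in must produce identical values.
public class Crc32Check {
    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        // 0xCBF43926 is the standard CRC-32 (IEEE 802.3) check value for this input
        System.out.printf("%08x%n", crc.getValue());
    }
}
```

A micro-benchmark over small records (tens of bytes, as in Kafka messages) against this baseline would confirm or refute the 5-10x claim from HADOOP-6148 in this codebase.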
[jira] [Comment Edited] (KAFKA-374) Move to java CRC32 implementation
[ https://issues.apache.org/jira/browse/KAFKA-374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526624#comment-13526624 ]

David Arthur edited comment on KAFKA-374 at 12/7/12 6:38 PM:
-------------------------------------------------------------
Not sure how you guys feel about having Java in the source tree, but I attached a patch with the pure Java implementation (and the other stuff from [~jkreps] original patch).

was (Author: mumrah):
Not sure how you guys feel about having Java in the source tree, but I attached a patch with the pure Java implementation (and the other stuff from [~jkreps]'s original patch).
[jira] [Issue Comment Deleted] (KAFKA-374) Move to java CRC32 implementation
[ https://issues.apache.org/jira/browse/KAFKA-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Arthur updated KAFKA-374:
-------------------------------
Comment: was deleted
(was: Pure-java crc32 taken from Hadoop)
[jira] [Commented] (KAFKA-657) Add an API to commit offsets
[ https://issues.apache.org/jira/browse/KAFKA-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526687#comment-13526687 ]

David Arthur commented on KAFKA-657:
------------------------------------
I have started a wiki page for this design discussion: https://cwiki.apache.org/confluence/display/KAFKA/Commit+Offset+API+-+Proposal

> Add an API to commit offsets
> ----------------------------
>
> Key: KAFKA-657
> URL: https://issues.apache.org/jira/browse/KAFKA-657
> Project: Kafka
> Issue Type: New Feature
> Reporter: Jay Kreps
> Labels: project
>
> Currently consumers directly write their offsets to ZooKeeper. Two problems with this: (1) it is a poor use of ZooKeeper, and we need to replace it with a more scalable offset store, and (2) it makes it hard to carry the behavior over to clients in other languages. A first step towards accomplishing that is to add a proper Kafka API for committing offsets. The initial version would just write to ZooKeeper as we do today, but in the future we would have the option of changing this.
> This API likely needs to take a sequence of consumer-group/topic/partition/offset entries and commit them all.
> It would be good to do a wiki design of how this would work and get consensus on that first.
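[Editor's note] A hypothetical sketch of the request shape the description suggests; all names here are illustrative, not the API that was eventually committed. The point is one request per consumer group carrying a batch of topic/partition -> offset entries that are committed together:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical data shape only, not Kafka's real wire format: a commit
// request for one consumer group batching (topic, partition) -> offset
// entries, which the broker would persist (initially to ZooKeeper).
public class OffsetCommitSketch {
    final String groupId;
    // key "topic/partition" -> last consumed offset to commit
    final Map<String, Long> offsets = new LinkedHashMap<>();

    OffsetCommitSketch(String groupId) { this.groupId = groupId; }

    void add(String topic, int partition, long offset) {
        offsets.put(topic + "/" + partition, offset);
    }

    public static void main(String[] args) {
        OffsetCommitSketch req = new OffsetCommitSketch("console-consumer-1");
        req.add("types", 0, 42L);
        req.add("types", 1, 17L);
        System.out.println(req.groupId + " " + req.offsets);
    }
}
```

Batching all partitions into one request keeps the commit atomic per group and avoids one round trip (or ZooKeeper write) per partition.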
[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test
[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526713#comment-13526713 ]

Jay Kreps commented on KAFKA-664:
---------------------------------
One pain of OOM is that the thing leaking the memory is not necessarily the thing that gets the exception. Can you rerun with -XX:+HeapDumpOnOutOfMemoryError?
[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test
[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526719#comment-13526719 ] Neha Narkhede commented on KAFKA-664:
-
Heap dump is here - http://people.apache.org/~nehanarkhede/kafka-misc/kafka-0.8/heap-dump.tar.gz
Almost all the largest objects trace back to RequestPurgatory$ExpiredRequestReaper as the GC root.
[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test
[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526720#comment-13526720 ] Neha Narkhede commented on KAFKA-664:
-
I'm re-running the tests with that option now
[jira] [Assigned] (KAFKA-664) Kafka server threads die due to OOME during long running test
[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Kreps reassigned KAFKA-664:
-
Assignee: Jay Kreps
[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test
[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526805#comment-13526805 ] Jay Kreps commented on KAFKA-664:
-
Looks like the problem is in request purgatory--watchers aren't getting removed.
[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test
[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526808#comment-13526808 ] Neha Narkhede commented on KAFKA-664:
-
The root cause seems to be that the watchersForKey map keeps growing. I see that we add keys to the map, but never actually delete them.
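The pattern described above — keys added to watchersForKey but never removed — can be sketched in miniature. The following is an illustrative Java sketch, not the actual Scala RequestPurgatory code; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the leak: a watcher list is created per key on demand,
// but map entries are never removed, so the map grows with every new key.
class WatcherMap<K, W> {
    private final ConcurrentHashMap<K, List<W>> watchersForKey = new ConcurrentHashMap<>();

    void watch(K key, W watcher) {
        watchersForKey.computeIfAbsent(key, k -> new ArrayList<>()).add(watcher);
    }

    // Satisfied watchers are cleared from their list,
    // but the entry for `key` still occupies the map.
    void satisfy(K key) {
        List<W> ws = watchersForKey.get(key);
        if (ws != null) ws.clear();
    }

    // One possible fix: periodically drop keys whose watcher lists are empty.
    void purgeEmptyKeys() {
        watchersForKey.entrySet().removeIf(e -> e.getValue().isEmpty());
    }

    int keyCount() { return watchersForKey.size(); }
}
```

A real fix would also need per-list synchronization (the lists here are plain ArrayLists) and might fold the purge into an existing reaper pass such as ExpiredRequestReaper rather than running a separate scan.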
[jira] [Created] (KAFKA-665) Outgoing responses delayed on a busy Kafka broker
Neha Narkhede created KAFKA-665:
---
Summary: Outgoing responses delayed on a busy Kafka broker
Key: KAFKA-665
URL: https://issues.apache.org/jira/browse/KAFKA-665
Project: Kafka
Issue Type: Bug
Affects Versions: 0.8
Reporter: Neha Narkhede
Priority: Critical
Fix For: 0.8

In a long running test, I observed that after a few hours of operation, a few requests start timing out, mainly because they spend a very long time sitting in the response queue -
[2012-12-07 22:05:56,670] TRACE Completed request with correlation id 3965966 and client : TopicMetadataRequest:4009, queueTime:1, localTime:28, remoteTime:0, sendTime:3980 (kafka.network.RequestChannel$)
[2012-12-07 22:04:12,046] TRACE Completed request with correlation id 3962561 and client : TopicMetadataRequest:3449, queueTime:0, localTime:29, remoteTime:0, sendTime:3420 (kafka.network.RequestChannel$)
We might have a problem in the way we process outgoing responses. Basically, if the processor thread blocks on enqueuing requests in the request queue, it doesn't come around to processing the responses that are ready to go out.
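The failure mode described above — a processor thread blocking on a full request queue and therefore never draining its ready responses — can be illustrated with a small, self-contained Java sketch. All names here are hypothetical; this is not Kafka's actual SocketServer code:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Schematic single-threaded processor loop. If enqueuing a new request used a
// blocking put() on a full bounded queue, pending responses would sit unsent.
class Processor {
    private final BlockingQueue<String> requestQueue;
    private final Queue<String> pendingResponses = new ArrayDeque<>();
    private final Queue<String> wire = new ArrayDeque<>(); // stands in for the socket

    Processor(int requestQueueCapacity) {
        this.requestQueue = new ArrayBlockingQueue<>(requestQueueCapacity);
    }

    void responseReady(String response) { pendingResponses.add(response); }

    // Instead of requestQueue.put(request), which can block indefinitely,
    // retry with a timeout and flush responses between attempts.
    void handleIncoming(String request) throws InterruptedException {
        while (!requestQueue.offer(request, 10, TimeUnit.MILLISECONDS)) {
            flushResponses(); // keep responses moving even under backpressure
        }
        flushResponses();
    }

    void flushResponses() {
        String r;
        while ((r = pendingResponses.poll()) != null) wire.add(r);
    }

    int sent() { return wire.size(); }
    String takeRequest() throws InterruptedException { return requestQueue.take(); }
}
```

The key design point is that the processor never parks indefinitely on one queue while work is pending on the other.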
[jira] [Created] (KAFKA-666) Fetch requests from the replicas take several seconds to complete on the leader
Neha Narkhede created KAFKA-666:
---
Summary: Fetch requests from the replicas take several seconds to complete on the leader
Key: KAFKA-666
URL: https://issues.apache.org/jira/browse/KAFKA-666
Project: Kafka
Issue Type: Bug
Affects Versions: 0.8
Reporter: Neha Narkhede
Priority: Critical
Fix For: 0.8

I've seen that fetch requests from the replicas take several seconds to complete. The nature of the latency breakdown is different: sometimes they spend too long sitting in the request/response queue, sometimes the local/remote time is too large -
[2012-12-07 20:59:22,424] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3502, queueTime:1, localTime:1, remoteTime:3500, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 20:59:22,611] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3301, queueTime:1, localTime:3118, remoteTime:181, sendTime:1 (kafka.network.RequestChannel$)
[2012-12-07 21:13:35,042] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3981, queueTime:0, localTime:1, remoteTime:3979, sendTime:1 (kafka.network.RequestChannel$)
[2012-12-07 21:13:35,042] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4095, queueTime:1, localTime:6, remoteTime:4088, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:13:57,254] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4116, queueTime:1, localTime:1, remoteTime:4113, sendTime:1 (kafka.network.RequestChannel$)
[2012-12-07 21:13:57,300] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3844, queueTime:1, localTime:3795, remoteTime:48, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:14:19,645] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4239, queueTime:1, localTime:1, remoteTime:4236, sendTime:1 (kafka.network.RequestChannel$)
[2012-12-07 21:14:19,689] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3977, queueTime:3931, localTime:8, remoteTime:38, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:23:58,427] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3940, queueTime:1, localTime:1, remoteTime:3938, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:23:58,435] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3858, queueTime:1, localTime:6, remoteTime:3851, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:24:21,575] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4037, queueTime:0, localTime:1, remoteTime:4036, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:24:21,583] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3962, queueTime:1, localTime:4, remoteTime:3956, sendTime:1 (kafka.network.RequestChannel$)
[2012-12-07 21:24:43,965] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4294, queueTime:1, localTime:1, remoteTime:4292, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:24:44,013] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3962, queueTime:1, localTime:3919, remoteTime:41, sendTime:1 (kafka.network.RequestChannel$)
[2012-12-07 21:25:06,157] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4587, queueTime:1, localTime:1, remoteTime:4585, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:25:06,162] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4276, queueTime:1, localTime:6, remoteTime:4268, sendTime:1 (kafka.network.RequestChannel$)
[2012-12-07 21:25:28,943] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3986, queueTime:1, localTime:1, remoteTime:3984, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:25:28,953] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3940, queueTime:3929, localTime:6, remoteTime:5, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:25:51,374] TRACE Completed re
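Each TRACE line above decomposes total request time into queueTime, localTime, remoteTime, and sendTime. A small helper like the following (illustrative Java, assuming only the log format visible above) pulls those fields out so the dominant stage is easy to spot:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Parses the latency fields out of a "Completed request" TRACE line, e.g.
// "... FetchRequest:3977, queueTime:3931, localTime:8, remoteTime:38, sendTime:0 ..."
class RequestLatency {
    private static final Pattern FIELD =
            Pattern.compile("(queueTime|localTime|remoteTime|sendTime):(\\d+)");

    static Map<String, Long> parse(String logLine) {
        Map<String, Long> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(logLine);
        while (m.find()) fields.put(m.group(1), Long.parseLong(m.group(2)));
        return fields;
    }

    // Name of the stage that contributed the most time.
    static String dominantStage(String logLine) {
        return parse(logLine).entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("unknown");
    }
}
```

Applied to the entries above, most lines are remoteTime-dominated (purgatory wait), while a few are queueTime- or localTime-dominated, which matches the mixed latency breakdown described.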
[jira] [Updated] (KAFKA-666) Fetch requests from the replicas take several seconds to complete on the leader
[ https://issues.apache.org/jira/browse/KAFKA-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-666:
-
Description:
I've seen that fetch requests from the replicas take several seconds to complete. The nature of the latency breakdown is different: sometimes they spend too long sitting in the request/response queue, sometimes the local/remote time is too large -
[2012-12-07 20:14:51,233] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.45-port_9092: FetchRequest:9967, queueTime:0, localTime:4, remoteTime:9963, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 20:14:51,236] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.45-port_9092: FetchRequest:9967, queueTime:1, localTime:3, remoteTime:9963, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 20:14:51,239] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.45-port_9092: FetchRequest:9966, queueTime:0, localTime:2, remoteTime:9964, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 20:16:07,643] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.45-port_9092: FetchRequest:9996, queueTime:1, localTime:2, remoteTime:9992, sendTime:1 (kafka.network.RequestChannel$)
[2012-12-07 20:16:07,645] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.45-port_9092: FetchRequest:, queueTime:0, localTime:4, remoteTime:9994, sendTime:1 (kafka.network.RequestChannel$)
[2012-12-07 20:59:22,424] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3502, queueTime:1, localTime:1, remoteTime:3500, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:13:35,042] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3981, queueTime:0, localTime:1, remoteTime:3979, sendTime:1 (kafka.network.RequestChannel$)
[2012-12-07 21:13:35,042] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4095, queueTime:1, localTime:6, remoteTime:4088, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:13:57,254] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4116, queueTime:1, localTime:1, remoteTime:4113, sendTime:1 (kafka.network.RequestChannel$)
[2012-12-07 21:13:57,300] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3844, queueTime:1, localTime:3795, remoteTime:48, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:14:19,645] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4239, queueTime:1, localTime:1, remoteTime:4236, sendTime:1 (kafka.network.RequestChannel$)
[2012-12-07 21:14:19,689] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3977, queueTime:3931, localTime:8, remoteTime:38, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:23:58,427] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3940, queueTime:1, localTime:1, remoteTime:3938, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:23:58,435] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3858, queueTime:1, localTime:6, remoteTime:3851, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:24:21,575] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4037, queueTime:0, localTime:1, remoteTime:4036, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:24:21,583] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3962, queueTime:1, localTime:4, remoteTime:3956, sendTime:1 (kafka.network.RequestChannel$)
[2012-12-07 21:24:43,965] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4294, queueTime:1, localTime:1, remoteTime:4292, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:24:44,013] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3962, queueTime:1, localTime:3919, remoteTime:41, sendTime:1 (kafka.network.RequestChannel$)
[2012-12-07 21:25:06,157] TRACE Completed request with correlation id 0 and client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4587, queueTime:1, localTime:1, remoteTime:4585, sendTime:0 (kafka.network.RequestChannel$)
[2012-12-07 21:25:06,162] TRACE Completed reques
[jira] [Commented] (KAFKA-644) System Test should run properly with mixed File System Pathname
[ https://issues.apache.org/jira/browse/KAFKA-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526851#comment-13526851 ] Neha Narkhede commented on KAFKA-644:
-
+1. Thanks for the patch!

> System Test should run properly with mixed File System Pathname
> ---------------------------------------------------------------
>
>                 Key: KAFKA-644
>                 URL: https://issues.apache.org/jira/browse/KAFKA-644
>             Project: Kafka
>          Issue Type: Task
>            Reporter: John Fung
>            Assignee: John Fung
>              Labels: replication-testing
>         Attachments: kafka-644-v1.patch
>
>
> Currently, System Test assumes that all the entities (ZK, Broker, Producer, Consumer) are running on machines that have the same file system pathname as the machine on which the System Test scripts are running.
> Usually, our own local boxes would be like /home/kafka/. . . and remote boxes may look like /mnt/. . .
> In this case, System Test won't work properly.
[jira] [Updated] (KAFKA-644) System Test should run properly with mixed File System Pathname
[ https://issues.apache.org/jira/browse/KAFKA-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-644:
-
Resolution: Fixed
Status: Resolved (was: Patch Available)
[jira] [Closed] (KAFKA-644) System Test should run properly with mixed File System Pathname
[ https://issues.apache.org/jira/browse/KAFKA-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede closed KAFKA-644. --- > System Test should run properly with mixed File System Pathname > --- > > Key: KAFKA-644 > URL: https://issues.apache.org/jira/browse/KAFKA-644 > Project: Kafka > Issue Type: Task >Reporter: John Fung >Assignee: John Fung > Labels: replication-testing > Attachments: kafka-644-v1.patch > > > Currently, System Test assumes that all the entities (ZK, Broker, Producer, > Consumer) are running in machines which have the same File System Pathname as > the machine in which the System Test scripts are running. > Usually, our own local boxes would be like /home/kafka/. . . > and remote boxes may look like /mnt/. . . > In this case, System Test won't work properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (KAFKA-597) Refactor KafkaScheduler
[ https://issues.apache.org/jira/browse/KAFKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Kreps updated KAFKA-597: Attachment: KAFKA-597-v4.patch Patch v4. - Rebased - Makes use of thread factory - Fixed broken scaladoc > Refactor KafkaScheduler > --- > > Key: KAFKA-597 > URL: https://issues.apache.org/jira/browse/KAFKA-597 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.8.1 >Reporter: Jay Kreps >Priority: Minor > Attachments: KAFKA-597-v1.patch, KAFKA-597-v2.patch, > KAFKA-597-v3.patch, KAFKA-597-v4.patch > > > It would be nice to clean up KafkaScheduler. Here is what I am thinking. > Extract the following interface: > trait Scheduler { > def startup() > def schedule(fun: () => Unit, name: String, delayMs: Long = 0, periodMs: > Long): Scheduled > def shutdown(interrupt: Boolean = false) > } > class Scheduled { > def lastExecution: Long > def cancel() > } > We would have two implementations, KafkaScheduler and MockScheduler. > KafkaScheduler would be a wrapper for ScheduledThreadPoolExecutor. > MockScheduler would only allow manual time advancement rather than using the > system clock; we would switch unit tests over to this. > This change would be different from the existing scheduler in the following > ways: > 1. Would not return a ScheduledFuture (since this is useless) > 2. shutdown() would be a blocking call. The current shutdown calls don't > really do what people want. > 3. We would remove the daemon thread flag, as I don't think it works. > 4. It returns an object which lets you cancel the job or get the last > execution time. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
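The Scheduler/MockScheduler split proposed above can be sketched with a toy model. This is a hypothetical Python sketch (the real interface is the Scala trait quoted in the issue description); it shows how a priority queue ordered by execution time supports manual time advancement, and why that makes the scheduler safe even when scheduled tasks schedule further tasks:

```python
import heapq

class MockScheduler:
    """Toy scheduler driven by manual time advancement (no real threads).

    Tasks sit in a priority queue ordered by next execution time, so
    advancing the clock runs them in order even when a running task
    schedules new tasks (the reentrancy case).
    """
    def __init__(self):
        self.now = 0
        self._queue = []   # heap of (run_at, seq, fn, period)
        self._seq = 0      # tie-breaker so equal run times never compare fns

    def schedule(self, fn, delay=0, period=None):
        heapq.heappush(self._queue, (self.now + delay, self._seq, fn, period))
        self._seq += 1

    def advance(self, ms):
        """Move the mock clock forward, executing every task that comes due."""
        target = self.now + ms
        while self._queue and self._queue[0][0] <= target:
            run_at, _, fn, period = heapq.heappop(self._queue)
            self.now = run_at       # a task observes the time it fires at
            fn()
            if period is not None:  # periodic tasks re-enqueue themselves
                self.schedule(fn, delay=period, period=period)
        self.now = target
```

A unit test would then call `advance()` instead of sleeping, making time-dependent behavior fully deterministic.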
[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test
[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526876#comment-13526876 ] Joel Koshy commented on KAFKA-664: -- To clarify, the map itself shouldn't grow indefinitely right? - i.e., if there are no new partitions the number of keys should be the same. I think the issue is that expired requests (for a key) are not removed from the list of outstanding requests for that key. > Kafka server threads die due to OOME during long running test > - > > Key: KAFKA-664 > URL: https://issues.apache.org/jira/browse/KAFKA-664 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.8 >Reporter: Neha Narkhede >Assignee: Jay Kreps >Priority: Blocker > Labels: bugs > Fix For: 0.8 > > Attachments: thread-dump.log > > > I set up a Kafka cluster with 5 brokers (JVM memory 512M) and set up a long > running producer process that sends data to 100s of partitions continuously > for ~15 hours. After ~4 hours of operation, few server threads (acceptor and > processor) exited due to OOME - > [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread > 'kafka-acceptor': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread > 'kafka-processor-9092-1': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, > session 0x13afd0753870103 has expired, closing socket connection > (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) > (org.I0Itec.zkclient.ZkClient) > [2012-12-07 08:24:46,344] INFO Initiating client connection, > connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913 > 
sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 > (org.apache.zookeeper.ZooKeeper) > [2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread > 'kafka-request-handler-0': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:25:08,739] INFO Opening socket connection to server > eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:14,221] INFO Socket connection established to > eat1-app311.corp/172.20.72.75:12913, initiating session > (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from > server in 3722ms for sessionid 0x0, closing socket connection and attempting > reconnect (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > It seems like it runs out of memory while trying to read the producer > request, but its unclear so far. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (KAFKA-636) Make log segment delete asynchronous
[ https://issues.apache.org/jira/browse/KAFKA-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Kreps updated KAFKA-636: Attachment: KAFKA-636-v1.patch This patch implements asynchronous delete in the log. To do this, Log.scala now requires a scheduler to be used for scheduling the deletions. The deletion works as described above. The locking for segment deletion can now be more aggressive: since the file renames are assumed to be fast, they can happen inside the lock. As part of testing this I also found a problem with MockScheduler, namely that it is not reentrant. That is, if scheduled tasks themselves create scheduled tasks, it misbehaves. To fix this I rewrote MockScheduler to use a priority queue. The code is simpler and more correct, since it now performs all executions in the correct order too. > Make log segment delete asynchronous > > > Key: KAFKA-636 > URL: https://issues.apache.org/jira/browse/KAFKA-636 > Project: Kafka > Issue Type: Bug >Reporter: Jay Kreps >Assignee: Jay Kreps > Attachments: KAFKA-636-v1.patch > > > We have a few corner-case bugs around delete of segment files: > 1. It is possible for delete and truncate to kind of cross streams and end up > with a case where you have no segments. > 2. Reads on the log have no locking (which is good) but as a result deleting > a segment that is being read will result in some kind of I/O exception. > 3. We can't easily fix the synchronization problems without deleting files > inside the log's write lock. This can be a problem as deleting a 2GB segment > can take a couple of seconds even on an unloaded system. > The proposed fix for these problems is to make file removal asynchronous > using the following scheme as the new delete scheme: > 1. Immediately remove the file from the segment map and rename the file from X to > X.deleted (e.g. 000.log to 00.log.deleted). We think renaming a file > will not impact reads since the file is already open and hence the name is > irrelevant. 
This will always be O(1) and can be done inside the write lock. > 2. Schedule a future operation to delete the file. The time to wait would be > configurable but we would just default it to 60 seconds and probably no one > would ever change it. > 3. On startup we would delete any files with the .deleted suffix as they > would have been pending deletes that didn't take place. > I plan to do this soon working against the refactored log (KAFKA-521). We can > opt to back port the patch for 0.8 if we are feeling daring. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
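The rename-then-schedule-delete scheme in steps 1-3 above can be sketched as follows. This is a hedged Python model with invented names (`async_delete_segment`, `cleanup_on_startup`); the actual implementation lives in Log.scala and uses the Kafka scheduler rather than a raw timer:

```python
import os
import threading

DELETED_SUFFIX = ".deleted"

def async_delete_segment(path, delay_s=60.0):
    """Rename the segment immediately (O(1), safe inside the log's write
    lock), then delete the renamed file later, outside the lock.
    Readers holding an open handle are unaffected by the rename."""
    renamed = path + DELETED_SUFFIX
    os.rename(path, renamed)
    t = threading.Timer(delay_s, os.remove, args=(renamed,))
    t.daemon = True  # don't keep the process alive just for pending deletes
    t.start()
    return renamed

def cleanup_on_startup(log_dir):
    """Step 3: remove leftovers from deletes that were scheduled
    but never ran before the process exited."""
    for name in os.listdir(log_dir):
        if name.endswith(DELETED_SUFFIX):
            os.remove(os.path.join(log_dir, name))
```

The key property is that the only work done under the lock is the in-memory map removal and the rename; the slow disk reclaim happens later on a background timer.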
[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test
[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526883#comment-13526883 ] Joel Koshy commented on KAFKA-664: -- Okay I'm slightly confused. Even on expiration the request is marked as satisfied. So even if it is not removed from the watcher's list during expiration it will be removed on the next call to collectSatisfiedRequests - which in this case will be when the next produce request arrives to that partition. Which means this should only be due to low-volume partitions that are no longer growing. i.e., the replica fetcher would keep issuing fetch requests that keep expiring but never get removed from the list of pending requests in watchersFor(the-low-volume-partition). > Kafka server threads die due to OOME during long running test > - > > Key: KAFKA-664 > URL: https://issues.apache.org/jira/browse/KAFKA-664 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.8 >Reporter: Neha Narkhede >Assignee: Jay Kreps >Priority: Blocker > Labels: bugs > Fix For: 0.8 > > Attachments: thread-dump.log > > > I set up a Kafka cluster with 5 brokers (JVM memory 512M) and set up a long > running producer process that sends data to 100s of partitions continuously > for ~15 hours. 
After ~4 hours of operation, few server threads (acceptor and > processor) exited due to OOME - > [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread > 'kafka-acceptor': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread > 'kafka-processor-9092-1': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, > session 0x13afd0753870103 has expired, closing socket connection > (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) > (org.I0Itec.zkclient.ZkClient) > [2012-12-07 08:24:46,344] INFO Initiating client connection, > connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913 > sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 > (org.apache.zookeeper.ZooKeeper) > [2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread > 'kafka-request-handler-0': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:25:08,739] INFO Opening socket connection to server > eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:14,221] INFO Socket connection established to > eat1-app311.corp/172.20.72.75:12913, initiating session > (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from > server in 3722ms for sessionid 0x0, closing socket connection and attempting > reconnect (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:19,805] ERROR error in loggedRunnable 
(kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > It seems like it runs out of memory while trying to read the producer > request, but its unclear so far. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test
[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526887#comment-13526887 ] Jay Kreps commented on KAFKA-664: - Another issue is that we are saving the full producer request in memory for as long as it is in purgatory. Not sure that is causing this, but that is pretty bad. > Kafka server threads die due to OOME during long running test > - > > Key: KAFKA-664 > URL: https://issues.apache.org/jira/browse/KAFKA-664 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.8 >Reporter: Neha Narkhede >Assignee: Jay Kreps >Priority: Blocker > Labels: bugs > Fix For: 0.8 > > Attachments: thread-dump.log > > > I set up a Kafka cluster with 5 brokers (JVM memory 512M) and set up a long > running producer process that sends data to 100s of partitions continuously > for ~15 hours. After ~4 hours of operation, few server threads (acceptor and > processor) exited due to OOME - > [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread > 'kafka-acceptor': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread > 'kafka-processor-9092-1': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, > session 0x13afd0753870103 has expired, closing socket connection > (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) > (org.I0Itec.zkclient.ZkClient) > [2012-12-07 08:24:46,344] INFO Initiating client connection, > connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913 > sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 > (org.apache.zookeeper.ZooKeeper) > 
[2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread > 'kafka-request-handler-0': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:25:08,739] INFO Opening socket connection to server > eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:14,221] INFO Socket connection established to > eat1-app311.corp/172.20.72.75:12913, initiating session > (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from > server in 3722ms for sessionid 0x0, closing socket connection and attempting > reconnect (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > It seems like it runs out of memory while trying to read the producer > request, but its unclear so far. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (KAFKA-644) System Test should run properly with mixed File System Pathname
[ https://issues.apache.org/jira/browse/KAFKA-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Fung updated KAFKA-644: Attachment: kafka-644-v2.patch Uploaded kafka-644-v2.patch which supports the property "auto_create_topic" > System Test should run properly with mixed File System Pathname > --- > > Key: KAFKA-644 > URL: https://issues.apache.org/jira/browse/KAFKA-644 > Project: Kafka > Issue Type: Task >Reporter: John Fung >Assignee: John Fung > Labels: replication-testing > Attachments: kafka-644-v1.patch, kafka-644-v2.patch > > > Currently, System Test assumes that all the entities (ZK, Broker, Producer, > Consumer) are running in machines which have the same File System Pathname as > the machine in which the System Test scripts are running. > Usually, our own local boxes would be like /home/kafka/. . . > and remote boxes may look like /mnt/. . . > In this case, System Test won't work properly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (KAFKA-646) Provide aggregate stats at the high level Producer and ZookeeperConsumerConnector level
[ https://issues.apache.org/jira/browse/KAFKA-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Swapnil Ghike updated KAFKA-646: Attachment: kafka-646-patch-num1-v1.patch This patch has a bunch of refactoring changes and a couple of new additions. Addressing Jun's comments: These are all great catches! Thanks for being so thorough. 60. By default, metrics-core will return an existing metric object of the same name using getOrCreate()-like functionality. As discussed offline, we should fail the clients that use an already registered clientId name. We will need to create two objects that contain hashmaps to record the existing producer and consumer clientIds, and methods to throw an exception if a client attempts to use an existing clientId. I worked on this change a bit, but it breaks a lot of our unit tests (about half) and the refactoring will take some time. Hence, I think it will be better if I submit a patch for all other changes and create another patch for this issue under this jira. Until then we can keep this jira open. 61. For recording stats about all topics, I am now using the string "All.Topics". Since '.' is not allowed in the legal character set for topic names, this will differentiate it from a topic named AllTopics. 62. Yes, we should validate groupId. Added the functionality and a unit test. It has the same validation rules as ClientId. 63. A metric name is something like (clientId + topic + some string) and this entire string is limited by filename size. We already allow topic names to be at most 255 bytes long. We could fix max lengths for each of clientId, groupId, and topic name so that the metric name never exceeds filename size. But those lengths would be quite arbitrary; perhaps we should skip the check on the length of clientId and groupId. 64. Removed brokerInfo from the clientId used to instantiate FetchRequestBuilder. Refactoring: 1. 
Moved validation of clientId to the end of instantiation of ProducerConfig and ConsumerConfig. - Created static objects ProducerConfig and ConsumerConfig which contain a validate() method. 2. Created global *Registry objects in which each high level Producer and Consumer can register their *stats objects. - These objects are registered in the static object only once using utils.Pool.getAndMaybePut functionality. - This will remove the need to pass *stats objects around the code in constructors (I thought having the metrics objects right up in the constructors was a bit intrusive, since one doesn't quite always think about the monitoring mechanism while instantiating various modules of the program, for example while unit testing.) - Instead of the constructor, each concerned class obtains the *Stats objects from the global registry object. - This cleans up any metrics objects created in the unit tests. - Special mention: The producer constructors are back to their old selves. With clientId validation moved to *Config objects, the intermediate Producer constructor that merely separated the parameters of a quadruplet is gone. 3. Created separate files - for ProducerStats, ProducerTopicStats, ProducerRequestStats in kafka.producer package and for FetchRequestAndResponseStats in kafka.consumer package. Thought it was appropriate given that we already had ConsumerTopicStats in a separate file, and since the code for metrics had increased in size due to the addition of *Registry and Aggregated* objects. Added comments. - for objects Topic, ClientId and GroupId in kafka.utils package. - to move the helper case classes ClientIdAndTopic, ClientIdAndBroker to kafka.common package. 4. Renamed a few variables to easier names (anyOldName to "metricId" change). New additions: 1. Added two objects to aggregate metrics recorded by SyncProducers and SimpleConsumers at the high level Producer and Consumer. - For this, changed KafkaTimer to accept a list of Timers. 
Typically we will pass a specificTimer and a globalTimer to this KafkaTimer class. Created a new KafkaHistogram in a similar way. 2. Validation of groupId. Issues: 1. Initializing the aggregator metrics with default values: For example, let's say that a syncProducer could be created (which will register a ProducerRequestStats mbean for this syncProducer). However, if no request is sent by this syncProducer, then the absence of its data is not reflected in the aggregator histogram. For instance, the min requestSize for the syncProducer that never sent a request will be 0, but this won't be accurately represented in the aggregator histogram. Thus, we need to understand that if the request count of a syncProducer is 0, then its data will not be accurately reflected in the aggregator histogram. The question is whether it is possible to inform the aggregator histogram of some default values without increasing the request count of any syncProducer or the aggregated stats. Further
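The two mechanisms discussed above - a getAndMaybePut-style registry keyed by clientId, and a timer that records into both a specific and an aggregate stat - can be modeled roughly like this. This is a Python sketch with invented names (`StatsRegistry`, `timed`); the real code uses metrics-core objects and utils.Pool:

```python
import time

class Stats:
    """Minimal stat holder: an event count plus raw timings."""
    def __init__(self):
        self.count = 0
        self.timings = []

    def record(self, elapsed):
        self.count += 1
        self.timings.append(elapsed)

class StatsRegistry:
    """getAndMaybePut-style registry: one Stats object per clientId,
    shared by every producer/consumer instance using that clientId."""
    def __init__(self):
        self._stats = {}

    def get_or_create(self, client_id):
        # dict.setdefault plays the role of Pool.getAndMaybePut here
        return self._stats.setdefault(client_id, Stats())

def timed(fn, *stats_objects):
    """Run fn, recording the elapsed time into every given Stats object,
    e.g. a per-broker timer plus a client-level aggregate."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    for s in stats_objects:
        s.record(elapsed)
    return result
```

Note the initialization issue raised above shows up here too: an aggregate `Stats` only ever sees events that were actually recorded, so a producer that never sent a request contributes nothing to the aggregate distribution.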
[jira] [Commented] (KAFKA-597) Refactor KafkaScheduler
[ https://issues.apache.org/jira/browse/KAFKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526966#comment-13526966 ] Joel Koshy commented on KAFKA-597: -- I haven't fully reviewed, but a couple of initial comments: - I think the javadoc on KafkaScheduler's daemon param is a bit misleading as it currently suggests that daemon=true would prevent the VM from shutting down. - The patch inverts the daemon flag on some of the existing usages of KafkaScheduler - i.e., daemon now defaults to true and there are some places where daemon was false. We would need to survey these usages and identify whether it makes sense to keep them non-daemon or not. - The other question is on shutdownNow: the previous scheduler allowed the relaxed shutdown - i.e., don't interrupt threads that are currently executing. This change forces all shutdowns to use shutdownNow. Question is whether there are existing tasks that need to complete that would not tolerate an interrupt. I'm not sure about that - we'll need to look at existing usages. E.g., KafkaServer's kafkaScheduler used the shutdown() method - now it's effectively shutdownNow. > Refactor KafkaScheduler > --- > > Key: KAFKA-597 > URL: https://issues.apache.org/jira/browse/KAFKA-597 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.8.1 >Reporter: Jay Kreps >Priority: Minor > Attachments: KAFKA-597-v1.patch, KAFKA-597-v2.patch, > KAFKA-597-v3.patch, KAFKA-597-v4.patch > > > It would be nice to cleanup KafkaScheduler. Here is what I am thinking > Extract the following interface: > trait Scheduler { > def startup() > def schedule(fun: () => Unit, name: String, delayMs: Long = 0, periodMs: > Long): Scheduled > def shutdown(interrupt: Boolean = false) > } > class Scheduled { > def lastExecution: Long > def cancel() > } > We would have two implementations, KafkaScheduler and MockScheduler. > KafkaScheduler would be a wrapper for ScheduledThreadPoolExecutor. 
> MockScheduler would only allow manual time advancement rather than using the > system clock, we would switch unit tests over to this. > This change would be different from the existing scheduler in a the following > ways: > 1. Would not return a ScheduledFuture (since this is useless) > 2. shutdown() would be a blocking call. The current shutdown calls, don't > really do what people want. > 3. We would remove the daemon thread flag, as I don't think it works. > 4. It returns an object which let's you cancel the job or get the last > execution time. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (KAFKA-664) Kafka server threads die due to OOME during long running test
[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neha Narkhede updated KAFKA-664: Attachment: watchersForKey.png kafka-664-draft.patch The problem was ever increasing requests in the watchersForKey map. Please look at the graph attached. This can happen for very low volume topics since the replica fetcher requests keep entering this map, and since there are no more produce requests coming for those topics/partitions, no one ever removes those requests from the map. With Joel's help, hacked RequestPurgatory to force the cleanup of expired/satisfied requests by the expiry thread inside purgeSatisfied. Of course, a better solution is re-designing the purgatory data structure to point from the queue to the map, but that is a bigger change. I just want to get around this issue and continue performance testing. > Kafka server threads die due to OOME during long running test > - > > Key: KAFKA-664 > URL: https://issues.apache.org/jira/browse/KAFKA-664 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.8 >Reporter: Neha Narkhede >Assignee: Jay Kreps >Priority: Blocker > Labels: bugs > Fix For: 0.8 > > Attachments: kafka-664-draft.patch, thread-dump.log, > watchersForKey.png > > > I set up a Kafka cluster with 5 brokers (JVM memory 512M) and set up a long > running producer process that sends data to 100s of partitions continuously > for ~15 hours. 
After ~4 hours of operation, few server threads (acceptor and > processor) exited due to OOME - > [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread > 'kafka-acceptor': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread > 'kafka-processor-9092-1': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, > session 0x13afd0753870103 has expired, closing socket connection > (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) > (org.I0Itec.zkclient.ZkClient) > [2012-12-07 08:24:46,344] INFO Initiating client connection, > connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913 > sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 > (org.apache.zookeeper.ZooKeeper) > [2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread > 'kafka-request-handler-0': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:25:08,739] INFO Opening socket connection to server > eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:14,221] INFO Socket connection established to > eat1-app311.corp/172.20.72.75:12913, initiating session > (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from > server in 3722ms for sessionid 0x0, closing socket connection and attempting > reconnect (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:19,805] ERROR error in loggedRunnable 
(kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > It seems like it runs out of memory while trying to read the producer > request, but its unclear so far. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
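The leak described above, and the purgeSatisfied workaround, can be illustrated with a toy model. This is a hedged Python sketch with invented names, not the actual RequestPurgatory code: a request on a key that never sees another produce request stays in the watchers map forever unless the expiry pass also purges it.

```python
from collections import defaultdict

class DelayedRequest:
    def __init__(self, created_at, timeout):
        self.expires_at = created_at + timeout
        self.satisfied = False

class RequestPurgatory:
    """Toy purgatory: requests wait per key until satisfied or expired."""
    def __init__(self):
        self.watchers_for_key = defaultdict(list)

    def watch(self, key, request):
        self.watchers_for_key[key].append(request)

    def expire(self, now):
        """Expiry thread: mark timed-out requests satisfied, then purge.
        Without the purge step, expired requests for a low-volume key
        would accumulate in watchers_for_key indefinitely (the leak)."""
        for reqs in self.watchers_for_key.values():
            for r in reqs:
                if not r.satisfied and now >= r.expires_at:
                    r.satisfied = True
        self.purge_satisfied()

    def purge_satisfied(self):
        """The workaround: drop satisfied requests from every watcher list,
        removing keys whose lists become empty."""
        for key in list(self.watchers_for_key):
            self.watchers_for_key[key] = [
                r for r in self.watchers_for_key[key] if not r.satisfied]
            if not self.watchers_for_key[key]:
                del self.watchers_for_key[key]
```

The longer-term redesign mentioned above (pointing from the expiry queue back into the map) would make this purge O(1) per expired request instead of a full scan.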
[jira] [Comment Edited] (KAFKA-664) Kafka server threads die due to OOME during long running test
[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526973#comment-13526973 ] Neha Narkhede edited comment on KAFKA-664 at 12/8/12 1:56 AM: -- The problem was ever-increasing requests in the watchersForKey map. Please look at the graph attached. In merely 40 minutes of running the broker, the number of requests in the purgatory map shot up to 4 million. This can happen for very low volume topics since the replica fetcher requests keep entering this map, and since there are no more produce requests coming for those topics/partitions, no one ever removes those requests from the map. With Joel's help, hacked RequestPurgatory to force the cleanup of expired/satisfied requests by the expiry thread inside purgeSatisfied. Of course, a better solution is re-designing the purgatory data structure to point from the queue to the map, but that is a bigger change. I just want to get around this issue and continue performance testing. was (Author: nehanarkhede): The problem was ever increasing requests in the watchersForKey map. Please look at the graph attached. This can happen for very low volume topics since the replica fetcher requests keep entering this map, and since there are no more produce requests coming for those topics/partitions, no one ever removes those requests from the map. With Joel's help, hacked RequestPurgatory to force the cleanup of expired/satisfied requests by the expiry thread inside purgeSatisfied. Of course, a better solution is re-designing the purgatory data structure to point from the queue to the map, but that is a bigger change. I just want to get around this issue and continue performance testing. 
> Kafka server threads die due to OOME during long running test > - > > Key: KAFKA-664 > URL: https://issues.apache.org/jira/browse/KAFKA-664 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.8 >Reporter: Neha Narkhede >Assignee: Jay Kreps >Priority: Blocker > Labels: bugs > Fix For: 0.8 > > Attachments: kafka-664-draft.patch, thread-dump.log, > watchersForKey.png > > > I set up a Kafka cluster with 5 brokers (JVM memory 512M) and set up a long > running producer process that sends data to 100s of partitions continuously > for ~15 hours. After ~4 hours of operation, few server threads (acceptor and > processor) exited due to OOME - > [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread > 'kafka-acceptor': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread > 'kafka-processor-9092-1': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, > session 0x13afd0753870103 has expired, closing socket connection > (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) > (org.I0Itec.zkclient.ZkClient) > [2012-12-07 08:24:46,344] INFO Initiating client connection, > connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913 > sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 > (org.apache.zookeeper.ZooKeeper) > [2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread > 'kafka-request-handler-0': (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > 
[2012-12-07 08:25:08,739] INFO Opening socket connection to server > eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:14,221] INFO Socket connection established to > eat1-app311.corp/172.20.72.75:12913, initiating session > (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from > server in 3722ms for sessionid 0x0, closing socket connection and attempting > reconnect (org.apache.zookeeper.ClientCnxn) > [2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$) > java.lang.OutOfMemoryError: Java heap space > [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 > (kafka.network.BoundedByteBufferReceive) > java.lang.OutOfMemoryError: Java heap space > It seems like it runs out of memory while trying to read the producer > request, but it's unclear so far. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
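The cleanup hack described in Neha's comment above can be sketched roughly as follows. This is a minimal, hypothetical Java sketch, not Kafka's actual RequestPurgatory code - the names (Purgatory, DelayedRequest, the purgeSatisfied signature) are illustrative stand-ins. The idea: the expiry thread periodically walks the watchersForKey map and drops requests that are already satisfied or expired, so low-volume keys cannot accumulate watchers forever.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for a purgatory watcher entry.
class DelayedRequest {
    final long expiresAtMs;
    boolean satisfied;
    DelayedRequest(long expiresAtMs) { this.expiresAtMs = expiresAtMs; }
}

// Hypothetical sketch of the cleanup: remove satisfied/expired requests so the
// watchersForKey map stays bounded even for topics with no incoming produce requests.
class Purgatory {
    final Map<String, List<DelayedRequest>> watchersForKey = new HashMap<>();

    void watch(String key, DelayedRequest r) {
        watchersForKey.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
    }

    // Drop satisfied or expired requests; return how many were purged.
    int purgeSatisfied(long nowMs) {
        int purged = 0;
        Iterator<Map.Entry<String, List<DelayedRequest>>> it = watchersForKey.entrySet().iterator();
        while (it.hasNext()) {
            List<DelayedRequest> watchers = it.next().getValue();
            Iterator<DelayedRequest> w = watchers.iterator();
            while (w.hasNext()) {
                DelayedRequest r = w.next();
                if (r.satisfied || r.expiresAtMs <= nowMs) { w.remove(); purged++; }
            }
            if (watchers.isEmpty()) it.remove();  // drop empty key entries too
        }
        return purged;
    }
}
```

As noted in the comment, this is a workaround; the cleaner fix is restructuring the purgatory so the expiry queue points back into the map.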
[jira] [Commented] (KAFKA-646) Provide aggregate stats at the high level Producer and ZookeeperConsumerConnector level
[ https://issues.apache.org/jira/browse/KAFKA-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527019#comment-13527019 ] Swapnil Ghike commented on KAFKA-646: - Actually I just realized that the Aggregated*Stats objects that I have created are not consistent with the way "All.Topics-MessageRate" is measured. It is possible to measure "All.Brokers-producerRequestSize" in a similar way in ProducerRequestStats, but it's not possible to measure "All.Brokers-ProduceRequestRateAndTimeMs" in the same manner since we use a timer block. To make everything look consistent, I can delete the aggregator objects from my v1 patch and create a KafkaMeter class that accepts a list of meters. I will upload another version of the patch. > Provide aggregate stats at the high level Producer and > ZookeeperConsumerConnector level > --- > > Key: KAFKA-646 > URL: https://issues.apache.org/jira/browse/KAFKA-646 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.8 >Reporter: Swapnil Ghike >Assignee: Swapnil Ghike >Priority: Blocker > Labels: bugs > Fix For: 0.8 > > Attachments: kafka-646-patch-num1-v1.patch > > > With KAFKA-622, we measure ProducerRequestStats and > FetchRequestAndResponseStats at the SyncProducer and SimpleConsumer level > respectively. We could also aggregate them at the high level Producer and > ZookeeperConsumerConnector level to provide an overall sense of > request/response rate/size at the client level. Currently, I am not > completely clear about the math that might be necessary for such aggregation, > or whether metrics already provides an API for aggregating stats of the same type. > We should also address the comments by Jun at KAFKA-622; I am copy-pasting > them here: > 60. What happens if we have 2 instances of Consumers with the same clientid in > the same jvm? Does one of them fail because it fails to register metrics? > Ditto for Producers. > 61. ConsumerTopicStats: What if a topic is named AllTopics? 
We used to handle > this by adding a - in topic specific stats. > 62. ZookeeperConsumerConnector: Do we need to validate groupid? > 63. ClientId: Does the clientid length need to be different from the topic length? > 64. AbstractFetcherThread: When building a fetch request, do we need to pass > in brokerInfo as part of the client id? BrokerInfo contains the source broker > info and the fetch requests are always made to the source broker.
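The "KafkaMeter that accepts a list of meters" idea from the comment above can be sketched as below. SimpleMeter and MeterGroup are hypothetical stand-ins (the real code would use the metrics library's Meter); the point is only that one update fans out to both the broker-specific meter and the "All.Brokers" aggregate, so the two can never drift apart.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Minimal counter-style "meter", used only to illustrate aggregation.
class SimpleMeter {
    final AtomicLong count = new AtomicLong();
    void mark(long n) { count.addAndGet(n); }
}

// One mark() fans out to every meter in the list, e.g. the per-broker meter
// plus the aggregate, keeping broker-level and "All.Brokers" stats consistent.
class MeterGroup {
    final List<SimpleMeter> meters;
    MeterGroup(SimpleMeter... meters) { this.meters = Arrays.asList(meters); }
    void mark(long n) { for (SimpleMeter m : meters) m.mark(n); }
}
```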
[jira] [Updated] (KAFKA-597) Refactor KafkaScheduler
[ https://issues.apache.org/jira/browse/KAFKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Kreps updated KAFKA-597: Attachment: KAFKA-597-v5.patch Thanks, new patch v5 addresses your comments: - Improved the javadoc. - The daemon flag: this is actually good. I thought about it a bit, and since I am making shutdown block, the only time daemon vs non-daemon comes into play is if you don't call shutdown. In that case non-daemon threads will prevent garbage collection of the scheduler tasks and eventually block shutdown of the JVM, which seems unnecessary. - The change to shutdownNow was not good. That would invoke interrupt on all threads, which is too aggressive; better to let them finish. If we end up needing to schedule long-running tasks we can invent a new notification mechanism. I changed this so that we use normal shutdown instead. > Refactor KafkaScheduler > --- > > Key: KAFKA-597 > URL: https://issues.apache.org/jira/browse/KAFKA-597 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.8.1 >Reporter: Jay Kreps >Priority: Minor > Attachments: KAFKA-597-v1.patch, KAFKA-597-v2.patch, > KAFKA-597-v3.patch, KAFKA-597-v4.patch, KAFKA-597-v5.patch > > > It would be nice to clean up KafkaScheduler. Here is what I am thinking. > Extract the following interface: > trait Scheduler { > def startup() > def schedule(fun: () => Unit, name: String, delayMs: Long = 0, periodMs: Long): Scheduled > def shutdown(interrupt: Boolean = false) > } > class Scheduled { > def lastExecution: Long > def cancel() > } > We would have two implementations, KafkaScheduler and MockScheduler. > KafkaScheduler would be a wrapper for ScheduledThreadPoolExecutor. > MockScheduler would only allow manual time advancement rather than using the > system clock; we would switch unit tests over to this. > This change would differ from the existing scheduler in the following ways: > 1. It would not return a ScheduledFuture (since this is useless). > 2. shutdown() would be a blocking call; the current shutdown calls don't > really do what people want. > 3. We would remove the daemon thread flag, as I don't think it works. > 4. It returns an object which lets you cancel the job or get the last > execution time.
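The proposed design can be sketched roughly as below, assuming the semantics discussed in this thread: a thin wrapper over ScheduledThreadPoolExecutor whose shutdown() blocks until submitted tasks finish (normal shutdown, not shutdownNow). This is a hedged Java sketch with illustrative names, not the actual Scala KafkaScheduler.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch: periodMs > 0 schedules a repeating task, otherwise a one-shot.
class SimpleScheduler {
    private final ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);

    void schedule(Runnable fun, long delayMs, long periodMs) {
        if (periodMs > 0)
            executor.scheduleAtFixedRate(fun, delayMs, periodMs, TimeUnit.MILLISECONDS);
        else
            executor.schedule(fun, delayMs, TimeUnit.MILLISECONDS);
    }

    // Blocking shutdown: reject new tasks, let in-flight tasks complete,
    // then wait (bounded) for the executor to terminate.
    void shutdown() throws InterruptedException {
        executor.shutdown();
        executor.awaitTermination(30, TimeUnit.SECONDS);
    }
}
```

Because shutdown() here uses the executor's graceful shutdown rather than shutdownNow(), running tasks are never interrupted, matching the choice made in patch v5.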
[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test
[ https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527039#comment-13527039 ] Joel Koshy commented on KAFKA-664: -- +1 Some minor comments: - We can probably remove the WatchersForKey gauge (or maybe keep it until the RequestPurgatory refactoring is done). - While I agree we should definitely refactor the RequestPurgatory to fix the inefficient scan, I think this approach is not as hacky as it sounds. i.e., on fetch request expiration, this now does what would have been done if a produce request to that key had arrived; so we can consider the overhead of this approach as sending additional produce requests to the affected partition at the rate of fetch expirations (which by default is 2/sec). We can optimize a bit more by adding a threshold for cleanup, i.e., do the iteration and check/removal only if watchers.requests.size > threshold. > Kafka server threads die due to OOME during long running test
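Joel's threshold optimization could look roughly like this. Names are hypothetical (the real watcher list holds delayed requests, not integers): the O(n) scan is skipped entirely until a watcher list has grown past a threshold, so the common case pays nothing while cleanup still bounds the list size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class ThresholdPurge {
    // Scan and remove completed watchers only when the list is large enough
    // to be worth the iteration; return how many entries were removed.
    static <T> int purgeIfNeeded(List<T> watchers, Predicate<T> done, int threshold) {
        if (watchers.size() <= threshold) return 0;  // cheap fast path
        int before = watchers.size();
        watchers.removeIf(done);
        return before - watchers.size();
    }
}
```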
[jira] [Updated] (KAFKA-646) Provide aggregate stats at the high level Producer and ZookeeperConsumerConnector level
[ https://issues.apache.org/jira/browse/KAFKA-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Swapnil Ghike updated KAFKA-646: Attachment: kafka-646-patch-num1-v2.patch Attached patch v2. The changes from patch v1: 1. Deleted the KafkaHistogram class. (There is also no need for a KafkaMeter class.) 2. Deleted the Aggregated*Stats objects. - The metrics of SyncProducer and SimpleConsumer for different brokers are aggregated together in the same way the producerTopicStats are aggregated for "All.Topics". - Measuring the time for produce requests and fetch requests is achieved by passing a list of timers to KafkaTimer. > Provide aggregate stats at the high level Producer and > ZookeeperConsumerConnector level
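The "list of timers" approach from patch v2 can be sketched as follows. RecordingTimer and MultiTimer are hypothetical stand-ins for the metrics library's Timer and the patch's KafkaTimer: the block is timed once and the single measurement is recorded into every timer, so the per-broker timer and the "All.Brokers" aggregate see identical durations.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Illustrative stand-in for a metrics Timer: accumulates durations in ms.
class RecordingTimer {
    long totalMs;
    int count;
    void update(long ms) { totalMs += ms; count++; }
}

// Time the block once, then record the same measurement into every timer.
class MultiTimer {
    final List<RecordingTimer> timers;
    MultiTimer(RecordingTimer... timers) { this.timers = Arrays.asList(timers); }

    <T> T time(Supplier<T> block) {
        long start = System.nanoTime();
        try {
            return block.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            for (RecordingTimer t : timers) t.update(elapsedMs);
        }
    }
}
```

This sidesteps the problem raised in the earlier comment that a timer block cannot be aggregated after the fact the way a meter can: the aggregation happens at measurement time instead.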
[jira] [Comment Edited] (KAFKA-646) Provide aggregate stats at the high level Producer and ZookeeperConsumerConnector level
[ https://issues.apache.org/jira/browse/KAFKA-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527078#comment-13527078 ] Swapnil Ghike edited comment on KAFKA-646 at 12/8/12 6:17 AM: -- Attached patch v2. The changes from patch v1: 1. Deleted the KafkaHistogram class. (There is also no need for a KafkaMeter class.) 2. Deleted the Aggregated*Stats objects. - The metrics of SyncProducer and SimpleConsumer for different brokers are aggregated together in the same way the producerTopicStats are aggregated for "All.Topics". - Measuring the time for produce requests and fetch requests is achieved by passing a list of timers to KafkaTimer. was (Author: swapnilghike): Attached patch v2. The changes from patch v1: 1. Deleted KafkaHistogram class. (There is also no need for KafkaTImer class.) 2. Deleted the Aggregated*Stats objects. - The metrics of SyncProducer and SimpleConsumer for different brokers are aggregated together using the same way the producerTopicStats are aggregated for "All.Topics". - Measuring the time for produce requests and fetch requests is achieved by passing a list of timers to KafkaTimer. > Provide aggregate stats at the high level Producer and > ZookeeperConsumerConnector level
[jira] [Commented] (KAFKA-597) Refactor KafkaScheduler
[ https://issues.apache.org/jira/browse/KAFKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527084#comment-13527084 ] Joel Koshy commented on KAFKA-597: -- On daemon vs non-daemon and shutdown vs shutdownNow: I may be misunderstanding the javadoc, but I think that since the default is now daemon=true and you switched to use shutdown, VM shutdown can continue even in the middle of a scheduled task like checkpointing high watermarks or cleaning up logs. There may be scenarios where it makes sense to make such tasks non-daemon - i.e., set the thread as non-daemon and use shutdown (not shutdownNow - or use shutdownNow and handle InterruptedException properly in the task) to let them finish gracefully. Otherwise (iiuc), if we call shutdown on the executor, the VM could exit and abruptly terminate any running task that was started by the executor in one of the (daemon) threads from its pool. Minor comments: - Line 81 of KafkaScheduler: closing brace is mis-aligned. - The scaladoc on MockScheduler uses a non-existent schedule variant - i.e., I think you intended to add a period < 0, no? > Refactor KafkaScheduler
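The daemon vs non-daemon trade-off discussed above hinges on the ThreadFactory handed to the executor: daemon threads do not keep the JVM alive, so the VM may exit mid-task, while non-daemon threads let tasks finish but can delay JVM exit if shutdown is never called. A minimal sketch (illustrative names, not Kafka code) of wiring the daemon bit into a ScheduledThreadPoolExecutor:

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

class DaemonSchedulerDemo {
    // Build a single-threaded scheduler whose worker thread is (non-)daemon.
    static ScheduledThreadPoolExecutor newExecutor(boolean daemon) {
        ThreadFactory factory = r -> {
            Thread t = new Thread(r, "scheduler-thread");
            t.setDaemon(daemon);  // daemon threads do not block JVM exit
            return t;
        };
        return new ScheduledThreadPoolExecutor(1, factory);
    }
}
```

With daemon=true plus graceful shutdown(), an abrupt JVM exit can still kill an in-flight task, which is exactly the risk Joel raises for tasks like high-watermark checkpointing.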