[jira] [Updated] (KAFKA-581) provides windows batch script for starting Kafka/Zookeeper

2012-12-07 Thread antoine vianey (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

antoine vianey updated KAFKA-581:
-

Attachment: zookeeper-server-stop.bat
kafka-server-stop.bat

Can you add these scripts to the 0.8 branch as well? This is useful when 
starting ZooKeeper and Kafka from the maven-antrun plugin or other plugins 
that create a separate process. ZooKeeper and Storm can be started and 
stopped in the pre- and post-integration-test phases.

Regards
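For what it's worth, the start/stop-around-integration-tests pattern these scripts enable can be sketched in miniature as follows (a Python stand-in for a build step; the sleeping child process is a hypothetical placeholder for the real server started by the .bat scripts):

```python
import subprocess
import sys

# Hypothetical stand-in for zookeeper-server-start.bat: any long-running
# server launched as a separate process before the integration tests run.
server = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"])

# pre-integration-test: the server process is up
assert server.poll() is None

# ... integration tests would run here ...

# post-integration-test: stop the server (the role of the *-stop.bat scripts)
server.terminate()
server.wait(timeout=10)
assert server.poll() is not None  # the process has exited
```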

> provides windows batch script for starting Kafka/Zookeeper
> --
>
> Key: KAFKA-581
> URL: https://issues.apache.org/jira/browse/KAFKA-581
> Project: Kafka
>  Issue Type: Improvement
>  Components: config
>Affects Versions: 0.8
> Environment: Windows
>Reporter: antoine vianey
>Priority: Trivial
>  Labels: features, run, windows
> Fix For: 0.8
>
> Attachments: kafka-console-consumer.bat, kafka-console-producer.bat, 
> kafka-run-class.bat, kafka-server-start.bat, kafka-server-stop.bat, sbt.bat, 
> zookeeper-server-start.bat, zookeeper-server-stop.bat
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Provide a port for quickstarting Kafka dev on Windows:
> - kafka-run-class.bat
> - kafka-server-start.bat
> - zookeeper-server-start.bat
> This will help the Kafka community grow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (KAFKA-581) provides windows batch script for starting Kafka/Zookeeper

2012-12-07 Thread antoine vianey (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526440#comment-13526440
 ] 

antoine vianey edited comment on KAFKA-581 at 12/7/12 3:11 PM:
---

*Added the stop scripts*

Can you add these scripts to the 0.8 branch as well? This is useful when 
starting ZooKeeper and Kafka from the maven-antrun plugin or other plugins 
that create a separate process. ZooKeeper and Storm can be started and 
stopped in the pre- and post-integration-test phases.

Regards

  was (Author: avianey):
Can you add this scripts to the 0.8 branch as well.
This usefull when starting zookeeper and kafka from maven-antrun plugin or 
other that create a separate process... Zookeeper and Storm can be started and 
stopped in the pre and post-integration-test

Regards
  
> provides windows batch script for starting Kafka/Zookeeper
> --
>
> Key: KAFKA-581
> URL: https://issues.apache.org/jira/browse/KAFKA-581
> Project: Kafka
>  Issue Type: Improvement
>  Components: config
>Affects Versions: 0.8
> Environment: Windows
>Reporter: antoine vianey
>Priority: Trivial
>  Labels: features, run, windows
> Fix For: 0.8
>
> Attachments: kafka-console-consumer.bat, kafka-console-producer.bat, 
> kafka-run-class.bat, kafka-server-start.bat, kafka-server-stop.bat, sbt.bat, 
> zookeeper-server-start.bat, zookeeper-server-stop.bat
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Provide a port for quickstarting Kafka dev on Windows:
> - kafka-run-class.bat
> - kafka-server-start.bat
> - zookeeper-server-start.bat
> This will help the Kafka community grow.



[jira] [Comment Edited] (KAFKA-581) provides windows batch script for starting Kafka/Zookeeper

2012-12-07 Thread antoine vianey (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526440#comment-13526440
 ] 

antoine vianey edited comment on KAFKA-581 at 12/7/12 3:11 PM:
---

*Added the server-stop scripts*

Can you add these scripts to the 0.8 branch as well? This is useful when 
starting ZooKeeper and Kafka from the maven-antrun plugin or other plugins 
that create a separate process. ZooKeeper and Storm can be started and 
stopped in the pre- and post-integration-test phases.

Regards

  was (Author: avianey):
*Added the stop scripts*

Can you add this scripts to the 0.8 branch as well.
This usefull when starting zookeeper and kafka from maven-antrun plugin or 
other that create a separate process... Zookeeper and Storm can be started and 
stopped in the pre and post-integration-test

Regards
  
> provides windows batch script for starting Kafka/Zookeeper
> --
>
> Key: KAFKA-581
> URL: https://issues.apache.org/jira/browse/KAFKA-581
> Project: Kafka
>  Issue Type: Improvement
>  Components: config
>Affects Versions: 0.8
> Environment: Windows
>Reporter: antoine vianey
>Priority: Trivial
>  Labels: features, run, windows
> Fix For: 0.8
>
> Attachments: kafka-console-consumer.bat, kafka-console-producer.bat, 
> kafka-run-class.bat, kafka-server-start.bat, kafka-server-stop.bat, sbt.bat, 
> zookeeper-server-start.bat, zookeeper-server-stop.bat
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Provide a port for quickstarting Kafka dev on Windows:
> - kafka-run-class.bat
> - kafka-server-start.bat
> - zookeeper-server-start.bat
> This will help the Kafka community grow.



0.8/HEAD Console consumer breakage?

2012-12-07 Thread ben fleis
So I was testing my own code, and using the console consumer against my
seemingly-working-producer code.  Since the last update, the console
consumer crashes.  I am going to try to track it down in the debugger and
will come back with a patch if found.


Re: 0.8/HEAD Console consumer breakage?

2012-12-07 Thread ben fleis
Dah.  Misfire.  Please ignore if this makes it to an inbox ;)

-b



On Fri, Dec 7, 2012 at 4:13 PM, ben fleis  wrote:

> So I was testing my own code, and using the console consumer against my
> seemingly-working-producer code.  Since the last update, the console
> consumer crashes.  I am going to try to track it down in the debugger and
> will come back with a patch if found.


Re: 0.8 Protocol Status

2012-12-07 Thread ben fleis
I was testing my own code, and using the console consumer against my
seemingly-working-producer code.  Since the last update, the console
consumer crashes.  I am going to try to track it down in the debugger and
will come back with a patch if found.

Command line:
KAFKA_OPTS="-Xmx512M -server
-Dlog4j.configuration=file:$PWD/config/log4j.properties -Xdebug
-Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=4244" \
bin/kafka-console-consumer.sh config/consumer.properties --zookeeper
localhost:2181 --topic types

Stacktrace:
[2012-12-07 16:11:34,421] ERROR Error processing message, stopping
consumer:  (kafka.consumer.ConsoleConsumer$)
java.lang.IllegalArgumentException
    at java.nio.Buffer.limit(Buffer.java:247)
    at kafka.message.Message.sliceDelimited(Message.scala:225)
    at kafka.message.Message.payload(Message.scala:207)
    at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:110)
    at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:33)
    at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:61)
    at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:53)
    at scala.collection.Iterator$class.foreach(Iterator.scala:631)
    at kafka.utils.IteratorTemplate.foreach(IteratorTemplate.scala:32)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:79)
    at kafka.consumer.KafkaStream.foreach(KafkaStream.scala:25)
    at kafka.consumer.ConsoleConsumer$.main(ConsoleConsumer.scala:189)
    at kafka.consumer.ConsoleConsumer.main(ConsoleConsumer.scala)

All advice gladly accepted, including "you blew it, you blind fool!" ;)

-b
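The failure point in the trace is Message.sliceDelimited: a 4-byte size prefix read from the buffer no longer matches the bytes that follow it, so Buffer.limit rejects the value. A rough Python analogue of that slicing logic (not the actual Scala code) shows how a corrupted length prefix produces exactly this kind of error:

```python
import struct

def slice_delimited(buf: bytes, offset: int) -> bytes:
    """Mimic the shape of kafka.message.Message.sliceDelimited: read a
    4-byte big-endian size, then slice that many payload bytes. A size
    larger than the remaining bytes is the analogue of the
    IllegalArgumentException thrown by java.nio.Buffer.limit."""
    (size,) = struct.unpack_from(">i", buf, offset)
    start = offset + 4
    if size < 0 or start + size > len(buf):
        raise ValueError(
            f"size {size} exceeds remaining {len(buf) - start} bytes")
    return buf[start:start + size]

good = struct.pack(">i", 5) + b"hello"
assert slice_delimited(good, 0) == b"hello"

corrupt = struct.pack(">i", 99) + b"hello"   # bogus length prefix
try:
    slice_delimited(corrupt, 0)
    assert False, "expected a failure on the corrupted size"
except ValueError:
    pass
```

This is consistent with the consumer crashing only after a producer-side change: a misframed or corrupted message on disk would surface as exactly this exception when the payload is sliced.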


[jira] [Created] (KAFKA-662) Create testcases for unclean shut down

2012-12-07 Thread John Fung (JIRA)
John Fung created KAFKA-662:
---

 Summary: Create testcases for unclean shut down
 Key: KAFKA-662
 URL: https://issues.apache.org/jira/browse/KAFKA-662
 Project: Kafka
  Issue Type: Bug
Reporter: John Fung
Assignee: John Fung






[jira] [Created] (KAFKA-663) Add "deploy" feature to System Test

2012-12-07 Thread John Fung (JIRA)
John Fung created KAFKA-663:
---

 Summary: Add "deploy" feature to System Test
 Key: KAFKA-663
 URL: https://issues.apache.org/jira/browse/KAFKA-663
 Project: Kafka
  Issue Type: Task
Reporter: John Fung
Assignee: John Fung






[jira] [Updated] (KAFKA-662) Create testcases for unclean shut down

2012-12-07 Thread John Fung (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Fung updated KAFKA-662:


Issue Type: Task  (was: Bug)

> Create testcases for unclean shut down
> --
>
> Key: KAFKA-662
> URL: https://issues.apache.org/jira/browse/KAFKA-662
> Project: Kafka
>  Issue Type: Task
>Reporter: John Fung
>Assignee: John Fung
>




Re: 0.8 Protocol Status

2012-12-07 Thread Neha Narkhede
Ben,

I would try running DumpLogSegments to check whether the server's data was
corrupted by a bug in the producer.
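For reference, DumpLogSegments can be invoked through the run-class wrapper roughly like this (the segment path below is illustrative, not from this thread):

```shell
# Dump a log segment to inspect the on-disk messages (path is illustrative)
bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --files /tmp/kafka-logs/types-0/00000000000000000000.log
```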

Thanks,
Neha

On Fri, Dec 7, 2012 at 7:17 AM, ben fleis  wrote:
> I was testing my own code, and using the console consumer against my
> seemingly-working-producer code.  Since the last update, the console
> consumer crashes.  I am going to try to track it down in the debugger and
> will come back with a patch if found.
>
> Command line:
> KAFKA_OPTS="-Xmx512M -server
> -Dlog4j.configuration=file:$PWD/config/log4j.properties -Xdebug
> -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=4244" \
> bin/kafka-console-consumer.sh config/consumer.properties --zookeeper
> localhost:2181 --topic types
>
> Stacktrace:
> [2012-12-07 16:11:34,421] ERROR Error processing message, stopping
> consumer:  (kafka.consumer.ConsoleConsumer$)
> java.lang.IllegalArgumentException
>     at java.nio.Buffer.limit(Buffer.java:247)
>     at kafka.message.Message.sliceDelimited(Message.scala:225)
>     at kafka.message.Message.payload(Message.scala:207)
>     at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:110)
>     at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:33)
>     at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:61)
>     at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:53)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:631)
>     at kafka.utils.IteratorTemplate.foreach(IteratorTemplate.scala:32)
>     at scala.collection.IterableLike$class.foreach(IterableLike.scala:79)
>     at kafka.consumer.KafkaStream.foreach(KafkaStream.scala:25)
>     at kafka.consumer.ConsoleConsumer$.main(ConsoleConsumer.scala:189)
>     at kafka.consumer.ConsoleConsumer.main(ConsoleConsumer.scala)
>
> All advice gladly accepted, including "you blew it, you blind fool!" ;)
>
> -b


[jira] [Created] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)
Neha Narkhede created KAFKA-664:
---

 Summary: Kafka server threads die due to OOME during long running 
test
 Key: KAFKA-664
 URL: https://issues.apache.org/jira/browse/KAFKA-664
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Neha Narkhede
Priority: Blocker
 Fix For: 0.8


I set up a Kafka cluster with 5 brokers (JVM memory 512M) and a long-running 
producer process that sends data to hundreds of partitions continuously for 
~15 hours. After ~4 hours of operation, a few server threads (acceptor and 
processor) exited due to OOME:

[2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 
(kafka.network.BoundedByteBufferReceive)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 'kafka-acceptor': 
(kafka.utils.Utils$)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
'kafka-processor-9092-1': (kafka.utils.Utils$)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, 
session 0x13afd0753870103 has expired, closing socket connection 
(org.apache.zookeeper.ClientCnxn)
[2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) 
(org.I0Itec.zkclient.ZkClient)
[2012-12-07 08:24:46,344] INFO Initiating client connection, 
connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913
 sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 
(org.apache.zookeeper.ZooKeeper)
[2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 
(kafka.network.BoundedByteBufferReceive)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:25:01,192] ERROR Uncaught exception in thread 
'kafka-request-handler-0': (kafka.utils.Utils$)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:25:08,739] INFO Opening socket connection to server 
eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn)
[2012-12-07 08:25:14,221] INFO Socket connection established to 
eat1-app311.corp/172.20.72.75:12913, initiating session 
(org.apache.zookeeper.ClientCnxn)
[2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from 
server in 3722ms for sessionid 0x0, closing socket connection and attempting 
reconnect (org.apache.zookeeper.ClientCnxn)
[2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$)
java.lang.OutOfMemoryError: Java heap space
[2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 
(kafka.network.BoundedByteBufferReceive)
java.lang.OutOfMemoryError: Java heap space


It seems like it runs out of memory while trying to read the producer request, 
but it's unclear so far. 
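At a high level, BoundedByteBufferReceive reads a 4-byte request size and then allocates a buffer of that size, so a corrupted or misframed size field (such as 1700161893 in the log above) forces a near-2GB allocation on a 512M heap. A hedged Python sketch of the kind of bound check that prevents this (MAX_REQUEST_SIZE is an illustrative stand-in for the broker's socket.request.max.bytes setting):

```python
import struct

MAX_REQUEST_SIZE = 100 * 1024 * 1024  # illustrative bound, not Kafka's default

def read_request_size(header: bytes) -> int:
    """Read the 4-byte big-endian request size that precedes a request.
    Allocating a buffer of an unvalidated size is what turns a garbage
    size field into an OutOfMemoryError; bounding it first fails fast."""
    (size,) = struct.unpack(">i", header)
    if size <= 0 or size > MAX_REQUEST_SIZE:
        raise IOError(f"rejecting request of claimed size {size}")
    return size

assert read_request_size(struct.pack(">i", 4096)) == 4096

try:
    read_request_size(struct.pack(">i", 1700161893))  # size from the log above
    assert False, "expected the bogus size to be rejected"
except IOError:
    pass
```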



[jira] [Updated] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neha Narkhede updated KAFKA-664:


Attachment: thread-dump.log

Attaching a thread dump that shows:

1. Four processor threads and the acceptor threads are dead.
2. The rest of the processor threads have a full request queue and are blocked 
waiting to add to it.
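The second observation, live processor threads stuck on a full request queue once its consumers are gone, can be illustrated with a bounded queue (a Python sketch, not the broker's actual request channel):

```python
import queue

# Bounded queue standing in for the broker's request channel.
requests = queue.Queue(maxsize=2)
requests.put("req-1")
requests.put("req-2")        # queue is now full

# With the request handlers dead, nothing drains the queue, so a
# processor thread's blocking put can never succeed. A timed put shows
# the stall without hanging this sketch.
try:
    requests.put("req-3", timeout=0.1)
    assert False, "queue unexpectedly accepted a third request"
except queue.Full:
    pass                     # this is where the dumped threads are parked
```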

> Kafka server threads die due to OOME during long running test
> -
>
> Key: KAFKA-664
> URL: https://issues.apache.org/jira/browse/KAFKA-664
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Neha Narkhede
>Priority: Blocker
>  Labels: bugs
> Fix For: 0.8
>
> Attachments: thread-dump.log
>
>
> I set up a Kafka cluster with 5 brokers (JVM memory 512M) and a long-running 
> producer process that sends data to hundreds of partitions continuously for 
> ~15 hours. After ~4 hours of operation, a few server threads (acceptor and 
> processor) exited due to OOME:
> [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-acceptor': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-processor-9092-1': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, 
> session 0x13afd0753870103 has expired, closing socket connection 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) 
> (org.I0Itec.zkclient.ZkClient)
> [2012-12-07 08:24:46,344] INFO Initiating client connection, 
> connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913
>  sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 
> (org.apache.zookeeper.ZooKeeper)
> [2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread 
> 'kafka-request-handler-0': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:08,739] INFO Opening socket connection to server 
> eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:14,221] INFO Socket connection established to 
> eat1-app311.corp/172.20.72.75:12913, initiating session 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from 
> server in 3722ms for sessionid 0x0, closing socket connection and attempting 
> reconnect (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> It seems like it runs out of memory while trying to read the producer 
> request, but it's unclear so far. 



[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526590#comment-13526590
 ] 

Neha Narkhede commented on KAFKA-664:
-

Another observation: the server is probably GCing quite a lot, since I see the 
following in the server logs:

[2012-12-07 09:32:14,742] INFO Client session timed out, have not heard from 
server in 1204905ms for sessionid 0x23afd074d6600ea, closing socket connection 
and attempting reconnect (org.apache.zookeeper.ClientCnxn)

The ZooKeeper session timeout is pretty high (15 seconds), and the client is in 
the same DC as the Kafka cluster and the producer.

> Kafka server threads die due to OOME during long running test
> -
>
> Key: KAFKA-664
> URL: https://issues.apache.org/jira/browse/KAFKA-664
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Neha Narkhede
>Priority: Blocker
>  Labels: bugs
> Fix For: 0.8
>
> Attachments: thread-dump.log
>
>
> I set up a Kafka cluster with 5 brokers (JVM memory 512M) and a long-running 
> producer process that sends data to hundreds of partitions continuously for 
> ~15 hours. After ~4 hours of operation, a few server threads (acceptor and 
> processor) exited due to OOME:
> [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-acceptor': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-processor-9092-1': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, 
> session 0x13afd0753870103 has expired, closing socket connection 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) 
> (org.I0Itec.zkclient.ZkClient)
> [2012-12-07 08:24:46,344] INFO Initiating client connection, 
> connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913
>  sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 
> (org.apache.zookeeper.ZooKeeper)
> [2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread 
> 'kafka-request-handler-0': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:08,739] INFO Opening socket connection to server 
> eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:14,221] INFO Socket connection established to 
> eat1-app311.corp/172.20.72.75:12913, initiating session 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from 
> server in 3722ms for sessionid 0x0, closing socket connection and attempting 
> reconnect (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> It seems like it runs out of memory while trying to read the producer 
> request, but it's unclear so far. 



[jira] [Updated] (KAFKA-651) Create testcases on auto create topics

2012-12-07 Thread John Fung (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Fung updated KAFKA-651:


Status: Patch Available  (was: Open)

> Create testcases on auto create topics
> --
>
> Key: KAFKA-651
> URL: https://issues.apache.org/jira/browse/KAFKA-651
> Project: Kafka
>  Issue Type: Task
>Reporter: John Fung
>  Labels: replication-testing
> Attachments: kafka-651-v1.patch
>
>




[jira] [Updated] (KAFKA-651) Create testcases on auto create topics

2012-12-07 Thread John Fung (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Fung updated KAFKA-651:


Attachment: kafka-651-v1.patch

Uploaded kafka-651-v1.patch with one testcase to cover each functional group:
testcase_0011
testcase_0024
testcase_0119
testcase_0128
testcase_0134
testcase_0159
testcase_0209
testcase_0259
testcase_0309

> Create testcases on auto create topics
> --
>
> Key: KAFKA-651
> URL: https://issues.apache.org/jira/browse/KAFKA-651
> Project: Kafka
>  Issue Type: Task
>Reporter: John Fung
>  Labels: replication-testing
> Attachments: kafka-651-v1.patch
>
>




[jira] [Updated] (KAFKA-374) Move to java CRC32 implementation

2012-12-07 Thread David Arthur (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur updated KAFKA-374:
---

Attachment: KAFKA-374.patch

Pure-java crc32 taken from Hadoop

> Move to java CRC32 implementation
> -
>
> Key: KAFKA-374
> URL: https://issues.apache.org/jira/browse/KAFKA-374
> Project: Kafka
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 0.8
>Reporter: Jay Kreps
>Priority: Minor
>  Labels: newbie
> Attachments: KAFKA-374-draft.patch, KAFKA-374.patch
>
>
> We keep a per-record crc32. This is a fairly cheap algorithm, but the Java 
> implementation uses JNI and seems to be a bit expensive for small records. 
> I have seen this before in Kafka profiles, and I noticed it on another 
> application I was working on. Basically, with small records the native 
> implementation can only checksum < 100 MB/sec. Hadoop has done some analysis 
> of this and replaced it with a Java implementation that is 2x faster for 
> large values and 5-10x faster for small values. Details are in HADOOP-6148.
> We should do a quick read/write benchmark on log and message set iteration 
> and see if this improves things.



[jira] [Commented] (KAFKA-374) Move to java CRC32 implementation

2012-12-07 Thread David Arthur (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526624#comment-13526624
 ] 

David Arthur commented on KAFKA-374:


Not sure how you guys feel about having Java in the source tree, but I attached 
a patch with the pure Java implementation (and the other stuff from [~jkreps]'s 
original patch).
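The per-call-overhead effect that HADOOP-6148 measured can be illustrated with a small throughput sketch. Python's zlib.crc32 here is only a stand-in for the JNI-backed java.util.zip.CRC32, and the record sizes are arbitrary; the point is that fixed per-call cost dominates when records are small:

```python
import time
import zlib

def crc_throughput(record_size: int, total_bytes: int) -> float:
    """Checksum total_bytes worth of records of record_size each and
    return throughput in MB/s. Fixed per-call overhead means many small
    records checksum slower than a few large ones."""
    record = b"x" * record_size
    n = total_bytes // record_size
    start = time.perf_counter()
    for _ in range(n):
        zlib.crc32(record)
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e6

small = crc_throughput(100, 10_000_000)        # many small records
large = crc_throughput(1_000_000, 10_000_000)  # few large records
```

Running both and comparing the numbers gives a feel for how much of the cost is call overhead versus the checksum itself, which is the same comparison worth making between the JNI and pure-Java CRC32 on log and message-set iteration.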

> Move to java CRC32 implementation
> -
>
> Key: KAFKA-374
> URL: https://issues.apache.org/jira/browse/KAFKA-374
> Project: Kafka
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 0.8
>Reporter: Jay Kreps
>Priority: Minor
>  Labels: newbie
> Attachments: KAFKA-374-draft.patch, KAFKA-374.patch
>
>
> We keep a per-record crc32. This is a fairly cheap algorithm, but the Java 
> implementation uses JNI and seems to be a bit expensive for small records. 
> I have seen this before in Kafka profiles, and I noticed it on another 
> application I was working on. Basically, with small records the native 
> implementation can only checksum < 100 MB/sec. Hadoop has done some analysis 
> of this and replaced it with a Java implementation that is 2x faster for 
> large values and 5-10x faster for small values. Details are in HADOOP-6148.
> We should do a quick read/write benchmark on log and message set iteration 
> and see if this improves things.



[jira] [Comment Edited] (KAFKA-374) Move to java CRC32 implementation

2012-12-07 Thread David Arthur (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526624#comment-13526624
 ] 

David Arthur edited comment on KAFKA-374 at 12/7/12 6:38 PM:
-

Not sure how you guys feel about having Java in the source tree, but I attached 
a patch with the pure Java implementation (and the other stuff from [~jkreps] 
original patch).

  was (Author: mumrah):
Not sure how you guys feel about having Java in the source tree, but I 
attached a patch with the pure Java implementation (and the other stuff from 
[~jkreps]'s original patch).
  
> Move to java CRC32 implementation
> -
>
> Key: KAFKA-374
> URL: https://issues.apache.org/jira/browse/KAFKA-374
> Project: Kafka
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 0.8
>Reporter: Jay Kreps
>Priority: Minor
>  Labels: newbie
> Attachments: KAFKA-374-draft.patch, KAFKA-374.patch
>
>
> We keep a per-record crc32. This is a fairly cheap algorithm, but the Java 
> implementation uses JNI and seems to be a bit expensive for small records. 
> I have seen this before in Kafka profiles, and I noticed it on another 
> application I was working on. Basically, with small records the native 
> implementation can only checksum < 100 MB/sec. Hadoop has done some analysis 
> of this and replaced it with a Java implementation that is 2x faster for 
> large values and 5-10x faster for small values. Details are in HADOOP-6148.
> We should do a quick read/write benchmark on log and message set iteration 
> and see if this improves things.



[jira] [Issue Comment Deleted] (KAFKA-374) Move to java CRC32 implementation

2012-12-07 Thread David Arthur (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur updated KAFKA-374:
---

Comment: was deleted

(was: Pure-java crc32 taken from Hadoop)

> Move to java CRC32 implementation
> -
>
> Key: KAFKA-374
> URL: https://issues.apache.org/jira/browse/KAFKA-374
> Project: Kafka
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 0.8
>Reporter: Jay Kreps
>Priority: Minor
>  Labels: newbie
> Attachments: KAFKA-374-draft.patch, KAFKA-374.patch
>
>
> We keep a per-record crc32. This is a fairly cheap algorithm, but the Java 
> implementation uses JNI and seems to be a bit expensive for small records. 
> I have seen this before in Kafka profiles, and I noticed it on another 
> application I was working on. Basically, with small records the native 
> implementation can only checksum < 100 MB/sec. Hadoop has done some analysis 
> of this and replaced it with a Java implementation that is 2x faster for 
> large values and 5-10x faster for small values. Details are in HADOOP-6148.
> We should do a quick read/write benchmark on log and message set iteration 
> and see if this improves things.



[jira] [Commented] (KAFKA-657) Add an API to commit offsets

2012-12-07 Thread David Arthur (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526687#comment-13526687
 ] 

David Arthur commented on KAFKA-657:


I have started a wiki page for this design discussion: 
https://cwiki.apache.org/confluence/display/KAFKA/Commit+Offset+API+-+Proposal
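As a rough illustration of the API shape the description below proposes, one call committing a batch of consumer-group/topic/partition/offset entries against a swappable store, here is a hypothetical in-memory sketch (all names are invented for illustration and do not reflect the proposed wire format):

```python
class OffsetStore:
    """Hypothetical in-memory stand-in for the offset store. The 0.8-era
    backing store would be ZooKeeper; the point of the API is that this
    backend becomes swappable later."""

    def __init__(self):
        self._offsets = {}

    def commit(self, group, entries):
        """entries: iterable of (topic, partition, offset) tuples,
        committed together in one call as the proposal suggests."""
        for topic, partition, offset in entries:
            self._offsets[(group, topic, partition)] = offset

    def fetch(self, group, topic, partition):
        """Return the committed offset, or -1 if none was committed."""
        return self._offsets.get((group, topic, partition), -1)

store = OffsetStore()
store.commit("group1", [("types", 0, 42), ("types", 1, 7)])
assert store.fetch("group1", "types", 0) == 42
assert store.fetch("group1", "types", 1) == 7
assert store.fetch("group1", "types", 2) == -1   # never committed
```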

> Add an API to commit offsets
> 
>
> Key: KAFKA-657
> URL: https://issues.apache.org/jira/browse/KAFKA-657
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Jay Kreps
>  Labels: project
>
> Currently consumers write their offsets directly to zookeeper. Two 
> problems with this: (1) it is a poor use of zookeeper, and we need to 
> replace it with a more scalable offset store, and (2) it makes it hard to 
> carry over to clients in other languages. A first step towards accomplishing 
> that is to add a proper Kafka API for committing offsets. The initial version 
> of this would just write to zookeeper as we do today, but in the future we 
> would then have the option of changing this.
> This API likely needs to take a sequence of 
> consumer-group/topic/partition/offset entries and commit them all.
> It would be good to do a wiki design of how this would work and reach 
> consensus on that first.



[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Jay Kreps (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526713#comment-13526713
 ] 

Jay Kreps commented on KAFKA-664:
-

One pain of an OOM is that the thing leaking the memory is not necessarily the
thing that gets the exception. Can you rerun with
-XX:+HeapDumpOnOutOfMemoryError?
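For example, assuming the start scripts honor `KAFKA_OPTS` (kafka-run-class.sh passes it to the JVM), the flag can be added like this; the heap-dump path is just an example location:

```shell
# Enable a heap dump on OutOfMemoryError for the next broker run.
# The dump path below is illustrative; pick a disk with enough space.
export KAFKA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/kafka-oome.hprof"
echo "$KAFKA_OPTS"
# Then restart the broker, e.g.:
#   bin/kafka-server-start.sh config/server.properties
```

The resulting .hprof file can then be opened in a heap analyzer to find the GC roots holding the leaked objects.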

> Kafka server threads die due to OOME during long running test
> -
>
> Key: KAFKA-664
> URL: https://issues.apache.org/jira/browse/KAFKA-664
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Neha Narkhede
>Priority: Blocker
>  Labels: bugs
> Fix For: 0.8
>
> Attachments: thread-dump.log
>
>
> I set up a Kafka cluster with 5 brokers (JVM memory 512M) and set up a long 
> running producer process that sends data to 100s of partitions continuously 
> for ~15 hours. After ~4 hours of operation, a few server threads (acceptor and 
> processor) exited due to OOME -
> [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-acceptor': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-processor-9092-1': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, 
> session 0x13afd0753870103 has expired, closing socket connection 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) 
> (org.I0Itec.zkclient.ZkClient)
> [2012-12-07 08:24:46,344] INFO Initiating client connection, 
> connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913
>  sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 
> (org.apache.zookeeper.ZooKeeper)
> [2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread 
> 'kafka-request-handler-0': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:08,739] INFO Opening socket connection to server 
> eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:14,221] INFO Socket connection established to 
> eat1-app311.corp/172.20.72.75:12913, initiating session 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from 
> server in 3722ms for sessionid 0x0, closing socket connection and attempting 
> reconnect (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> It seems like it runs out of memory while trying to read the producer 
> request, but it's unclear so far. 



[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526719#comment-13526719
 ] 

Neha Narkhede commented on KAFKA-664:
-

Heap dump is here - 
http://people.apache.org/~nehanarkhede/kafka-misc/kafka-0.8/heap-dump.tar.gz
Almost all the largest objects trace back to 
RequestPurgatory$ExpiredRequestReaper as the GC root.

> Kafka server threads die due to OOME during long running test
> -
>
> Key: KAFKA-664
> URL: https://issues.apache.org/jira/browse/KAFKA-664
>
> [full issue description and OOME log quoted in the first message above]



[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526720#comment-13526720
 ] 

Neha Narkhede commented on KAFKA-664:
-

I'm re-running the tests with that option now.

> Kafka server threads die due to OOME during long running test
> -
>
> Key: KAFKA-664
> URL: https://issues.apache.org/jira/browse/KAFKA-664
>
> [full issue description and OOME log quoted in the first message above]



[jira] [Assigned] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Jay Kreps (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Kreps reassigned KAFKA-664:
---

Assignee: Jay Kreps

> Kafka server threads die due to OOME during long running test
> -
>
> Key: KAFKA-664
> URL: https://issues.apache.org/jira/browse/KAFKA-664
> Assignee: Jay Kreps
>
> [full issue description and OOME log quoted in the first message above]



[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Jay Kreps (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526805#comment-13526805
 ] 

Jay Kreps commented on KAFKA-664:
-

Looks like the problem is in the request purgatory: watchers aren't getting removed.

> Kafka server threads die due to OOME during long running test
> -
>
> Key: KAFKA-664
> URL: https://issues.apache.org/jira/browse/KAFKA-664
> Assignee: Jay Kreps
>
> [full issue description and OOME log quoted in the first message above]



[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526808#comment-13526808
 ] 

Neha Narkhede commented on KAFKA-664:
-

The root cause seems to be that the watchersForKey map keeps growing. I see that we 
add keys to the map, but never actually delete them.
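The leak pattern being described can be modeled in a few lines. This is an illustrative sketch, not the actual `RequestPurgatory` code; the key point is the map-entry removal once a watcher list drains, which is the step reported missing.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Minimal model of the watchersForKey leak: keys are inserted on every
// watched request, so without the remove() below, every key ever seen
// stays in the map forever and the heap grows without bound.
public class WatcherMapSketch {
    static final Map<String, List<String>> watchersForKey = new ConcurrentHashMap<>();

    // Register a request as a watcher for the given key.
    static void watch(String key, String request) {
        watchersForKey.computeIfAbsent(key, k -> new CopyOnWriteArrayList<>())
                      .add(request);
    }

    // Satisfy/expire a watcher; purge the key once its list is empty.
    static void expire(String key, String request) {
        List<String> watchers = watchersForKey.get(key);
        if (watchers != null) {
            watchers.remove(request);
            // The missing step in the leaking version: drop the map entry
            // when no watchers remain for this key.
            if (watchers.isEmpty())
                watchersForKey.remove(key, watchers);
        }
    }

    public static void main(String[] args) {
        watch("topic-0", "req-1");
        expire("topic-0", "req-1");
        System.out.println("keys left: " + watchersForKey.size());
    }
}
```

In concurrent code the removal has to race safely with a new watcher arriving for the same key, which is why the sketch uses the two-argument `remove(key, value)` form.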

> Kafka server threads die due to OOME during long running test
> -
>
> Key: KAFKA-664
> URL: https://issues.apache.org/jira/browse/KAFKA-664
> Assignee: Jay Kreps
>
> [full issue description and OOME log quoted in the first message above]



[jira] [Created] (KAFKA-665) Outgoing responses delayed on a busy Kafka broker

2012-12-07 Thread Neha Narkhede (JIRA)
Neha Narkhede created KAFKA-665:
---

 Summary: Outgoing responses delayed on a busy Kafka broker 
 Key: KAFKA-665
 URL: https://issues.apache.org/jira/browse/KAFKA-665
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Neha Narkhede
Priority: Critical
 Fix For: 0.8


In a long running test, I observed that after a few hours of operation, a few 
requests start timing out, mainly because they spend a very long time sitting in 
the response queue -

[2012-12-07 22:05:56,670] TRACE Completed request with correlation id 3965966 
and client : TopicMetadataRequest:4009, queueTime:1, localTime:28, 
remoteTime:0, sendTime:3980 (kafka.network.RequestChannel$)
[2012-12-07 22:04:12,046] TRACE Completed request with correlation id 3962561 
and client : TopicMetadataRequest:3449, queueTime:0, localTime:29, 
remoteTime:0, sendTime:3420 (kafka.network.RequestChannel$)

We might have a problem in the way we process outgoing responses. Basically, if 
the processor thread blocks on enqueuing requests into the request queue, it 
doesn't come around to processing the responses that are ready to go out. 
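The failure mode can be modeled with two bounded queues. This is an illustrative sketch, not the actual SocketServer code; `ProcessorSketch` and the queue capacities are made up. It shows why a blocking put starves the response path, and how a non-blocking offer with a timeout lets the loop fall through and drain responses instead.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Minimal model: while the processor is stuck in requestQueue.put(),
// completed responses sit in responseQueue unsent.
public class ProcessorSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> requestQueue = new ArrayBlockingQueue<>(1);
        BlockingQueue<String> responseQueue = new ArrayBlockingQueue<>(16);

        requestQueue.put("req-0");       // request queue is now full
        responseQueue.put("resp-ready"); // a response is ready to go out

        // A blocking put("req-1") would hang here and the ready response
        // would never be written. offer(..., timeout) returns false when
        // the queue stays full, so the loop can service responses instead.
        boolean accepted = requestQueue.offer("req-1", 10, TimeUnit.MILLISECONDS);
        if (!accepted) {
            String resp = responseQueue.poll();
            System.out.println("sent " + resp + " while request queue was full");
        }
    }
}
```

In a real processor loop this interleaving would run on every selector iteration, so responses keep flowing even under request-queue backpressure.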



[jira] [Created] (KAFKA-666) Fetch requests from the replicas take several seconds to complete on the leader

2012-12-07 Thread Neha Narkhede (JIRA)
Neha Narkhede created KAFKA-666:
---

 Summary: Fetch requests from the replicas take several seconds to 
complete on the leader
 Key: KAFKA-666
 URL: https://issues.apache.org/jira/browse/KAFKA-666
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Neha Narkhede
Priority: Critical
 Fix For: 0.8


I've seen that fetch requests from the replicas take several seconds to 
complete. The nature of the latency breakdown varies: sometimes they spend too 
long sitting in the request/response queue, and sometimes the local/remote time 
is too large -

[2012-12-07 20:59:22,424] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3502, 
queueTime:1, localTime:1, remoteTime:3500, sendTime:0 
(kafka.network.RequestChannel$)
[2012-12-07 20:59:22,611] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3301, 
queueTime:1, localTime:3118, remoteTime:181, sendTime:1 
(kafka.network.RequestChannel$)
[2012-12-07 21:13:35,042] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3981, 
queueTime:0, localTime:1, remoteTime:3979, sendTime:1 
(kafka.network.RequestChannel$)
[2012-12-07 21:13:35,042] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4095, 
queueTime:1, localTime:6, remoteTime:4088, sendTime:0 
(kafka.network.RequestChannel$)
[2012-12-07 21:13:57,254] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4116, 
queueTime:1, localTime:1, remoteTime:4113, sendTime:1 
(kafka.network.RequestChannel$)
[2012-12-07 21:13:57,300] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3844, 
queueTime:1, localTime:3795, remoteTime:48, sendTime:0 
(kafka.network.RequestChannel$)
[2012-12-07 21:14:19,645] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4239, 
queueTime:1, localTime:1, remoteTime:4236, sendTime:1 
(kafka.network.RequestChannel$)
[2012-12-07 21:14:19,689] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3977, 
queueTime:3931, localTime:8, remoteTime:38, sendTime:0 
(kafka.network.RequestChannel$)
[2012-12-07 21:23:58,427] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3940, 
queueTime:1, localTime:1, remoteTime:3938, sendTime:0 
(kafka.network.RequestChannel$)
[2012-12-07 21:23:58,435] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3858, 
queueTime:1, localTime:6, remoteTime:3851, sendTime:0 
(kafka.network.RequestChannel$)
[2012-12-07 21:24:21,575] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4037, 
queueTime:0, localTime:1, remoteTime:4036, sendTime:0 
(kafka.network.RequestChannel$)
[2012-12-07 21:24:21,583] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3962, 
queueTime:1, localTime:4, remoteTime:3956, sendTime:1 
(kafka.network.RequestChannel$)
[2012-12-07 21:24:43,965] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4294, 
queueTime:1, localTime:1, remoteTime:4292, sendTime:0 
(kafka.network.RequestChannel$)
[2012-12-07 21:24:44,013] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3962, 
queueTime:1, localTime:3919, remoteTime:41, sendTime:1 
(kafka.network.RequestChannel$)
[2012-12-07 21:25:06,157] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4587, 
queueTime:1, localTime:1, remoteTime:4585, sendTime:0 
(kafka.network.RequestChannel$)
[2012-12-07 21:25:06,162] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:4276, 
queueTime:1, localTime:6, remoteTime:4268, sendTime:1 
(kafka.network.RequestChannel$)
[2012-12-07 21:25:28,943] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3986, 
queueTime:1, localTime:1, remoteTime:3984, sendTime:0 
(kafka.network.RequestChannel$)
[2012-12-07 21:25:28,953] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.48-port_9092: FetchRequest:3940, 
queueTime:3929, localTime:6, remoteTime:5, sendTime:0 
(kafka.network.RequestChannel$)
[2012-12-07 21:25:51,374] TRACE Completed re

[jira] [Updated] (KAFKA-666) Fetch requests from the replicas take several seconds to complete on the leader

2012-12-07 Thread Neha Narkhede (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neha Narkhede updated KAFKA-666:


Description: 
I've seen that fetch requests from the replicas take several seconds to 
complete. The nature of the latency breakdown varies: sometimes they spend too 
long sitting in the request/response queue, and sometimes the local/remote time 
is too large -

[2012-12-07 20:14:51,233] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.45-port_9092: FetchRequest:9967, 
queueTime:0, localTime:4, remoteTime:9963, sendTime:0 
(kafka.network.RequestChannel$)
[2012-12-07 20:14:51,236] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.45-port_9092: FetchRequest:9967, 
queueTime:1, localTime:3, remoteTime:9963, sendTime:0 
(kafka.network.RequestChannel$)
[2012-12-07 20:14:51,239] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.45-port_9092: FetchRequest:9966, 
queueTime:0, localTime:2, remoteTime:9964, sendTime:0 
(kafka.network.RequestChannel$)
[2012-12-07 20:16:07,643] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.45-port_9092: FetchRequest:9996, 
queueTime:1, localTime:2, remoteTime:9992, sendTime:1 
(kafka.network.RequestChannel$)
[2012-12-07 20:16:07,645] TRACE Completed request with correlation id 0 and 
client replica-fetcher-host_172.20.72.45-port_9092: FetchRequest:, 
queueTime:0, localTime:4, remoteTime:9994, sendTime:1 
(kafka.network.RequestChannel$)
[remaining TRACE entries repeat the log quoted in the issue-creation message above; trace truncated]

[jira] [Commented] (KAFKA-644) System Test should run properly with mixed File System Pathname

2012-12-07 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526851#comment-13526851
 ] 

Neha Narkhede commented on KAFKA-644:
-

+1. Thanks for the patch !

> System Test should run properly with mixed File System Pathname
> ---
>
> Key: KAFKA-644
> URL: https://issues.apache.org/jira/browse/KAFKA-644
> Project: Kafka
>  Issue Type: Task
>Reporter: John Fung
>Assignee: John Fung
>  Labels: replication-testing
> Attachments: kafka-644-v1.patch
>
>
> Currently, System Test assumes that all the entities (ZK, Broker, Producer, 
> Consumer) are running in machines which have the same File System Pathname as 
> the machine in which the System Test scripts are running.
> Usually, our own local boxes would be like /home/kafka/. . .
> and remote boxes may look like /mnt/. . .
> In this case, System Test won't work properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (KAFKA-644) System Test should run properly with mixed File System Pathname

2012-12-07 Thread Neha Narkhede (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neha Narkhede updated KAFKA-644:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

> System Test should run properly with mixed File System Pathname
> ---
>
> Key: KAFKA-644
> URL: https://issues.apache.org/jira/browse/KAFKA-644
> Project: Kafka
>  Issue Type: Task
>Reporter: John Fung
>Assignee: John Fung
>  Labels: replication-testing
> Attachments: kafka-644-v1.patch
>
>
> Currently, System Test assumes that all the entities (ZK, Broker, Producer, 
> Consumer) are running in machines which have the same File System Pathname as 
> the machine in which the System Test scripts are running.
> Usually, our own local boxes would be like /home/kafka/. . .
> and remote boxes may look like /mnt/. . .
> In this case, System Test won't work properly.



[jira] [Closed] (KAFKA-644) System Test should run properly with mixed File System Pathname

2012-12-07 Thread Neha Narkhede (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neha Narkhede closed KAFKA-644.
---


> System Test should run properly with mixed File System Pathname
> ---
>
> Key: KAFKA-644
> URL: https://issues.apache.org/jira/browse/KAFKA-644
> Project: Kafka
>  Issue Type: Task
>Reporter: John Fung
>Assignee: John Fung
>  Labels: replication-testing
> Attachments: kafka-644-v1.patch
>
>
> Currently, System Test assumes that all the entities (ZK, Broker, Producer, 
> Consumer) are running in machines which have the same File System Pathname as 
> the machine in which the System Test scripts are running.
> Usually, our own local boxes would be like /home/kafka/. . .
> and remote boxes may look like /mnt/. . .
> In this case, System Test won't work properly.



[jira] [Updated] (KAFKA-597) Refactor KafkaScheduler

2012-12-07 Thread Jay Kreps (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Kreps updated KAFKA-597:


Attachment: KAFKA-597-v4.patch

Patch v4. 
- Rebased
- Makes use of thread factory
- Fixed broken scaladoc

> Refactor KafkaScheduler
> ---
>
> Key: KAFKA-597
> URL: https://issues.apache.org/jira/browse/KAFKA-597
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.1
>Reporter: Jay Kreps
>Priority: Minor
> Attachments: KAFKA-597-v1.patch, KAFKA-597-v2.patch, 
> KAFKA-597-v3.patch, KAFKA-597-v4.patch
>
>
> It would be nice to cleanup KafkaScheduler. Here is what I am thinking
> Extract the following interface:
> trait Scheduler {
>   def startup()
>   def schedule(fun: () => Unit, name: String, delayMs: Long = 0, periodMs: 
> Long): Scheduled
>   def shutdown(interrupt: Boolean = false)
> }
> class Scheduled {
>   def lastExecution: Long
>   def cancel()
> }
> We would have two implementations, KafkaScheduler and  MockScheduler. 
> KafkaScheduler would be a wrapper for ScheduledThreadPoolExecutor. 
> MockScheduler would only allow manual time advancement rather than using the 
> system clock, we would switch unit tests over to this.
> This change would be different from the existing scheduler in the following 
> ways:
> 1. Would not return a ScheduledFuture (since this is useless)
> 2. shutdown() would be a blocking call. The current shutdown calls don't 
> really do what people want.
> 3. We would remove the daemon thread flag, as I don't think it works.
> 4. It returns an object which lets you cancel the job or get the last 
> execution time.
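The proposed trait maps almost directly onto Java's ScheduledThreadPoolExecutor. A minimal Java sketch of what such a wrapper could look like (class and method names here are illustrative assumptions, not the actual KAFKA-597 patch):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Handle returned to callers instead of a raw ScheduledFuture: it only
// exposes cancellation and the last execution time (point 1 and 4 above).
class Scheduled {
    volatile ScheduledFuture<?> future;
    final AtomicLong lastExecution = new AtomicLong(-1);

    long lastExecution() { return lastExecution.get(); }
    void cancel() { if (future != null) future.cancel(false); }
}

class SimpleScheduler {
    private ScheduledExecutorService executor;

    void startup() { executor = Executors.newSingleThreadScheduledExecutor(); }

    // periodMs <= 0 means a one-shot task.
    Scheduled schedule(Runnable fun, long delayMs, long periodMs) {
        Scheduled handle = new Scheduled();
        Runnable wrapped = () -> {
            handle.lastExecution.set(System.currentTimeMillis());
            fun.run();
        };
        handle.future = (periodMs > 0)
            ? executor.scheduleAtFixedRate(wrapped, delayMs, periodMs, TimeUnit.MILLISECONDS)
            : executor.schedule(wrapped, delayMs, TimeUnit.MILLISECONDS);
        return handle;
    }

    // Blocking shutdown, per point 2 of the proposal.
    void shutdown(boolean interrupt) throws InterruptedException {
        if (interrupt) executor.shutdownNow(); else executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

Note the ScheduledFuture stays internal to the handle, so callers never depend on it.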



[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526876#comment-13526876
 ] 

Joel Koshy commented on KAFKA-664:
--

To clarify, the map itself shouldn't grow indefinitely, right? That is, if there 
are no new partitions, the number of keys should stay the same. I think the issue 
is that expired requests (for a key) are not removed from the list of 
outstanding requests for that key.
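A toy model of the suspected behaviour (an assumed simplification, not Kafka's actual RequestPurgatory code): expiration only flags a request as satisfied, and the per-key watcher list is only pruned when a later request for the same key triggers the satisfied-collection pass, so a key that stops receiving produce requests accumulates entries:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TinyPurgatory {
    static final class Req { boolean satisfied; }

    final Map<String, List<Req>> watchersForKey = new HashMap<>();

    // A delayed request starts watching a key (e.g. a topic/partition).
    void watch(String key, Req r) {
        watchersForKey.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
    }

    // The expiry path only flags the request; it does not touch the list.
    void expire(Req r) { r.satisfied = true; }

    // Pruning happens only here, i.e. when a new produce request arrives
    // for this key. A quiet partition never reaches this code.
    int collectSatisfied(String key) {
        List<Req> watchers = watchersForKey.getOrDefault(key, new ArrayList<>());
        int before = watchers.size();
        watchers.removeIf(r -> r.satisfied);
        return before - watchers.size();
    }

    int pending(String key) {
        return watchersForKey.getOrDefault(key, new ArrayList<>()).size();
    }
}
```

With steady replica-fetch traffic on a quiet partition, `pending` grows without bound until `collectSatisfied` finally runs.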

> Kafka server threads die due to OOME during long running test
> -
>
> Key: KAFKA-664
> URL: https://issues.apache.org/jira/browse/KAFKA-664
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Neha Narkhede
>Assignee: Jay Kreps
>Priority: Blocker
>  Labels: bugs
> Fix For: 0.8
>
> Attachments: thread-dump.log
>
>
> I set up a Kafka cluster with 5 brokers (JVM memory 512M) and set up a long 
> running producer process that sends data to 100s of partitions continuously 
> for ~15 hours. After ~4 hours of operation, a few server threads (acceptor and 
> processor) exited due to OOME -
> [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-acceptor': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-processor-9092-1': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, 
> session 0x13afd0753870103 has expired, closing socket connection 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) 
> (org.I0Itec.zkclient.ZkClient)
> [2012-12-07 08:24:46,344] INFO Initiating client connection, 
> connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913
>  sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 
> (org.apache.zookeeper.ZooKeeper)
> [2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread 
> 'kafka-request-handler-0': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:08,739] INFO Opening socket connection to server 
> eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:14,221] INFO Socket connection established to 
> eat1-app311.corp/172.20.72.75:12913, initiating session 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from 
> server in 3722ms for sessionid 0x0, closing socket connection and attempting 
> reconnect (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> It seems like it runs out of memory while trying to read the producer 
> request, but it's unclear so far. 



[jira] [Updated] (KAFKA-636) Make log segment delete asynchronous

2012-12-07 Thread Jay Kreps (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Kreps updated KAFKA-636:


Attachment: KAFKA-636-v1.patch

This patch implements asynchronous delete in the log.

To do this Log.scala now requires a scheduler to be used for scheduling the 
deletions.

The deletion works as described above.

The locking for segment deletion can now be more aggressive: since the file 
renames are assumed to be fast, they can be done inside the lock.

As part of testing this I also found a problem with MockScheduler, namely that 
it is not reentrant. That is, if scheduled tasks themselves create scheduled 
tasks, it misbehaves. To fix this I rewrote MockScheduler to use a priority 
queue. The code is simpler and more correct, since it now performs all 
executions in the correct order too.
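The reentrancy fix can be illustrated with a small Java model (an assumed design sketch, not the actual MockScheduler patch): a priority queue ordered by execution time plus a manually advanced clock, so a task that schedules further tasks during execution is still picked up in order:

```java
import java.util.PriorityQueue;

class MockClockScheduler {
    static final class Task implements Comparable<Task> {
        final long when;
        final Runnable fun;
        Task(long when, Runnable fun) { this.when = when; this.fun = fun; }
        public int compareTo(Task other) { return Long.compare(when, other.when); }
    }

    private final PriorityQueue<Task> queue = new PriorityQueue<>();
    private long now = 0;

    void schedule(Runnable fun, long delayMs) { queue.add(new Task(now + delayMs, fun)); }

    // Advance the mock clock and run everything that is due, in time order.
    // Polling the queue (instead of iterating over a snapshot) makes this
    // reentrancy-safe: tasks enqueued by running tasks are seen immediately.
    void advance(long ms) {
        now += ms;
        while (!queue.isEmpty() && queue.peek().when <= now) {
            queue.poll().fun.run();
        }
    }
}
```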

> Make log segment delete asynchronous
> 
>
> Key: KAFKA-636
> URL: https://issues.apache.org/jira/browse/KAFKA-636
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jay Kreps
>Assignee: Jay Kreps
> Attachments: KAFKA-636-v1.patch
>
>
> We have a few corner-case bugs around delete of segment files:
> 1. It is possible for delete and truncate to kind of cross streams and end up 
> with a case where you have no segments.
> 2. Reads on the log have no locking (which is good) but as a result deleting 
> a segment that is being read will result in some kind of I/O exception.
> 3. We can't easily fix the synchronization problems without deleting files 
> inside the log's write lock. This can be a problem as deleting a 2GB segment 
> can take a couple of seconds even on an unloaded system.
> The proposed fix for these problems is to make file removal asynchronous 
> using the following scheme as the new delete scheme:
> 1. Immediately remove the file from the segment map and rename the file from X to 
> X.deleted (e.g. 000.log to 00.log.deleted). We think renaming a file 
> will not impact reads since the file is already open and hence the name is 
> irrelevant. This will always be O(1) and can be done inside the write lock.
> 2. Schedule a future operation to delete the file. The time to wait would be 
> configurable but we would just default it to 60 seconds and probably no one 
> would ever change it.
> 3. On startup we would delete any files with the .deleted suffix as they 
> would have been pending deletes that didn't take place.
> I plan to do this soon working against the refactored log (KAFKA-521). We can 
> opt to back port the patch for 0.8 if we are feeling daring.
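The scheme above can be sketched in Java (an illustrative assumption; the real change lands in Log.scala against its scheduler): step 1 is an O(1) rename to a `.deleted` name, step 2 schedules the physical delete, and startup would sweep any leftover `*.deleted` files:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class AsyncSegmentDeleter {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    // Step 1 (inside the write lock in the real log): O(1) rename, so readers
    // holding the already-open file handle are unaffected.
    // Step 2: schedule the physical delete for later.
    Path asyncDelete(Path segment, long delayMs) throws IOException {
        Path renamed = segment.resolveSibling(segment.getFileName() + ".deleted");
        Files.move(segment, renamed, StandardCopyOption.ATOMIC_MOVE);
        scheduler.schedule(() -> {
            try {
                Files.deleteIfExists(renamed);
            } catch (IOException ignored) {
                // best effort; startup sweeps *.deleted leftovers anyway (step 3)
            }
        }, delayMs, TimeUnit.MILLISECONDS);
        return renamed;
    }

    void shutdown() throws InterruptedException {
        scheduler.shutdown();  // pending delayed deletes still run by default
        scheduler.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```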



[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526883#comment-13526883
 ] 

Joel Koshy commented on KAFKA-664:
--

Okay, I'm slightly confused. Even on expiration the request is marked as 
satisfied. So even if it is not removed from the watcher's list during 
expiration, it will be removed on the next call to collectSatisfiedRequests, 
which in this case will be when the next produce request arrives for that 
partition. That means this should only be due to low-volume partitions that 
are no longer growing, i.e., the replica fetcher would keep issuing fetch 
requests that keep expiring but never get removed from the list of pending 
requests in watchersFor(the-low-volume-partition).

> Kafka server threads die due to OOME during long running test
> -
>
> Key: KAFKA-664
> URL: https://issues.apache.org/jira/browse/KAFKA-664
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Neha Narkhede
>Assignee: Jay Kreps
>Priority: Blocker
>  Labels: bugs
> Fix For: 0.8
>
> Attachments: thread-dump.log
>
>
> I set up a Kafka cluster with 5 brokers (JVM memory 512M) and set up a long 
> running producer process that sends data to 100s of partitions continuously 
> for ~15 hours. After ~4 hours of operation, a few server threads (acceptor and 
> processor) exited due to OOME -
> [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-acceptor': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-processor-9092-1': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, 
> session 0x13afd0753870103 has expired, closing socket connection 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) 
> (org.I0Itec.zkclient.ZkClient)
> [2012-12-07 08:24:46,344] INFO Initiating client connection, 
> connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913
>  sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 
> (org.apache.zookeeper.ZooKeeper)
> [2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread 
> 'kafka-request-handler-0': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:08,739] INFO Opening socket connection to server 
> eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:14,221] INFO Socket connection established to 
> eat1-app311.corp/172.20.72.75:12913, initiating session 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from 
> server in 3722ms for sessionid 0x0, closing socket connection and attempting 
> reconnect (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> It seems like it runs out of memory while trying to read the producer 
> request, but it's unclear so far. 



[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Jay Kreps (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526887#comment-13526887
 ] 

Jay Kreps commented on KAFKA-664:
-

Another issue is that we are saving the full producer request in memory for as 
long as it is in purgatory. Not sure that is causing this, but that is pretty 
bad.

> Kafka server threads die due to OOME during long running test
> -
>
> Key: KAFKA-664
> URL: https://issues.apache.org/jira/browse/KAFKA-664
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Neha Narkhede
>Assignee: Jay Kreps
>Priority: Blocker
>  Labels: bugs
> Fix For: 0.8
>
> Attachments: thread-dump.log
>
>
> I set up a Kafka cluster with 5 brokers (JVM memory 512M) and set up a long 
> running producer process that sends data to 100s of partitions continuously 
> for ~15 hours. After ~4 hours of operation, a few server threads (acceptor and 
> processor) exited due to OOME -
> [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-acceptor': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-processor-9092-1': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, 
> session 0x13afd0753870103 has expired, closing socket connection 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) 
> (org.I0Itec.zkclient.ZkClient)
> [2012-12-07 08:24:46,344] INFO Initiating client connection, 
> connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913
>  sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 
> (org.apache.zookeeper.ZooKeeper)
> [2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread 
> 'kafka-request-handler-0': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:08,739] INFO Opening socket connection to server 
> eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:14,221] INFO Socket connection established to 
> eat1-app311.corp/172.20.72.75:12913, initiating session 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from 
> server in 3722ms for sessionid 0x0, closing socket connection and attempting 
> reconnect (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> It seems like it runs out of memory while trying to read the producer 
> request, but it's unclear so far. 



[jira] [Updated] (KAFKA-644) System Test should run properly with mixed File System Pathname

2012-12-07 Thread John Fung (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Fung updated KAFKA-644:


Attachment: kafka-644-v2.patch

Uploaded kafka-644-v2.patch which supports the property "auto_create_topic"

> System Test should run properly with mixed File System Pathname
> ---
>
> Key: KAFKA-644
> URL: https://issues.apache.org/jira/browse/KAFKA-644
> Project: Kafka
>  Issue Type: Task
>Reporter: John Fung
>Assignee: John Fung
>  Labels: replication-testing
> Attachments: kafka-644-v1.patch, kafka-644-v2.patch
>
>
> Currently, System Test assumes that all the entities (ZK, Broker, Producer, 
> Consumer) are running in machines which have the same File System Pathname as 
> the machine in which the System Test scripts are running.
> Usually, our own local boxes would be like /home/kafka/. . .
> and remote boxes may look like /mnt/. . .
> In this case, System Test won't work properly.



[jira] [Updated] (KAFKA-646) Provide aggregate stats at the high level Producer and ZookeeperConsumerConnector level

2012-12-07 Thread Swapnil Ghike (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swapnil Ghike updated KAFKA-646:


Attachment: kafka-646-patch-num1-v1.patch

This patch has a bunch of refactoring changes and a couple of new additions. 

Addressing Jun's comments: 
These are all great catches! Thanks for being so thorough.

60. By default, metrics-core will return an existing metric object of the same 
name using a getOrCreate() like functionality. As discussed offline, we should 
fail the clients that use an already registered clientId name. We will need to 
create two objects that contain hashmaps to record the existing producer and 
consumer clientIds and methods to throw an exception if a client attempts to 
use an existing clientId. I worked on this change a bit, but it breaks a lot of 
our unit tests (about half) and the refactoring will take some time. Hence, I 
think it will be better if I submit a patch for all other changes and create 
another patch for this issue under this jira. Until then we can keep this jira 
open.
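A hypothetical sketch of the intended behaviour in point 60 (the names `ClientIdRegistry` and `register` are made up for illustration): instead of metrics-core's get-or-create silently sharing metric objects between two clients with the same name, the second client to claim a clientId is rejected:

```java
import java.util.concurrent.ConcurrentHashMap;

class ClientIdRegistry {
    private final ConcurrentHashMap<String, Boolean> inUse = new ConcurrentHashMap<>();

    // putIfAbsent is atomic, so two clients racing on the same id
    // cannot both succeed.
    void register(String clientId) {
        if (inUse.putIfAbsent(clientId, Boolean.TRUE) != null) {
            throw new IllegalArgumentException(
                "clientId '" + clientId + "' is already registered");
        }
    }

    void unregister(String clientId) { inUse.remove(clientId); }
}
```

A client that shuts down cleanly would unregister, freeing the id for reuse.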

61. For recording stats about all topics, I am now using a string "All.Topics". 
Since '.' is not allowed in the legal character set for topic names, this will 
differentiate from a topic named AllTopics.

62. Yes, we should validate groupId. Added the functionality and a unit test. 
It has the same validation rules as ClientId.

63. A metric name is something like (clientId + topic + some string) and this 
entire string is limited by the filename size. We already allow a topic name to be 
at most 255 bytes long. We could fix max lengths for each of clientId, groupId, 
and topic name so that the metric name never exceeds the filename size. But those 
lengths would be quite arbitrary; perhaps we should skip the check on the length 
of clientId and groupId. 

64. Removed brokerInfo from the clientId used to instantiate 
FetchRequestBuilder.


Refactoring: 
1. Moved validation of clientId to the end of instantiation of ProducerConfig 
and ConsumerConfig. 
- Created static objects ProducerConfig and ConsumerConfig which contain a 
validate() method.

2. Created global *Registry objects in which each high level Producer and 
Consumer can register their *stats objects.
- These objects are registered in the static object only once using 
utils.Pool.getAndMaybePut functionality. 
- This will remove the need to pass *stats objects around the code in 
constructors (I thought having the metrics objects right up in the constructors 
was a bit intrusive, since one doesn't quite always think about the monitoring 
mechanism while instantiating various modules of the program, for example while 
unit testing.)
- Instead of the constructor, each concerned class obtains the *Stats objects 
from the global registry object.
- This cleans up any metrics objects created in the unit tests.
- Special mention: The producer constructors are back to their old selves. 
With clientId validation moved to *Config objects, the intermediate Producer 
constructor that merely separated the parameters of a quadruplet is gone.

3. Created separate files
-  for ProducerStats, ProducerTopicStats, ProducerRequestStats in 
kafka.producer package and for FetchRequestAndResponseStats in kafka.consumer 
package. Thought it was appropriate given that we already had 
ConsumerTopicStats in a separate file, and since the code for metrics had 
increased in size due to addition of *Registry and Aggregated* objects. Added 
comments.
- for objects Topic, ClientId and GroupId in kafka.utils package.
- to move the helper case classes ClientIdAndTopic, ClientIdAndBroker to 
kafka.common package. 

4. Renamed a few variables to easier names (anyOldName to "metricId" change).


New additions: 
1. Added two objects to aggregate metrics recorded by SyncProducers and 
SimpleConsumers at the high level Producer and Consumer. 
- For this, changed KafkaTimer to accept a list of Timers. Typically we will 
pass a specificTimer and a globalTimer to this KafkaTimer class. Created a new 
KafkaHistogram in a similar way.
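The "list of timers" idea can be sketched as follows (an assumed shape for illustration, not the actual KafkaTimer code): one measured duration is recorded into both a per-client accumulator and a global aggregate:

```java
import java.util.List;
import java.util.function.Supplier;

class MultiTimer {
    static final class Stat {
        long count;
        long totalNanos;
        void update(long nanos) { count++; totalNanos += nanos; }
    }

    private final List<Stat> stats;

    // Typically constructed as: new MultiTimer(List.of(perClientStat, globalStat))
    MultiTimer(List<Stat> stats) { this.stats = stats; }

    <T> T time(Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            for (Stat s : stats) s.update(elapsed);  // specific and aggregate both see it
        }
    }
}
```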

2. Validation of groupId.


Issues:
1. Initializing the aggregator metrics with default values: For example, let's 
say that a syncProducer could be created (which will register a 
ProducerRequestStats mbean for this syncProducer). However, if no request is 
sent by this syncProducer, then the absence of its data is not reflected in the 
aggregator histogram. For instance, the min requestSize for the syncProducer 
that never sent a request will be 0, but this won't be accurately represented 
in the aggregator histogram. Thus, we need to understand that if the request 
count of a syncProducer is 0, then its data will not be accurately reflected in 
the aggregator histogram.

The question is whether it is possible to inform the aggregator histogram of 
some default values without increasing the request count of any syncProducer or 
the aggregated stats.


Further 

[jira] [Commented] (KAFKA-597) Refactor KafkaScheduler

2012-12-07 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526966#comment-13526966
 ] 

Joel Koshy commented on KAFKA-597:
--

I haven't fully reviewed, but a couple of initial comments:
- I think the javadoc on KafkaScheduler's daemon param is a bit misleading as 
it currently suggests that daemon=true would prevent the VM from shutting down.
- The patch inverts the daemon flag on some of the existing usages of 
KafkaScheduler - i.e., daemon now defaults to true and there are some places 
where daemon was false. We would need to survey these usages and identify 
whether it makes sense to keep them non-daemon or not.
- The other question is on shutdownNow: the previous scheduler allowed the 
relaxed shutdown - i.e., don't interrupt threads that are currently executing. 
This change forces all shutdowns to use shutdownNow. Question is whether there 
are existing tasks that need to complete that would not tolerate an interrupt. 
I'm not sure about that - we'll need to look at existing usages. E.g., 
KafkaServer's kafkaScheduler used the shutdown() method - now it's effectively 
shutdownNow.
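The graceful-versus-forced distinction can be demonstrated with plain java.util.concurrent (a generic illustration, not Kafka code): shutdown() lets an in-flight task finish, while shutdownNow() interrupts it; whether Kafka's tasks tolerate that interrupt is exactly the open question:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class ShutdownDemo {
    // Returns true iff the in-flight task got to finish its work.
    static boolean taskFinished(boolean forceNow) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        AtomicBoolean finished = new AtomicBoolean(false);
        CountDownLatch started = new CountDownLatch(1);
        pool.submit(() -> {
            started.countDown();
            try {
                Thread.sleep(200);   // simulate work in progress
                finished.set(true);
            } catch (InterruptedException e) {
                // shutdownNow() lands here: the work is abandoned
            }
        });
        started.await();             // make sure the task is actually running
        if (forceNow) pool.shutdownNow(); else pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return finished.get();
    }
}
```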


> Refactor KafkaScheduler
> ---
>
> Key: KAFKA-597
> URL: https://issues.apache.org/jira/browse/KAFKA-597
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.1
>Reporter: Jay Kreps
>Priority: Minor
> Attachments: KAFKA-597-v1.patch, KAFKA-597-v2.patch, 
> KAFKA-597-v3.patch, KAFKA-597-v4.patch
>
>
> It would be nice to cleanup KafkaScheduler. Here is what I am thinking
> Extract the following interface:
> trait Scheduler {
>   def startup()
>   def schedule(fun: () => Unit, name: String, delayMs: Long = 0, periodMs: 
> Long): Scheduled
>   def shutdown(interrupt: Boolean = false)
> }
> class Scheduled {
>   def lastExecution: Long
>   def cancel()
> }
> We would have two implementations, KafkaScheduler and  MockScheduler. 
> KafkaScheduler would be a wrapper for ScheduledThreadPoolExecutor. 
> MockScheduler would only allow manual time advancement rather than using the 
> system clock, we would switch unit tests over to this.
> This change would be different from the existing scheduler in the following 
> ways:
> 1. Would not return a ScheduledFuture (since this is useless)
> 2. shutdown() would be a blocking call. The current shutdown calls don't 
> really do what people want.
> 3. We would remove the daemon thread flag, as I don't think it works.
> 4. It returns an object which lets you cancel the job or get the last 
> execution time.



[jira] [Updated] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neha Narkhede updated KAFKA-664:


Attachment: watchersForKey.png
kafka-664-draft.patch

The problem was ever-increasing requests in the watchersForKey map. Please look 
at the attached graph.
This can happen for very low-volume topics: the replica fetcher requests 
keep entering this map, and since there are no more produce requests coming for 
those topics/partitions, no one ever removes those requests from the map.

With Joel's help, I hacked RequestPurgatory to force the cleanup of 
expired/satisfied requests by the expiry thread inside purgeSatisfied. Of 
course, a better solution is re-designing the purgatory data structure to point 
from the queue to the map, but that is a bigger change. I just want to get 
around this issue and continue performance testing.
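The workaround amounts to having the expiry thread do a global sweep. A toy Java model of such a purgeSatisfied pass (an assumed simplification, not the actual draft patch):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class SweepingPurgatory {
    static final class Req { boolean satisfied; }

    final Map<String, List<Req>> watchersForKey = new HashMap<>();

    void watch(String key, Req r) {
        watchersForKey.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
    }

    // What the hacked expiry thread does: walk every watcher list and drop
    // requests that are already satisfied (or expired), instead of waiting
    // for the next produce request on that key.
    int purgeSatisfied() {
        int purged = 0;
        for (List<Req> watchers : watchersForKey.values()) {
            Iterator<Req> it = watchers.iterator();
            while (it.hasNext()) {
                if (it.next().satisfied) { it.remove(); purged++; }
            }
        }
        return purged;
    }
}
```

This keeps quiet partitions from pinning their stale fetch requests in memory, at the cost of a periodic full scan.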


> Kafka server threads die due to OOME during long running test
> -
>
> Key: KAFKA-664
> URL: https://issues.apache.org/jira/browse/KAFKA-664
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Neha Narkhede
>Assignee: Jay Kreps
>Priority: Blocker
>  Labels: bugs
> Fix For: 0.8
>
> Attachments: kafka-664-draft.patch, thread-dump.log, 
> watchersForKey.png
>
>
> I set up a Kafka cluster with 5 brokers (JVM memory 512M) and set up a long 
> running producer process that sends data to 100s of partitions continuously 
> for ~15 hours. After ~4 hours of operation, a few server threads (acceptor and 
> processor) exited due to OOME -
> [2012-12-07 08:24:44,355] ERROR OOME with size 1700161893 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-acceptor': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:44,356] ERROR Uncaught exception in thread 
> 'kafka-processor-9092-1': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:24:46,344] INFO Unable to reconnect to ZooKeeper service, 
> session 0x13afd0753870103 has expired, closing socket connection 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:24:46,344] INFO zookeeper state changed (Expired) 
> (org.I0Itec.zkclient.ZkClient)
> [2012-12-07 08:24:46,344] INFO Initiating client connection, 
> connectString=eat1-app309.corp:12913,eat1-app310.corp:12913,eat1-app311.corp:12913,eat1-app312.corp:12913,eat1-app313.corp:12913
>  sessionTimeout=15000 watcher=org.I0Itec.zkclient.ZkClient@19202d69 
> (org.apache.zookeeper.ZooKeeper)
> [2012-12-07 08:24:55,702] ERROR OOME with size 2001040997 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:01,192] ERROR Uncaught exception in thread 
> 'kafka-request-handler-0': (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:08,739] INFO Opening socket connection to server 
> eat1-app311.corp/172.20.72.75:12913 (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:14,221] INFO Socket connection established to 
> eat1-app311.corp/172.20.72.75:12913, initiating session 
> (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:17,943] INFO Client session timed out, have not heard from 
> server in 3722ms for sessionid 0x0, closing socket connection and attempting 
> reconnect (org.apache.zookeeper.ClientCnxn)
> [2012-12-07 08:25:19,805] ERROR error in loggedRunnable (kafka.utils.Utils$)
> java.lang.OutOfMemoryError: Java heap space
> [2012-12-07 08:25:23,528] ERROR OOME with size 1853095936 
> (kafka.network.BoundedByteBufferReceive)
> java.lang.OutOfMemoryError: Java heap space
> It seems like it runs out of memory while trying to read the producer 
> request, but it's unclear so far. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Neha Narkhede (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526973#comment-13526973
 ] 

Neha Narkhede edited comment on KAFKA-664 at 12/8/12 1:56 AM:
--

The problem was ever-increasing requests in the watchersForKey map. Please look 
at the attached graph. In merely 40 minutes of running the broker, the number 
of requests in the purgatory map shot up to 4 million.
This can happen for very low volume topics since the replica fetcher requests 
keep entering this map, and since there are no more produce requests coming for 
those topics/partitions, no one ever removes those requests from the map. 

With Joel's help, I hacked RequestPurgatory to force the expiry thread to clean 
up expired/satisfied requests inside purgeSatisfied. Of course, a better 
solution is to redesign the purgatory data structure to point from the queue to 
the map, but that is a bigger change. I just want to get around this issue and 
continue performance testing.


  was (Author: nehanarkhede):
The problem was ever increasing requests in the watchersForKey map. Please 
look at the graph attached.
This can happen for very low volume topics since the replica fetcher requests 
keep entering this map, and since there are no more produce requests coming for 
those topics/partitions, no one ever removes those requests from the map. 

With Joel's help, hacked RequestPurgatory to force the cleanup of 
expired/satisfied requests by the expiry thread inside purgeSatisfied. Of 
course, a better solution is re-designing the purgatory data structure to point 
from the queue to the map, but that is a bigger change. I just want to get 
around this issue and continue performance testing.

  


[jira] [Commented] (KAFKA-646) Provide aggregate stats at the high level Producer and ZookeeperConsumerConnector level

2012-12-07 Thread Swapnil Ghike (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527019#comment-13527019
 ] 

Swapnil Ghike commented on KAFKA-646:
-

Actually I just realized that the Aggregated*Stats objects that I have created 
are not consistent with the way "All.Topics-MessageRate" is measured. It is 
possible to measure "All.Brokers-producerRequestSize" in a similar way in 
ProducerRequestStats.
But it's not possible to measure "All.Brokers-ProduceRequestRateAndTimeMs" in 
the same manner since we use a timer block. 

To make everything look consistent, I can delete the aggregator objects from my 
v1 patch and create a KafkaMeter class that accepts a list of meters. I will 
upload another version of the patch.



> Provide aggregate stats at the high level Producer and 
> ZookeeperConsumerConnector level
> ---
>
> Key: KAFKA-646
> URL: https://issues.apache.org/jira/browse/KAFKA-646
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Swapnil Ghike
>Assignee: Swapnil Ghike
>Priority: Blocker
>  Labels: bugs
> Fix For: 0.8
>
> Attachments: kafka-646-patch-num1-v1.patch
>
>
> With KAFKA-622, we measure ProducerRequestStats and 
> FetchRequestAndResponseStats at the SyncProducer and SimpleConsumer level 
> respectively. We could also aggregate them in the high level Producer and 
> ZookeeperConsumerConnector level to provide an overall sense of 
> request/response rate/size at the client level. Currently, I am not 
> completely clear about the math that might be necessary for such aggregation 
> or if metrics already provides an API for aggregating stats of the same type.
> We should also address the comments by Jun at KAFKA-622, I am copy pasting 
> them here:
> 60. What happens if we have 2 instances of Consumers with the same clientid in 
> the same jvm? Does one of them fail because it fails to register metrics? 
> Ditto for Producers.
> 61. ConsumerTopicStats: What if a topic is named AllTopics? We used to handle 
> this by adding a - in topic specific stats.
> 62. ZookeeperConsumerConnector: Do we need to validate groupid?
> 63. ClientId: Does the clientid length need to be different from topic length?
> 64. AbstractFetcherThread: When building a fetch request, do we need to pass 
> in brokerInfo as part of the client id? BrokerInfo contains the source broker 
> info and the fetch requests are always made to the source broker.



[jira] [Updated] (KAFKA-597) Refactor KafkaScheduler

2012-12-07 Thread Jay Kreps (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Kreps updated KAFKA-597:


Attachment: KAFKA-597-v5.patch

Thanks, new patch v5 addresses your comments:
- Improved javadoc
- This is actually good. I thought about it a bit: since I am making shutdown 
block, the only time daemon vs non-daemon comes into play is if you don't call 
shutdown. In that case non-daemon threads will prevent garbage collection of 
the scheduler tasks and eventually block shutdown of the JVM, which seems 
unnecessary.
- The change to shutdownNow is not good. It would invoke interrupt on all 
threads, which is too aggressive; better to let them finish. If we end up 
needing to schedule long-running tasks we can invent a new notification 
mechanism. I changed this so that we use normal shutdown instead.
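As a rough illustration of the semantics being described (an assumed sketch around java.util.concurrent, not the actual KafkaScheduler code): daemon worker threads mean a scheduler that is never shut down cannot block JVM exit, while a blocking shutdown() built on ExecutorService.shutdown (rather than shutdownNow) lets in-flight tasks complete without being interrupted.

```java
import java.util.concurrent.*;

// Hedged sketch of the shutdown behaviour discussed above; class and method
// names are illustrative.
class SchedulerSketch {
    private final ScheduledThreadPoolExecutor executor;

    SchedulerSketch(int threads) {
        executor = new ScheduledThreadPoolExecutor(threads, r -> {
            Thread t = new Thread(r, "kafka-scheduler");
            t.setDaemon(true); // don't keep the JVM alive if shutdown() is never called
            return t;
        });
    }

    void schedule(Runnable task, long delayMs, long periodMs) {
        executor.scheduleAtFixedRate(task, delayMs, periodMs, TimeUnit.MILLISECONDS);
    }

    // Blocking shutdown: uses shutdown(), not shutdownNow(), so a currently
    // running task is not interrupted; pending periodic runs are cancelled
    // (the executor's default policy) and we wait for termination.
    void shutdown() throws InterruptedException {
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.DAYS);
    }
}
```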

> Refactor KafkaScheduler
> ---
>
> Key: KAFKA-597
> URL: https://issues.apache.org/jira/browse/KAFKA-597
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.1
>Reporter: Jay Kreps
>Priority: Minor
> Attachments: KAFKA-597-v1.patch, KAFKA-597-v2.patch, 
> KAFKA-597-v3.patch, KAFKA-597-v4.patch, KAFKA-597-v5.patch
>
>
> It would be nice to clean up KafkaScheduler. Here is what I am thinking:
> Extract the following interface:
> trait Scheduler {
>   def startup()
>   def schedule(fun: () => Unit, name: String, delayMs: Long = 0, periodMs: 
> Long): Scheduled
>   def shutdown(interrupt: Boolean = false)
> }
> class Scheduled {
>   def lastExecution: Long
>   def cancel()
> }
> We would have two implementations, KafkaScheduler and  MockScheduler. 
> KafkaScheduler would be a wrapper for ScheduledThreadPoolExecutor. 
> MockScheduler would only allow manual time advancement rather than using the 
> system clock, we would switch unit tests over to this.
> This change would be different from the existing scheduler in the following 
> ways:
> 1. Would not return a ScheduledFuture (since this is useless)
> 2. shutdown() would be a blocking call. The current shutdown calls don't 
> really do what people want.
> 3. We would remove the daemon thread flag, as I don't think it works.
> 4. It returns an object which lets you cancel the job or get the last 
> execution time.
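As a rough sketch of what the proposed MockScheduler might look like (illustrative Java with invented names, not the actual patch): tasks sit in a queue ordered by due time, and nothing runs until the test advances a fake clock, which makes time-dependent unit tests deterministic.

```java
import java.util.*;

// Hypothetical mock: schedule() records tasks against a fake clock, and
// advance() runs everything that falls due, re-enqueueing periodic tasks.
class MockSchedulerSketch {
    private static class Task implements Comparable<Task> {
        long dueMs; final long periodMs; final Runnable fun;
        Task(long dueMs, long periodMs, Runnable fun) {
            this.dueMs = dueMs; this.periodMs = periodMs; this.fun = fun;
        }
        public int compareTo(Task other) { return Long.compare(dueMs, other.dueMs); }
    }

    private long nowMs = 0;
    private final PriorityQueue<Task> tasks = new PriorityQueue<>();

    void schedule(Runnable fun, long delayMs, long periodMs) {
        tasks.add(new Task(nowMs + delayMs, periodMs, fun));
    }

    // Advance the mock clock, running every task that falls due. A periodic
    // task that falls due several times in the window runs once per period.
    void advance(long byMs) {
        nowMs += byMs;
        while (!tasks.isEmpty() && tasks.peek().dueMs <= nowMs) {
            Task t = tasks.poll();
            t.fun.run();
            if (t.periodMs > 0) { t.dueMs += t.periodMs; tasks.add(t); }
        }
    }
}
```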



[jira] [Commented] (KAFKA-664) Kafka server threads die due to OOME during long running test

2012-12-07 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527039#comment-13527039
 ] 

Joel Koshy commented on KAFKA-664:
--

+1 

Some minor comments:
- We can probably remove the WatchersForKey gauge (or maybe keep it until the 
RequestPurgatory refactoring is done).
- While I agree we should definitely refactor the RequestPurgatory to fix the 
inefficient scan, I think this approach is not as hacky as it sounds. On 
fetch request expiration, this now does what would have been done if a produce 
request to that key had arrived; so we can consider the overhead of this 
approach as sending additional produce requests to the affected partition at 
the rate of fetch expirations (which by default is 2/sec). We can optimize a 
bit more by adding a threshold for cleanup, i.e., do the iteration and 
check/removal only if watchers.requests.size > threshold.
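The threshold suggestion might look something like this (a hypothetical, self-contained sketch; the names and the threshold value are invented, not from the patch): the linear scan over a key's watcher list is only paid once the list has grown past a bound.

```java
import java.util.*;

class ThresholdPurgeSketch {
    // Invented bound; a real implementation would tune or configure this.
    static final int PURGE_THRESHOLD = 1000;

    static class Req {
        final boolean satisfied;
        Req(boolean satisfied) { this.satisfied = satisfied; }
    }

    // Scan and drop satisfied requests only when the list is large enough for
    // the cleanup to be worth the iteration cost; returns the removal count.
    static int maybePurge(List<Req> requests) {
        if (requests.size() <= PURGE_THRESHOLD) return 0;
        int before = requests.size();
        requests.removeIf(r -> r.satisfied);
        return before - requests.size();
    }
}
```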





[jira] [Updated] (KAFKA-646) Provide aggregate stats at the high level Producer and ZookeeperConsumerConnector level

2012-12-07 Thread Swapnil Ghike (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swapnil Ghike updated KAFKA-646:


Attachment: kafka-646-patch-num1-v2.patch

Attached patch v2. 

The changes from patch v1:
1. Deleted KafkaHistogram class. (There is also no need for KafkaTimer class.)

2. Deleted the Aggregated*Stats objects. 
- The metrics of SyncProducer and SimpleConsumer for different brokers are 
aggregated together in the same way the producerTopicStats are aggregated 
for "All.Topics". 
- Measuring the time for produce requests and fetch requests is achieved by 
passing a list of timers to KafkaTimer.
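The "list of timers" idea could be sketched like this (a simplified stand-in: StatTimer here is an invented interface, not the metrics library's actual Timer API): one timed block records its elapsed time into both the per-broker timer and the aggregate all-brokers timer.

```java
import java.util.*;
import java.util.function.Supplier;

// StatTimer stands in for a metrics timer; update() receives elapsed nanos.
interface StatTimer { void update(long durationNs); }

class MultiTimer {
    private final List<StatTimer> timers;
    MultiTimer(List<StatTimer> timers) { this.timers = timers; }

    // Run fn once and record the same elapsed time in every registered timer,
    // e.g. a broker-specific timer plus an "all brokers" aggregate.
    <T> T time(Supplier<T> fn) {
        long start = System.nanoTime();
        try {
            return fn.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            for (StatTimer t : timers) t.update(elapsed);
        }
    }
}
```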




[jira] [Comment Edited] (KAFKA-646) Provide aggregate stats at the high level Producer and ZookeeperConsumerConnector level

2012-12-07 Thread Swapnil Ghike (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527078#comment-13527078
 ] 

Swapnil Ghike edited comment on KAFKA-646 at 12/8/12 6:17 AM:
--

Attached patch v2. 

The changes from patch v1:
1. Deleted KafkaHistogram class. (There is also no need for a KafkaMeter class.)

2. Deleted the Aggregated*Stats objects. 
- The metrics of SyncProducer and SimpleConsumer for different brokers are 
aggregated together in the same way the producerTopicStats are aggregated 
for "All.Topics". 
- Measuring the time for produce requests and fetch requests is achieved by 
passing a list of timers to KafkaTimer.

  was (Author: swapnilghike):
Attached patch v2. 

The changes from patch v1:
1. Deleted KafkaHistogram class. (There is also no need for KafkaTImer class.)

2. Deleted the Aggregated*Stats objects. 
- The metrics of SyncProducer and SimpleConsumer for different brokers are 
aggregated together using the same way the producerTopicStats are aggregated 
for "All.Topics". 
- Measuring the time for produce requests and fetch requests is achieved by 
passing a list of timers to KafkaTimer.
  



[jira] [Commented] (KAFKA-597) Refactor KafkaScheduler

2012-12-07 Thread Joel Koshy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527084#comment-13527084
 ] 

Joel Koshy commented on KAFKA-597:
--

On daemon vs non-daemon and shutdown vs shutdownNow: I may be misunderstanding 
the javadoc, but I think that since the default is now daemon=true and you 
switched to using shutdown, VM shutdown can continue even in the middle of a 
scheduled task like checkpointing high watermarks or cleaning up logs. That is, 
there may be scenarios where it makes sense to make the threads non-daemon - 
i.e., set them as non-daemon and use shutdown (not shutdownNow - or use 
shutdownNow and handle InterruptedException properly in the task) to let them 
finish gracefully. Otherwise (iiuc) it seems that if we call shutdown on the 
executor, the VM could exit and simply kill (i.e., abruptly terminate) any 
running task that was started by the executor in one of the (daemon) threads 
from its pool.

Minor comments:
- Line 81 of KafkaScheduler: closing brace is mis-aligned.
- The scaladoc on MockScheduler uses a non-existent schedule variant - i.e., I 
think you intended to add a period < 0, no?


