too many rebalances in my consumer log and fetcher threads are getting stopped

2014-04-18 Thread ankit tyagi
Hi, I am seeing consumer re-balances very frequently and getting socket reconnect exception. log is given below for more insights [2014-04-18 16:02:52.061][kafka.coms.consumer.kafka_topic_coms_esb_prod_coms.coms-timemachine.coms.coms04.snapdeal.com_coms04.snapdeal.com-1397812122323-509d9663_watc

Re: Cluster design distribution and JBOD vs RAID

2014-04-18 Thread Andrew Otto
> BOB> We are using RAID10. It was a requirement from our Unix guys. The > rationale for this was we didn't want to lose just a disk and to have to > rebuild/re-replicate 20TB of data. We haven't experienced any drive failures > that I am aware of. We have had complete server failures, but the d

KAFKA-717

2014-04-18 Thread Seshadri, Balaji
I'm trying to apply the patch from KAFKA-717 to the 0.8.0-BETA candidate and it fails. Error: Patch failed:project/Build.scala Project/Build.scala patch does not apply. Please let me know if you know how to do it. Thanks, Balaji

Reporting Metrics to Apache Kafka and Monitoring with Consumers

2014-04-18 Thread Joe Stein
Hi, we started a new github project for using Kafka as the central point for all application and infrastructure metrics https://github.com/stealthly/metrics-kafka/. We started off with implementing a Metrics Reporter (for Coda Hale's metrics) which produces (reports) the statistics to a Kafka topi

commitOffsets by partition 0.8-beta

2014-04-18 Thread Seshadri, Balaji
Hi, We have a use case at DISH where we need to stop the consumer when we have issues proceeding further to the database or another back end. We update the offset manually for each consumed message. There are, for example, 4 threads consuming from the same connector, and when one thread commits the offset there is

Re: Cluster design distribution and JBOD vs RAID

2014-04-18 Thread Jay Kreps
If you lose one drive in a JBOD setup you will just re-replicate the data on that disk. It is similar to what you would do during RAID repair except that instead of having the data coming 100% from the mirror drives the load will be spread over the rest of the cluster. The real downside of RAID is

KafkaException: This operation cannot be completed on a complete request

2014-04-18 Thread Alex Demidko
Hi, I’m performing a producing load test on two node kafka cluster built from the last 0.8.1 branch sources. I have topic loadtest with replication factor 2 and 256 partitions. Initially both brokers are in ISR and leadership is balanced. When in the middle of the load test one broker was resta

Re: too many rebalances in my consumer log and fetcher threads are getting stopped

2014-04-18 Thread Jun Rao
Have you looked at https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyaretheremanyrebalancesinmyconsumerlog ? Thanks, Jun On Fri, Apr 18, 2014 at 3:58 AM, ankit tyagi wrote: > Hi, > > I am seeing consumer re-balances very frequently and getting socket > reconnect exception. log is gi

Re: KAFKA-717

2014-04-18 Thread Jun Rao
Balaji, 0.8.0-BETA is too old and we are not patching it any more. You probably can try 0.8.0 or wait until 0.8.1.1 is out. Thanks, Jun On Fri, Apr 18, 2014 at 8:26 AM, Seshadri, Balaji wrote: > I'm trying to apply the patch from KAFKA-717 for 0.8.0-BETA candidate and > it fails. > > Error: >

RE: KAFKA-717

2014-04-18 Thread Seshadri, Balaji
Hi Jun, We could not move to 0.8.1 because of issues we have in upgrade. We are still in 0.8-beta1. Balaji -Original Message- From: Jun Rao [mailto:jun...@gmail.com] Sent: Friday, April 18, 2014 11:23 AM To: users@kafka.apache.org Subject: Re: KAFKA-717 Balaji, 0.8.0-BETA is too old

Re: Reporting Metrics to Apache Kafka and Monitoring with Consumers

2014-04-18 Thread Jun Rao
Wow, Joe. That looks great. Could you add it to our wiki? Thanks, Jun On Fri, Apr 18, 2014 at 9:51 AM, Joe Stein wrote: > Hi, we started a new github project for using Kafka as the central point > for all application and infrastructure metrics > https://github.com/stealthly/metrics-kafka/. >

Re: Reporting Metrics to Apache Kafka and Monitoring with Consumers

2014-04-18 Thread Joe Stein
Thanks Jun! I added it to the ecosystem page. On Fri, Apr 18, 2014 at 1:26 PM, Jun Rao wrote: > Wow, Joe. That looks great. Could you add it to our wiki? > > Thanks, > > Jun > > > On Fri, Apr 18, 2014 at 9:51 AM, Joe Stein wrote: > > > Hi, we started a new github project for using Kafka as the

Re: commitOffsets by partition 0.8-beta

2014-04-18 Thread Jun Rao
We don't have the ability to commit offset at the partition level now. This feature probably won't be available until we are done with the consumer rewrite, which is 3-4 months away. If you want to do something now and don't want to use SimpleConsumer, another hacky way is to turn off auto offset commit
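
A minimal sketch of the "turn off auto offset commit" part of that suggestion, assuming the 0.8 high-level consumer and its property names (auto.commit.enable, zookeeper.connect, group.id). The class name, ZooKeeper address, and group name are hypothetical, and the rest of the workaround is cut off in the preview above, so it is not shown here.

```java
import java.util.Properties;

public class ManualCommitConfig {
    // Builds consumer properties with automatic offset commits disabled.
    // Property names are assumed from the 0.8 consumer configuration;
    // verify them against the build you are actually running.
    public static Properties build(String zkConnect, String groupId) {
        Properties props = new Properties();
        props.put("zookeeper.connect", zkConnect);
        props.put("group.id", groupId);
        // With this off, offsets are only written when the application
        // commits them explicitly.
        props.put("auto.commit.enable", "false");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical address and group name, for illustration only.
        Properties props = build("localhost:2181", "example-group");
        System.out.println(props.getProperty("auto.commit.enable"));
        // In a real consumer these properties would be wrapped in
        // kafka.consumer.ConsumerConfig and the application would call
        // commitOffsets() on the resulting ConsumerConnector.
    }
}
```

Note that commitOffsets() on a connector commits for everything that connector consumes, which is the limitation described in this thread; disabling auto commit only controls when that happens.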

Re: KafkaException: This operation cannot be completed on a complete request

2014-04-18 Thread Jun Rao
Any errors from the controller/state-change log? Thanks, Jun On Fri, Apr 18, 2014 at 9:57 AM, Alex Demidko wrote: > Hi, > > I’m performing a producing load test on two node kafka cluster built from > the last 0.8.1 branch sources. I have topic loadtest with replication > factor 2 and 256 parti

RE: commitOffsets by partition 0.8-beta

2014-04-18 Thread Seshadri, Balaji
Thanks Jun. -Original Message- From: Jun Rao [mailto:jun...@gmail.com] Sent: Friday, April 18, 2014 11:37 AM To: users@kafka.apache.org Subject: Re: commitOffsets by partition 0.8-beta We don't have the ability to commit offset at the partition level now. This feature probably won't be

Re: too many rebalances in my consumer log and fetcher threads are getting stopped

2014-04-18 Thread ankit tyagi
I have checked that. There was no full GC at that time. I have attached jstat output too in my mail. I am concerned about why the consumer fetcher threads are getting stopped. On 18 Apr 2014 22:51, "Jun Rao" wrote: > Have you looked at > > https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Why

Re: KafkaException: This operation cannot be completed on a complete request

2014-04-18 Thread Alex Demidko
These are from the alive node: 2014-04-17 21:36:29,276 ERROR [ZkClient-EventThread-15] state.change.logger - Controller 2 epoch 8 encountered error while electing leader for partition [loadtest,143] due to: Preferred replica 1 for partition [loadtest,143] is either not alive or not in the isr. Current le

RE: Controller is not being failed over 0.8.1

2014-04-18 Thread Bello, Bob
Success! I can now failover without getting stuck into the logging loop. I am able to failover between Kafka brokers. (Version 0.8.1) I adjusted the following settings: #(was 3) controller.socket.timeout.ms=9 controlled.shutdown.enable=true controlled.shutdown.max.retries=3 #(was 5000)

Re: KafkaException: This operation cannot be completed on a complete request

2014-04-18 Thread Alexander Demidko
I have tried to reproduce this error, and it occurs pretty consistently when the node is forcefully shut down without graceful termination. When the graceful shutdown was successful, no errors appear in the log when the rebooted instance starts. On Fri, Apr 18, 2014 at 11:17 AM, Alex Demidko wrote: > These

RE: Cluster design distribution and JBOD vs RAID

2014-04-18 Thread Bello, Bob
Yes you would lose the topic/partitions on the drive. I'm not quite sure if Kafka can determine what topics/partitions are missing or not. I suggest you try testing it. - Bob -Original Message- From: Andrew Otto [mailto:ao...@wikimedia.org] Sent: Friday, April 18, 2014 8:36 AM To: use

Re: KAFKA-717

2014-04-18 Thread Guozhang Wang
Hi Balaji, What issues do you have doing the upgrade? On Fri, Apr 18, 2014 at 10:25 AM, Seshadri, Balaji wrote: > Hi Jun, > > We could not move to 0.8.1 because of issues we have in upgrade. > > We are still in 0.8-beta1. > > Balaji > > -Original Message- > From: Jun Rao [mailto:jun...

RE: KAFKA-717

2014-04-18 Thread Seshadri, Balaji
One is the controller not failing over, which I feel we got resolved. The other fix is the ZK node not getting deleted when preferred replica election is triggered. https://issues.apache.org/jira/browse/KAFKA-1365 -Original Message- From: Guozhang Wang [mailto:wangg...@gmail.com] Sent: Friday,

Re: KafkaException: This operation cannot be completed on a complete request

2014-04-18 Thread Guozhang Wang
Hello Alex, I think this is a bug on the FetchResponseSend class. Just to confirm, before the kafka.common.KafkaException: This operation cannot be completed on a complete request. do you see other warn/error logs on the current leader? Guozhang On Fri, Apr 18, 2014 at 11:57 AM, Alexander Dem

Re: KAFKA-717

2014-04-18 Thread Guozhang Wang
KAFKA-1365 has been patched. Could you give it a try again after you have tested it on dev environment? On Fri, Apr 18, 2014 at 1:15 PM, Seshadri, Balaji wrote: > One is the controller not failing over, which I feel we got resolved. > > The other fix is the ZK node not getting deleted when preferred repl

RE: KAFKA-717

2014-04-18 Thread Seshadri, Balaji
Ok we will try that. -Original Message- From: Guozhang Wang [mailto:wangg...@gmail.com] Sent: Friday, April 18, 2014 2:32 PM To: users@kafka.apache.org Subject: Re: KAFKA-717 KAFKA-1365 has been patched. Could you give it a try again after you have tested it on dev environment? On Fri

Re: KafkaException: This operation cannot be completed on a complete request

2014-04-18 Thread Alex Demidko
Last time I saw this exception was when I tried to rebalance leadership with kafka-preferred-replica-election.sh. This is what I got in the logs: LeaderNode: just kafka.common.KafkaException: This operation cannot be completed on a complete request without any other exceptions. RestartedNode: 2014-04-18 2

Re: KafkaException: This operation cannot be completed on a complete request

2014-04-18 Thread Guozhang Wang
When you were shutting down the restarted node, did you see any warn/error in the leader logs? Guozhang On Fri, Apr 18, 2014 at 1:58 PM, Alex Demidko wrote: > Last time I saw this exception was when I tried to rebalance leadership with > kafka-preferred-replica-election.sh. This is what I got in the logs: >

RE: Cluster design distribution and JBOD vs RAID

2014-04-18 Thread Maxime Brugidou
Are you sure about that? Our latest tests show that losing the drive in a JBOD setup makes the broker fail (unfortunately). On Apr 18, 2014 9:01 PM, "Bello, Bob" wrote: > Yes you would lose the topic/partitions on the drive. I'm not quite sure > if Kafka can determine what topics/partitions are

Re: KafkaException: This operation cannot be completed on a complete request

2014-04-18 Thread Alex Demidko
Tried to reproduce this one more time. I was using kill -9 shutdown to test reliability; with graceful termination I haven't seen this problem arise. Leader node started complaining that ReplicaFetcherThread can't connect to other node and that Producer can't send request to terminated node, bu

Re: too many rebalances in my consumer log and fetcher threads are getting stopped

2014-04-18 Thread Jun Rao
Do you see any rebalances? The fetcher was stopped because it was shut down, which typically happens during rebalances. Thanks, Jun On Fri, Apr 18, 2014 at 11:08 AM, ankit tyagi wrote: > I have checked that. There was no full GC at that time. I have attached > jstat output too in my mail. > > I

Re: KafkaException: This operation cannot be completed on a complete request

2014-04-18 Thread Jun Rao
It seems you got OutOfMemoryError, which may leave the broker in a bad state. You probably need a larger heap space. Thanks, Jun On Fri, Apr 18, 2014 at 1:58 PM, Alex Demidko wrote: > Last time I saw this exception was when I tried to rebalance leadership with > kafka-preferred-replica-election.sh