Re: spark streaming rate limiting from kafka

2014-07-22 Thread Bill Jay
Hi Tobias, I tried to use 10 as numPartitions. The number of executors allocated equals the number of DStreams. Therefore, it seems the parameter does not spread data into many partitions. In order to do that, it seems we have to do repartition. If numPartitions will distribute the data to multiple exec
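The distinction discussed above can be sketched as follows. This is a minimal, hypothetical example for the Spark 1.0-era receiver API; the host names, group id, topic name, and partition counts are all placeholders, and the `Map("my-topic" -> 10)` value controls consumer threads per topic, not downstream parallelism:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaRepartitionSketch")
val ssc = new StreamingContext(conf, Seconds(10))

// A single receiver-based Kafka stream; the per-topic number here does not
// by itself spread the received data across executors.
val stream = KafkaUtils.createStream(
  ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 10))

// Explicitly redistribute the received records across the cluster before
// the heavy stages, as the thread suggests.
val spread = stream.repartition(40)
spread.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()
```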

Re: spark streaming rate limiting from kafka

2014-07-21 Thread Tobias Pfeiffer
Bill, numPartitions means the number of Spark partitions that the data received from Kafka will be split into. It has nothing to do with Kafka partitions, as far as I know. If you create multiple Kafka consumers, it doesn't seem like you can specify which consumer will consume which Kafka partition

Re: spark streaming rate limiting from kafka

2014-07-21 Thread Bill Jay
Hi Tathagata, I am currently creating multiple DStreams to consume from different topics. How can I let each consumer consume from different partitions? I find the following parameters from Spark API: createStream[K, V, U <: Decoder[_], T <: Decoder[_]](jssc: JavaStreamingContext
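The one-DStream-per-topic setup described above might be sketched like this (a hypothetical example; topic names and the ZooKeeper address are placeholders, and `ssc` is assumed to be an existing StreamingContext):

```scala
// One receiver-based input stream per topic, then union them into a single
// DStream for downstream processing.
val topics = Seq("topicA", "topicB", "topicC")
val streams = topics.map { topic =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map(topic -> 1))
}
val unioned = ssc.union(streams)
```

Note that, as Tobias points out in this thread, assignment of Kafka partitions to individual consumers is handled by Kafka's consumer-group rebalancing, not by anything you specify in `createStream`.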

Re: spark streaming rate limiting from kafka

2014-07-19 Thread Bill Jay
Hi Tobias, It seems that repartition can create more executors for the stages following data receiving. However, the number of executors is still far less than what I require (I specify one core for each executor). Based on the index of the executors in the stage, I find many numbers are missing i

Re: spark streaming rate limiting from kafka

2014-07-18 Thread Tathagata Das
Dang! Messed it up again! JIRA: https://issues.apache.org/jira/browse/SPARK-1341 Github PR: https://github.com/apache/spark/pull/945/files On Fri, Jul 18, 2014 at 11:35 AM, Tathagata Das wrote: > Oops, wrong link! > JIRA: https://github.com/apache/spark/pull/945/files > Github PR: https://gith

Re: spark streaming rate limiting from kafka

2014-07-18 Thread Tathagata Das
Oops, wrong link! JIRA: https://github.com/apache/spark/pull/945/files Github PR: https://github.com/apache/spark/pull/945/files On Fri, Jul 18, 2014 at 7:19 AM, Chen Song wrote: > Thanks Tathagata, > > That would be awesome if Spark streaming can support receiving rate in > general. I tried to

Re: spark streaming rate limiting from kafka

2014-07-18 Thread Chen Song
Thanks Tathagata, That would be awesome if Spark streaming can support limiting the receiving rate in general. I tried to explore the link you provided but could not find any specific JIRA related to this? Do you have the JIRA number for this? On Thu, Jul 17, 2014 at 9:21 PM, Tathagata Das wrote: > You ca

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Tathagata Das
You can create multiple Kafka streams to partition your topics across them, which will run as multiple receivers on multiple executors. This is covered in the Spark streaming guide. And for th
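The suggestion above can be sketched as follows. This is an illustrative example, not the guide's exact code; the receiver count, topic name, hosts, and partition numbers are assumptions. Because all streams share one consumer group, Kafka's group rebalancing spreads the topic's partitions across the resulting receivers:

```scala
// Several input streams in the same consumer group -> several receivers,
// each consuming a subset of the topic's Kafka partitions.
val numReceivers = 4
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "shared-group", Map("my-topic" -> 1))
}

// Union the receiver streams, then repartition for processing parallelism.
val allData = ssc.union(kafkaStreams).repartition(16)
```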

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Tobias Pfeiffer
Bill, are you saying that, after repartition(400), you have 400 partitions on one host and the other hosts receive none of the data? Tobias On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay wrote: > I also have an issue consuming from Kafka. When I consume from Kafka, > there are always a single execut

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Bill Jay
I also have an issue consuming from Kafka. When I consume from Kafka, there is always a single executor working on this job. Even if I use repartition, it seems that there is still a single executor. Does anyone have an idea how to add parallelism to this job? On Thu, Jul 17, 2014 at 2:06 PM, Chen

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Chen Song
Thanks Luis and Tobias. On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer wrote: > Hi, > > On Wed, Jul 2, 2014 at 1:57 AM, Chen Song wrote: >> >> * Is there a way to control how far Kafka Dstream can read on >> topic-partition (via offset for example). By setting this to a small >> number, it w

Re: spark streaming rate limiting from kafka

2014-07-01 Thread Tobias Pfeiffer
Hi, On Wed, Jul 2, 2014 at 1:57 AM, Chen Song wrote: > > * Is there a way to control how far Kafka Dstream can read on > topic-partition (via offset for example). By setting this to a small > number, it will force DStream to read less data initially. > Please see the post at http://mail-archives
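On the question of bounding how much the Kafka DStream reads: the receiver API of this era has no per-batch offset cap, but standard Kafka 0.8 high-level consumer properties can be passed through the `kafkaParams` overload of `createStream` to influence where reading starts and how large each fetch is. A hedged sketch, with all values illustrative:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel

// Consumer-side knobs passed through to the Kafka high-level consumer.
val kafkaParams = Map(
  "zookeeper.connect"       -> "zk-host:2181",
  "group.id"                -> "my-group",
  "auto.offset.reset"       -> "largest",  // start at the tip, skip the backlog
  "fetch.message.max.bytes" -> "1048576"   // bound the size of each fetch
)

val limited = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
```

Note `auto.offset.reset` only applies when the group has no committed offsets, and fetch-size limits bound memory per fetch rather than the total read rate; a true receiver rate limit is what the JIRA/PR linked later in this thread (SPARK-1341) adds.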

Re: spark streaming rate limiting from kafka

2014-07-01 Thread Luis Ángel Vicente Sánchez
Maybe reducing the batch duration would help :\ 2014-07-01 17:57 GMT+01:00 Chen Song : > In my use case, if I need to stop spark streaming for a while, data would > accumulate a lot on kafka topic-partitions. After I restart spark streaming > job, the worker's heap will go out of memory on the f
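Luis's suggestion above amounts to shrinking the batch interval so each batch pulls a smaller slice of the accumulated backlog into memory at once. A minimal sketch, with the interval values purely illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ShortBatchSketch")

// A shorter batch interval (e.g. 2s instead of 30s) caps how much backlogged
// Kafka data a single batch can buffer before processing begins.
val ssc = new StreamingContext(conf, Seconds(2))
```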