spark.streaming.receiver.maxRate Not taking effect

2015-07-01 Thread Laeeq Ahmed
Hi, I have set "spark.streaming.receiver.maxRate" to "100". My batch interval is 4 sec, but still sometimes there are more than 400 records per batch. I am using Spark 1.2.0. Regards, Laeeq
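
A minimal sketch of how this setting is usually applied (app name and sources are illustrative): the property has to be on the SparkConf before the StreamingContext is created, and it is enforced per receiver, so an application with more than one receiver can legitimately see more than maxRate × batch-interval records in a batch.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("MaxRateExample")
      .set("spark.streaming.receiver.maxRate", "100")  // records per second, per receiver
    val ssc = new StreamingContext(conf, Seconds(4))   // 4 s batches -> ~400 records per receiver per batch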

process independent columns with same operations

2015-05-26 Thread Laeeq Ahmed
Hi guys, I have a Spark Streaming application and I want to increase its performance. Basically it's a design question. My input is like time, s1, s2 ... s23. I have to process these columns with the same operations. I am running on 40 cores on 10 machines. 1. I am trying to get rid of the loop in the mid
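
One way to avoid a per-column loop, sketched under the assumption that each line is comma-separated as time,s1,...,s23 and that the source below is a placeholder: explode every line into (columnIndex, value) pairs, so a single pipeline covers all 23 columns at once.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext(new SparkConf().setAppName("ColumnsSketch"), Seconds(4))
    val lines = ssc.socketTextStream("localhost", 9999)                    // placeholder source
    val byColumn = lines.flatMap { line =>
      val fields = line.split(",")
      fields.tail.zipWithIndex.map { case (v, i) => (i, v.toDouble) }      // drop time, key by column index
    }
    val columnSums = byColumn.reduceByKey(_ + _)                           // same operation for every column, in one pass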

Processing multiple columns in parallel

2015-05-18 Thread Laeeq Ahmed
Hi, Consider I have a tab-delimited text file with 10 columns. Each column is a set of text. I would like to do a word count for each column. In scala, I would do the following RDD transformation and action:  val data = sc.textFile("hdfs://namenode/data.txt")  for(i <- 0 until 9){     data.map
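
A minimal batch sketch, assuming a tab-delimited file of 10 columns where each cell holds space-separated text: emitting ((column, word), 1) once per word counts all columns in a single pass instead of launching one job per column in a loop.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    val sc = new SparkContext(new SparkConf().setAppName("PerColumnWordCount"))
    val data = sc.textFile("hdfs://namenode/data.txt")     // path taken from the question
    val counts = data.flatMap { line =>
      line.split("\t").zipWithIndex.flatMap { case (cell, col) =>
        cell.split(" ").map(word => ((col, word), 1))
      }
    }.reduceByKey(_ + _)                                   // counts per (column, word)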

Re: Some questions on Multiple Streams

2015-04-24 Thread Laeeq Ahmed
Hi, Any comments please? Regards, Laeeq On Friday, April 17, 2015 11:37 AM, Laeeq Ahmed wrote: Hi, I am working with multiple Kafka streams (23 streams) and currently I am processing them separately. I receive one stream from each topic. I have the following questions. 1.    Spark

Re: Equal number of RDD Blocks

2015-04-20 Thread Laeeq Ahmed
And what is the message rate of each topic mate – that was the other part of the required clarifications  From: Laeeq Ahmed [mailto:laeeqsp

Re: Equal number of RDD Blocks

2015-04-20 Thread Laeeq Ahmed
message rate of each topic mate – that was the other part of the required clarifications  From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com] Sent: Monday, April 20, 2015 3:38 PM To: Evo Eftimov; user@spark.apache.org Subject: Re: Equal number of RDD Blocks  Hi,  I have two different topics and two

Re: Equal number of RDD Blocks

2015-04-20 Thread Laeeq Ahmed
in parallel) giving a start of two different DStreams   From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com.INVALID] Sent: Monday, April 20, 2015 3:15 PM To: user@spark.apache.org Subject: Equal number of RDD Blocks  Hi,  I have two streams of data from kafka. How can I make approx. equal number of

Equal number of RDD Blocks

2015-04-20 Thread Laeeq Ahmed
Hi, I have two streams of data from Kafka. How can I make an approx. equal number of RDD blocks on the two executors? Please see the attachment: one worker has 1785 RDD blocks and the other has 26.  Regards, Laeeq

Some questions on Multiple Streams

2015-04-17 Thread Laeeq Ahmed
Hi, I am working with multiple Kafka streams (23 streams) and currently I am processing them separately. I receive one stream from each topic. I have the following questions. 1. The Spark Streaming guide suggests unioning these streams. Is it possible to get statistics of each stream even after t
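
A sketch of one way to union the streams while still getting per-stream figures (topic names, ZooKeeper address and group id are placeholders): tag each record with its topic before the union, then key any statistics by that tag.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("UnionSketch"), Seconds(4))
    val topics = Seq("topic1", "topic2", "topic3")        // 23 topics in practice
    val tagged = topics.map { t =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "mygroup", Map(t -> 1),
        StorageLevel.MEMORY_AND_DISK_SER).map { case (_, v) => (t, v) }   // (topic, record)
    }
    val unioned = ssc.union(tagged)                       // one DStream, records still carry their topic
    val perTopicCounts = unioned.map { case (t, _) => (t, 1L) }.reduceByKey(_ + _)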

Re: Using rdd methods with Dstream

2015-03-14 Thread Laeeq Ahmed
1. DStream.foreachRDD(rdd => { val topK = rdd.top(K); /* use top K */ }) 2. Or, you can use the topK to create another RDD using sc.makeRDD: DStream.transform(rdd => { val topK = rdd.top(K); rdd.context.makeRDD(topK, numPartitions) }) TD On Fri, Mar 13, 2015 at 5:58 AM, Laeeq

Re: Using rdd methods with Dstream

2015-03-13 Thread Laeeq Ahmed
't work as it's not the same as top() On Fri, Mar 13, 2015 at 11:23 AM, Akhil Das wrote: > Like this? > > dstream.repartition(1).mapPartitions(it => it.take(5)) > > Thanks > Best Regards > > On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed > wrote: >

Re: Using rdd methods with Dstream

2015-03-13 Thread Laeeq Ahmed
Hi, repartition is expensive. Looking for an efficient way to do this. Regards, Laeeq On Friday, March 13, 2015 12:24 PM, Akhil Das wrote: Like this? dstream.repartition(1).mapPartitions(it => it.take(5)) Thanks. Best Regards On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed wrote: Hi

Using rdd methods with Dstream

2015-03-13 Thread Laeeq Ahmed
Hi, I normally use dstream.transform whenever I need to use methods which are available in the RDD API but not in the streaming API, e.g. dstream.transform(x => x.sortByKey(true)). But there are other RDD methods which return types other than RDD, e.g. dstream.transform(x => x.top(5)); top here returns Ar
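
A minimal sketch of the workaround shown in the replies above (K and the input source are placeholders): since top() returns an Array rather than an RDD, the result is turned back into a small RDD inside transform so the rest of the pipeline can stay a DStream.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("TopKSketch"), Seconds(4))
    val dstream = ssc.socketTextStream("localhost", 9999).map(_.toDouble)   // placeholder source
    val K = 5
    val topK = dstream.transform { rdd =>
      rdd.context.makeRDD(rdd.top(K), 1)   // wrap the Array back into a (tiny) RDD
    }
    topK.print()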

Efficient Top count in each window

2015-03-12 Thread Laeeq Ahmed
Hi, I have a streaming application where I am doing a top-10 count in each window, which seems slow. Is there an efficient way to do this? val counts = keyAndValues.map(x => math.round(x._3.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4))        val topCounts = counts.repartition(1).map
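
A sketch of one alternative to repartition(1), with the stream shape assumed from the snippet (third field numeric): top(10) aggregates partially on each partition and only ships ten candidates per partition, which is usually cheaper than shuffling the whole window into a single partition and sorting it.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("TopCountSketch"), Seconds(4))
    ssc.checkpoint("hdfs:///tmp/checkpoint")               // required by countByValueAndWindow; path is a placeholder
    // keyAndValues stands in for the stream from the question: (id, time, value)
    val keyAndValues = ssc.socketTextStream("localhost", 9999)
      .map(_.split(",")).map(a => (a(0), a(1), a(2)))
    val counts = keyAndValues.map(x => math.round(x._3.toDouble))
      .countByValueAndWindow(Seconds(4), Seconds(4))
    val topCounts = counts.transform { rdd =>
      rdd.context.makeRDD(rdd.top(10)(Ordering.by((p: (Long, Long)) => p._2)), 1)   // 10 most frequent values per window
    }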

Re: Help with transformWith in SparkStreaming

2015-03-06 Thread Laeeq Ahmed
instantiate RDDs anyway. On Fri, Mar 6, 2015 at 7:06 PM, Laeeq Ahmed wrote: > Hi, > > I am filtering first DStream with the value in second DStream. I also want > to keep the value of second Dstream. I have done the following and having > problem with returning new RDD:

Help with transformWith in SparkStreaming

2015-03-06 Thread Laeeq Ahmed
Hi, I am filtering the first DStream with the value in the second DStream. I also want to keep the value of the second DStream. I have done the following and am having a problem with returning the new RDD: val transformedFileAndTime = fileAndTime.transformWith(anomaly, (rdd1: RDD[(String,String)], rdd2 : RDD[Int]) =
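
One possible shape for this, sketched under the assumption that the second stream (anomaly) carries a small control value per batch: transformWith has to return an RDD, so here the anomaly value is collected to the driver and then used both to filter the first stream and to tag the surviving records.

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("TransformWithSketch"), Seconds(4))
    // placeholder stand-ins for the two DStreams in the question
    val fileAndTime = ssc.socketTextStream("localhost", 9999).map(l => (l, l))
    val anomaly = ssc.socketTextStream("localhost", 9998).map(_.toInt)

    val transformedFileAndTime = fileAndTime.transformWith(anomaly,
      (rdd1: RDD[(String, String)], rdd2: RDD[Int]) => {
        val flag = rdd2.collect().headOption.getOrElse(0)   // assumes the control stream is tiny
        rdd1.filter(_ => flag > 0)                          // keep records only when flagged
            .map { case (f, t) => (f, t, flag) }            // carry the second stream's value along
      })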

Cumulative moving average of stream

2015-02-10 Thread Laeeq Ahmed
Hi, I found the windowed mean as follows:  val counts = myStream.map(x => (x.toDouble,1)).reduceByWindow((a, b) => (a._1 + b._1, a._2 + b._2),(a, b) => (a._1 - b._1, a._2 - b._2), Seconds(2), Seconds(2)) val windowMean = counts.map(x => (x._1.toFloat/x._2)) Now I want to find the cumulative moving avera
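
A minimal sketch of the cumulative part, assuming a single synthetic key, a placeholder source, and a placeholder checkpoint directory: updateStateByKey carries a running (sum, count) across batches, from which the cumulative moving average falls out.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext(new SparkConf().setAppName("CumulativeMeanSketch"), Seconds(2))
    ssc.checkpoint("hdfs:///tmp/checkpoint")                    // required by updateStateByKey
    val myStream = ssc.socketTextStream("localhost", 9999)      // placeholder source, one number per line

    val update: (Seq[(Double, Long)], Option[(Double, Long)]) => Option[(Double, Long)] =
      (values, state) => {
        val (s0, c0) = state.getOrElse((0.0, 0L))
        Some((s0 + values.map(_._1).sum, c0 + values.map(_._2).sum))   // running (sum, count)
      }
    val running = myStream.map(x => ("all", (x.toDouble, 1L))).updateStateByKey(update)
    val cumulativeMean = running.map { case (_, (s, c)) => s / c }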

Re: Saving partial (top 10) DStream windows to hdfs

2015-01-07 Thread Laeeq Ahmed
Hi, It worked out like this. val topCounts = sortedCounts.transform(rdd => rdd.zipWithIndex().filter(x=>x._2 <=10)) Regards, Laeeq On Wednesday, January 7, 2015 5:25 PM, Laeeq Ahmed wrote: Hi Yana, I also think thatval top10 = your_stream.mapPartitions(rdd => rdd.take

Re: Saving partial (top 10) DStream windows to hdfs

2015-01-07 Thread Laeeq Ahmed
top10 = your_stream.mapPartitions(rdd => rdd.take(10)) would result in an RDD containing the top 10 entries per partition -- am I wrong? I am not sure if there is a more efficient way but I think this would work: sortedCounts.zipWithIndex().filter(x=>x._2 <=10).saveAsText On Wed, Jan 7, 2015 at 10:38 AM, Laeeq A

Re: Saving partial (top 10) DStream windows to hdfs

2015-01-07 Thread Laeeq Ahmed
top10 = your_stream.mapPartitions(rdd => rdd.take(10)) Thanks. Best Regards On Mon, Jan 5, 2015 at 11:08 PM, Laeeq Ahmed wrote: Hi, I am counting values in each window and find the top values and want to save only the top 10 frequent values of each window to hdfs rather than all the values. ee

Saving partial (top 10) DStream windows to hdfs

2015-01-05 Thread Laeeq Ahmed
Hi, I am counting values in each window and finding the top values, and I want to save only the top 10 frequent values of each window to HDFS rather than all the values. eegStreams(a) = KafkaUtils.createStream(ssc, zkQuorum, group, Map(args(a) -> 1),StorageLevel.MEMORY_AND_DISK_SER).map(_._2)val count

Host Error on EC2 while accessing hdfs from standalone

2014-12-30 Thread Laeeq Ahmed
Hi, I am using Spark standalone on EC2. I can access ephemeral HDFS from the spark-shell interface but I can't access HDFS in a standalone application. I am using Spark 1.2.0 with Hadoop 2.4.0 and launched the cluster from the ec2 folder on my local machine. In my pom file I have given the hadoop client as 2.4.

Re: Spark Streaming timestamps

2014-07-29 Thread Laeeq Ahmed
Hi Bill, Hope the following is what you need. val zerotime = System.currentTimeMillis() Then in foreach do the following //difference = RDDtimeparameter - zerotime //only to find the constant value to be used later starttime = (RDDtimeparameter - (zerotime + difference)) -  intervalsize endt
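
A rough sketch of the same idea, assuming the per-RDD time parameter is the Time argument that foreachRDD provides and an interval size of 4 s: subtracting the recorded zero time from the batch time gives the relative end of the interval, and subtracting the interval size gives its start.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("BatchTimeSketch"), Seconds(4))
    val dstream = ssc.socketTextStream("localhost", 9999)   // placeholder source
    val zerotime = System.currentTimeMillis()
    val intervalMs = 4000L                                   // assumed batch interval
    dstream.foreachRDD { (rdd, time) =>
      val endtime = time.milliseconds - zerotime             // relative end of this batch
      val starttime = endtime - intervalMs                   // relative start of this batch
      println(s"batch window: [$starttime, $endtime] ms since zerotime")
    }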

Re: Spark Streaming timing considerations

2014-07-21 Thread Laeeq Ahmed
ime  val filteredRDD = windowedRDD.filter(r => r._1 <= currentAppTimeWindowEnd && r._1 > currentAppTimeWindowStart) // filter and retain only the records that fall in the current app-time window  return filteredRDD  }) Hope this helps! TD On Thu, Jul 17, 2014 at 5:54 AM, Laeeq Ahmed wrote:

Re: Spark Streaming timing considerations

2014-07-17 Thread Laeeq Ahmed
._1 < 4)     // filter and retain only the records that fall in the timestamp-based window  return filteredRDD }) Consider the input tuples as (1,23),(1.2,34) . . . . . (3.8,54)(4,413) . . .  where the key is the timestamp. Regards, Laeeq   On Saturday, July 12, 2014 8:29 PM, Laeeq

Re: Spark Streaming timing considerations

2014-07-12 Thread Laeeq Ahmed
RDD filter out the records with the desired application time and process them.  TD On Fri, Jul 11, 2014 at 7:44 AM, Laeeq Ahmed wrote: Hi, > > >In the spark streaming paper, "slack time" has been suggested for delaying the >batch creation in case of external timestamps

Spark Streaming timing considerations

2014-07-11 Thread Laeeq Ahmed
Hi, In the Spark Streaming paper, "slack time" has been suggested for delaying the batch creation in case of external timestamps. I don't see any such option in StreamingContext. Is it available in the API? Also, going through the previous posts, queueStream has been suggested for this. I look

Re: window analysis with Spark and Spark streaming

2014-07-09 Thread Laeeq Ahmed
Hi, For QueueRDD, have a look here. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala Regards, Laeeq, PhD candidate, KTH, Stockholm.   On Sunday, July 6, 2014 10:20 AM, alessandro finamore wrote: On 5 July 2014 23:0

Re: how to convert JavaDStream to JavaRDD

2014-07-09 Thread Laeeq Ahmed
Hi, First use foreachRDD and then use collect, as in DStream.foreachRDD(rdd => {    rdd.collect.foreach({ Also, it's better to use Scala. Less verbose. Regards, Laeeq On Wednesday, July 9, 2014 3:29 PM, Madabhattula Rajesh Kumar wrote: Hi Team, Could you please help me to resolve bel
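
A minimal Scala sketch of the pattern being described (the stream source is a placeholder): foreachRDD exposes each micro-batch as a plain RDD, and collect() brings its contents to the driver, which is only sensible when batches are small.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("CollectSketch"), Seconds(4))
    val dstream = ssc.socketTextStream("localhost", 9999)
    dstream.foreachRDD { rdd =>
      rdd.collect().foreach(println)   // each batch's records, printed on the driver
    }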

Re: controlling the time in spark-streaming

2014-07-09 Thread Laeeq Ahmed
Hi, For QueueRDD, have a look here. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala   Regards, Laeeq On Friday, May 23, 2014 10:33 AM, Mayur Rustagi wrote: Well its hard to use text data as time of input.  But if

Re: Spark Streaming question batch size

2014-07-01 Thread Laeeq Ahmed
above is that the streams just pump data in at different rates -- the first one got 7462 points in the first batch interval, whereas stream2 saw 10493 On Tue, Jul 1, 2014 at 5:34 AM, Laeeq Ahmed wrote: > Hi, > > The window size in a spark streaming is time based which means we have > different

Window Size

2014-07-01 Thread Laeeq Ahmed
Hi, The window size in Spark Streaming is time-based, which means we have a different number of elements in each window. For example, you might have two streams (or more) which are related to each other and you want to compare them over a specific time interval. I am not clear how this will work.

Spark Streaming question batch size

2014-07-01 Thread Laeeq Ahmed
Hi, The window size in Spark Streaming is time-based, which means we have a different number of elements in each window. For example, you might have two streams (or more) which are related to each other and you want to compare them over a specific time interval. I am not clear how this will work.
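
One way this is sometimes handled, sketched under the assumption that each record carries its own timestamp as "timestampMs,value" and that hosts, ports, bucket size (1 s) and window length (8 s) are placeholders: key both streams by a coarse time bucket and join them over the same window, so the comparison is by time interval rather than by element count.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext(new SparkConf().setAppName("TwoStreamCompare"), Seconds(4))
    def parse(line: String): (Long, Double) = {
      val a = line.split(","); (a(0).toLong, a(1).toDouble)   // "timestampMs,value" assumed
    }
    val w = Seconds(8)
    val s1 = ssc.socketTextStream("host1", 9999).map(parse)
      .map { case (ts, v) => (ts / 1000, v) }.window(w, w)    // key by second-sized time bucket
    val s2 = ssc.socketTextStream("host2", 9999).map(parse)
      .map { case (ts, v) => (ts / 1000, v) }.window(w, w)
    val paired = s1.join(s2)                                  // (bucket, (valueFromStream1, valueFromStream2))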

Re: RDD union of a window in Dstream

2014-05-21 Thread Laeeq Ahmed
s give you all elements in the sliding window, and you can compute a mean or variance as you like. You should be able to do this quite efficiently without recomputing each time by using reduceByWindow and a running mean / stdev formula. On Wed, May 21, 2014 at 1:42 PM, Laeeq Ahmed wrote:
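
A sketch of that suggestion, with the source and window sizes as placeholders: accumulating (sum, sum of squares, count) with reduceByWindow gives the mean and standard deviation per window without materialising a union of the RDDs.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("WindowStatsSketch"), Seconds(2))
    val values = ssc.socketTextStream("localhost", 9999).map(_.toDouble)   // placeholder source
    val stats = values.map(x => (x, x * x, 1L))
      .reduceByWindow((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3), Seconds(10), Seconds(10))
    val meanAndSd = stats.map { case (s, sq, n) =>
      val mean = s / n
      (mean, math.sqrt(sq / n - mean * mean))                 // (mean, stdev) for the window
    }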

RDD union of a window in Dstream

2014-05-21 Thread Laeeq Ahmed
Hi, I want to do a union of all RDDs in each window of a DStream. I found DStream.union but haven't seen anything like DStream.windowRDDUnion. Is there any way around it? I want to find the mean and SD of all values which come under each sliding window, for which I need to union all the RDDs in each w

Re: Historical Data as Stream

2014-05-17 Thread Laeeq Ahmed
e the entire data in the file for your analysis. Spark (given enough memory) can process large amounts of data quickly.  On May 15, 2014, at 9:52 AM, Laeeq Ahmed wrote: > Hi, > I have data in a file. Can I read it as Stream in spark? I know it seems odd

Historical Data as Stream

2014-05-16 Thread Laeeq Ahmed
Hi, I have data in a file. Can I read it as a stream in Spark? I know it seems odd to read a file as a stream, but it has practical applications in real life if I can read it as a stream. Are there any other tools which can give this file as a stream to Spark, or do I have to make batches manually, which is not
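
A minimal sketch of the queueStream approach mentioned elsewhere in these threads, with the file path and chunk size as assumptions: the file is split into a queue of RDDs up front and then replayed one RDD per batch interval, which emulates a stream from historical data.

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("FileAsStreamSketch"), Seconds(1))
    val sc = ssc.sparkContext
    val lines = sc.textFile("hdfs://namenode/history.txt").collect()    // assumes the file fits on the driver
    val queue = new mutable.Queue[RDD[String]]()
    lines.grouped(1000).foreach(chunk => queue += sc.makeRDD(chunk))    // one RDD per future batch
    val historicalStream = ssc.queueStream(queue)                       // replayed one RDD per batch interval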

Re: maven for building scala simple program

2014-05-16 Thread Laeeq Ahmed
On Tue, May 6, 2014 at 4:10 PM, Laeeq Ahmed wrote: > Hi all, > > If anyone is using maven for building scala classes with all dependencies > for spark, please provide a sample pom.xml here. I am having trouble using > maven for a simple scala job though it was working pro

Not getting mails from user group

2014-05-15 Thread Laeeq Ahmed
Hi all, There seems to be a problem. I have not been getting mails from the Spark user group for two days. Regards, Laeeq

Average of each RDD in Stream

2014-05-15 Thread Laeeq Ahmed
Hi, I use the following code for calculating the average. The problem is that the reduce operation returns a DStream here and not a tuple as it normally does without streaming. So how can we get the sum and the count from the DStream? Can we cast it to a tuple? val numbers = ssc.textFileStream(args(
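
A minimal sketch of two ways out, with the file-stream source taken loosely from the snippet: either keep everything as a DStream and divide in a further map, or pull the (sum, count) pair out per batch with foreachRDD.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("StreamAverageSketch"), Seconds(2))
    val numbers = ssc.textFileStream("hdfs://namenode/input")     // placeholder directory
    val sumAndCount = numbers.map(x => (x.toDouble, 1L))
      .reduce((a, b) => (a._1 + b._1, a._2 + b._2))               // still a DStream: one (sum, count) per batch
    val averages = sumAndCount.map { case (s, c) => s / c }       // stream of per-batch means
    sumAndCount.foreachRDD { rdd =>
      rdd.collect().foreach { case (s, c) => println(s"batch mean = ${s / c}") }   // tuple per batch, on the driver
    }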

Re: Easy one

2014-05-15 Thread Laeeq Ahmed
Hi Ian, Don't use SPARK_MEM in spark-env.sh. It will be set for all of your jobs. The better way is to use only the second option sconf.setExecutorEnv("spark.executor.memory", "4g") i.e. set it in the driver program. In this way every job will have memory according to requirement. For examp
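
For reference, a minimal sketch of setting executor memory per application from the driver; spark.executor.memory is a Spark configuration property, so it is normally set on the SparkConf (the value 4g is just the example from the thread).

    import org.apache.spark.{SparkConf, SparkContext}

    val sconf = new SparkConf()
      .setAppName("MyJob")
      .set("spark.executor.memory", "4g")   // per-application executor heap
    val sc = new SparkContext(sconf)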

Taking value out from Dstream for each RDD

2014-05-15 Thread Laeeq Ahmed
Hi all, I want to calculate the mean and SD for each RDD. I used the following code for the mean and now I have to use this mean for the SD, but I am not sure how to get these means for each RDD from the DStream, so I can use them for the SD. My sample file is as 1 2 3 4 5 The code is as  val individualpoin
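
A minimal sketch for getting both figures in one go, with the stream assumed to deliver one number per line from a placeholder source: stats() on each batch's RDD of doubles returns a StatCounter that already carries the count, mean and standard deviation, so the mean does not have to be fed back in separately.

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext._
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("MeanSdSketch"), Seconds(2))
    val individualPoints = ssc.socketTextStream("localhost", 9999).map(_.toDouble)
    individualPoints.foreachRDD { rdd =>
      val st = rdd.stats()                   // StatCounter: count, mean, stdev, ...
      if (st.count > 0) println(s"mean = ${st.mean}, stdev = ${st.stdev}")
    }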

Average of each RDD in Stream

2014-05-14 Thread Laeeq Ahmed
Hi, I use the following code for calculating the average. The problem is that the reduce operation returns a DStream here and not a tuple as it normally does without streaming. So how can we get the sum and the count from the DStream? Can we cast it to a tuple?     val numbers = ssc.textFileStream(a

Average of each RDD in Stream

2014-05-12 Thread Laeeq Ahmed
Hi, I use the following code for calculating the average. The problem is that the reduce operation returns a DStream here and not a tuple as it normally does without streaming. So how can we get the sum and the count from the DStream? Can we cast it to a tuple? val numbers = ssc.textFileStream(args(

maven for building scala simple program

2014-05-06 Thread Laeeq Ahmed
Hi all,   If anyone is using maven for building scala classes with all dependencies for spark, please provide a sample pom.xml here. I am having trouble using maven for a simple scala job though it was working properly for java. I have added the scala maven plugin but still getting some issues.   La

Multiple Streams with Spark Streaming

2014-05-01 Thread Laeeq Ahmed
Hi all, Is it possible to read and process multiple streams with Spark? I have an EEG (brain waves) csv file with 23 columns.  Each column is one stream (wave) and each column has one million values. I know one way to do it is to take the transpose of the file and then give it to Spark, and each mapper w

Re: Random Forest on Spark

2014-04-18 Thread Laeeq Ahmed
On Thu, Apr 17, 2014 at 11:50 AM, Sung Hwan Chung wrote: Debasish,

Random Forest on Spark

2014-04-17 Thread Laeeq Ahmed
Hi, For one of my applications, I want to use Random Forests (RF) on top of Spark. I see that currently MLlib does not have an implementation of RF. What other open-source RF implementations would be good to use with Spark in terms of speed? Regards, Laeeq Ahmed, KTH, Sweden.

Error while reading from HDFS Simple application

2014-03-20 Thread Laeeq Ahmed
VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$CreateSnapshotRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet; What can be the cause of this error? Regards, Laeeq Ahmed, PhD Student, HPCViz, KTH. http