Hi,
I have set "spark.streaming.receiver.maxRate" to "100". My batch interval is
4sec but still sometimes there are more than 400 records per batch. I am using
spark 1.2.0.
Regards,
Laeeq
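For what it's worth, spark.streaming.receiver.maxRate is applied per receiver and in records per second, so with more than one receiver a batch can still exceed 400 records. A minimal sketch of where the setting normally goes (only the property name and values come from the question above; the app name is made up):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Cap each receiver at 100 records/second; with a 4-second batch interval a
// single receiver should then deliver roughly 400 records per batch at most.
val conf = new SparkConf()
  .setAppName("RateLimitedApp")                      // illustrative name
  .set("spark.streaming.receiver.maxRate", "100")
val ssc = new StreamingContext(conf, Seconds(4))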
Hi guys,
I have a Spark Streaming application and I want to increase its performance.
Basically it is a design question. My input is like time, s1, s2 ... s23. I have
to process these columns with the same operations. I am running on 40 cores on
10 machines.
1. I am trying to get rid of the loop in the mid
Hi,
Consider I have a tab-delimited text file with 10 columns. Each column is a
set of text. I would like to do a word count for each column. In Scala, I would
do the following RDD transformation and action:
val data = sc.textFile("hdfs://namenode/data.txt")
for(i <- 0 until 9){
data.map
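One way to avoid looping over the columns (a rough sketch, not from the original mail; it assumes words inside each tab-separated column are separated by spaces) is to key every word by its column index and count everything in a single pass:

// Per-column word count in one pass: key = (columnIndex, word).
val data = sc.textFile("hdfs://namenode/data.txt")   // path from the snippet above
val countsPerColumn = data.flatMap { line =>
  line.split("\t").zipWithIndex.flatMap { case (column, i) =>
    column.split(" ").map(word => ((i, word), 1))
  }
}.reduceByKey(_ + _)                                 // ((columnIndex, word), count)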
Hi,
Any comments, please?
Regards,
Laeeq
On Friday, April 17, 2015 11:37 AM, Laeeq Ahmed
wrote:
Hi,
I am working with multiple Kafka streams (23 streams) and currently I am
processing them separately. I receive one stream from each topic. I have the
following questions.
1. Spark
And what is the message rate of each topic mate – that was the other part of
the required clarifications
From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com]
Sent: Monday, April 20, 2015 3:38 PM
To: Evo Eftimov; user@spark.apache.org
Subject: Re: Equal number of RDD Blocks
Hi, I have two different topics and two
in parallel) giving a start of two different DStreams
From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com.INVALID]
Sent: Monday, April 20, 2015 3:15 PM
To: user@spark.apache.org
Subject: Equal number of RDD Blocks
Hi, I have two streams of data from Kafka. How can I make an approx. equal number of
Hi,
I have two streams of data from Kafka. How can I make an approximately equal
number of RDD blocks on the two executors? Please see the attachment; one worker
has 1785 RDD blocks and the other has 26.
Regards,
Laeeq
Hi,
I am working with multiple Kafka streams (23 streams) and currently I am
processing them separately. I receive one stream from each topic. I have the
following questions.
1. The Spark Streaming guide suggests unioning these streams. Is it possible to
get statistics of each stream even after t
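As a rough sketch of one way to union the streams without losing per-topic statistics (ssc, zkQuorum and group are assumed to be defined as in the createStream calls quoted elsewhere in these threads; the topic names are made up): tag each record with its topic before the union.

import org.apache.spark.streaming.StreamingContext._       // pair-DStream implicits on older Spark
import org.apache.spark.streaming.kafka.KafkaUtils

val topics = Seq("topic1", "topic2" /* ... up to 23 */)     // illustrative topic names
val tagged = topics.map { t =>
  KafkaUtils.createStream(ssc, zkQuorum, group, Map(t -> 1))
    .map { case (_, value) => (t, value) }                  // keep the topic as the key
}
val unioned = ssc.union(tagged)                             // one DStream of (topic, value)
val countsPerTopic = unioned.map { case (t, _) => (t, 1L) }.reduceByKey(_ + _)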
DStream.foreachRDD ( rdd => {
  val topK = rdd.top(K)
  // use top K
})
2. Or, you can use the topK to create another RDD using sc.makeRDD:
DStream.transform ( rdd => {
  val topK = rdd.top(K)
  rdd.context.makeRDD(topK, numPartitions)
})
TD
On Fri, Mar 13, 2015 at 5:58 AM, Laeeq
't work as it's not the same as top()
On Fri, Mar 13, 2015 at 11:23 AM, Akhil Das wrote:
> Like this?
>
> dstream.repartition(1).mapPartitions(it => it.take(5))
>
>
>
> Thanks
> Best Regards
>
> On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed
> wrote:
>&
Hi,
repartition is expensive. I am looking for an efficient way to do this.
Regards,
Laeeq
On Friday, March 13, 2015 12:24 PM, Akhil Das
wrote:
Like this?
dstream.repartition(1).mapPartitions(it => it.take(5))
Thanks
Best Regards
On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed
wrote:
Hi
Hi,
I normally use dstream.transform whenever I need to use methods which are
available in the RDD API but not in the streaming API, e.g. dstream.transform(x =>
x.sortByKey(true)).
But there are other RDD methods which return types other than an RDD, e.g.
dstream.transform(x => x.top(5)); top here returns an Array
Hi,
I have a streaming application where I am doing a top-10 count in each window,
which seems slow. Is there an efficient way to do this?
val counts = keyAndValues.map(x =>
math.round(x._3.toDouble)).countByValueAndWindow(Seconds(4), Seconds(4))
val topCounts = counts.repartition(1).map
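One possibly cheaper alternative (a sketch, untested) is to keep the windowed counts distributed and take the top 10 of each windowed RDD with rdd.top, which avoids repartition(1) and a full sort:

// Top 10 values by count in each window, without collapsing to one partition.
val topCounts = counts.transform { rdd =>
  val top10 = rdd.top(10)(Ordering.by { case (_, count) => count })
  rdd.sparkContext.makeRDD(top10, 1)
}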
nstantiate RDDs anyway.
On Fri, Mar 6, 2015 at 7:06 PM, Laeeq Ahmed
wrote:
> Hi,
>
> I am filtering the first DStream with the value in the second DStream. I also want
> to keep the value of the second DStream. I have done the following and am having a
> problem with returning the new RDD:
Hi,
I am filtering the first DStream with the value in the second DStream. I also want to
keep the value of the second DStream. I have done the following and am having a problem
with returning the new RDD:
val transformedFileAndTime = fileAndTime.transformWith(anomaly, (rdd1:
RDD[(String,String)], rdd2 : RDD[Int]) =
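A rough sketch of how this might be completed (the filter condition, the field names and the assumption that the anomaly RDD is small enough to collect inside the transform function are all illustrative, not from the original mail):

import org.apache.spark.rdd.RDD

val transformedFileAndTime = fileAndTime.transformWith(anomaly,
  (rdd1: RDD[(String, String)], rdd2: RDD[Int]) => {
    // transform functions run on the driver, so collecting a tiny RDD here is fine
    val threshold = rdd2.collect().headOption.getOrElse(0)
    rdd1.filter { case (_, time) => time.toInt > threshold }   // illustrative condition
        .map { case (file, time) => (file, time, threshold) }  // keep the 2nd stream's value
  })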
Hi,
I found the windowed mean as follows:
val counts = myStream.map(x => (x.toDouble, 1)).reduceByWindow(
  (a, b) => (a._1 + b._1, a._2 + b._2),
  (a, b) => (a._1 - b._1, a._2 - b._2), Seconds(2), Seconds(2))
val windowMean = counts.map(x => x._1.toFloat / x._2)
Now I want to find the cumulative moving average
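For the cumulative moving average, one hedged sketch is to carry a running (sum, count) across windows with updateStateByKey under a single fixed key; this assumes a checkpoint directory has been set, which updateStateByKey requires.

import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits on older Spark

// counts is the per-window (sum, count) from above; accumulate it across windows.
// Requires e.g. ssc.checkpoint("hdfs://namenode/checkpoints")  (path illustrative).
val runningTotals = counts
  .map { case (sum, cnt) => ("all", (sum, cnt)) }      // one fixed key for the global state
  .updateStateByKey[(Double, Int)] { (batch: Seq[(Double, Int)], state: Option[(Double, Int)]) =>
    val (s0, c0) = state.getOrElse((0.0, 0))
    val (s1, c1) = batch.foldLeft((0.0, 0)) { case ((s, c), (bs, bc)) => (s + bs, c + bc) }
    Some((s0 + s1, c0 + c1))
  }
val cumulativeMean = runningTotals.map { case (_, (s, c)) => if (c > 0) s / c else 0.0 }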
Hi,
It worked out like this:
val topCounts = sortedCounts.transform(rdd =>
  rdd.zipWithIndex().filter(x => x._2 <= 10))
Regards,
Laeeq
On Wednesday, January 7, 2015 5:25 PM, Laeeq Ahmed
wrote:
Hi Yana,
I also think that val top10 = your_stream.mapPartitions(rdd => rdd.take
p10 = your_stream.mapPartitions(rdd => rdd.take(10))
would result in an RDD containing the top 10 entries per partition -- am I
wrong?
I am not sure if there is a more efficient way but I think this would work:
sortedCounts.zipWithIndex().filter(x=>x._2 <=10).saveAsText
On Wed, Jan 7, 2015 at 10:38 AM, Laeeq A
top10 = your_stream.mapPartitions(rdd => rdd.take(10))
Thanks
Best Regards
On Mon, Jan 5, 2015 at 11:08 PM, Laeeq Ahmed
wrote:
Hi,
I am counting values in each window to find the top values, and I want to save
only the 10 most frequent values of each window to HDFS rather than all the
values.
ee
Hi,
I am counting values in each window to find the top values, and I want to save
only the 10 most frequent values of each window to HDFS rather than all the
values.
eegStreams(a) = KafkaUtils.createStream(ssc, zkQuorum, group, Map(args(a) ->
1),StorageLevel.MEMORY_AND_DISK_SER).map(_._2)val count
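A sketch of one way to write only the 10 most frequent values of each window to HDFS (the window sizes and the output path are illustrative), reusing the transform-plus-rdd.top idea shown earlier:

val counts = eegStreams(a).map(_.toDouble).countByValueAndWindow(Seconds(4), Seconds(4))
val top10PerWindow = counts.transform { rdd =>
  rdd.sparkContext.makeRDD(rdd.top(10)(Ordering.by { case (_, c) => c }), 1)
}
top10PerWindow.saveAsTextFiles("hdfs://namenode/output/top10")   // illustrative path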
Hi,
I am using Spark standalone on EC2. I can access the ephemeral HDFS from the
spark-shell interface, but I can't access HDFS in a standalone application. I am
using Spark 1.2.0 with Hadoop 2.4.0 and launched the cluster from the ec2 folder
on my local machine. In my pom file I have given the hadoop client as 2.4.
Hi Bill,
Hope the following is what you need.
val zerotime = System.currentTimeMillis()
Then in foreach do the following
// difference = RDDtimeparameter - zerotime  // only to find the constant value to be used later
starttime = (RDDtimeparameter - (zerotime + difference)) - intervalsize
endtime
val filteredRDD = windowedRDD.filter(r => r._1 <= currentAppTimeWindowEnd &&
  r._1 > currentAppTimeWindowStart) // filter and retain only the records
                                    // that fall in the current app-time window
return filteredRDD
})
Hope this helps!
TD
On Thu, Jul 17, 2014 at 5:54 AM, Laeeq Ahmed wrote:
._1 < 4) // filter and retain only
the records that fall in the timestamp-based window
return filteredRDD
})
Consider the input tuples as (1,23), (1.2,34) . . . (3.8,54), (4,413) . . .
where the key is the timestamp.
Regards,
Laeeq
On Saturday, July 12, 2014 8:29 PM, Laeeq
RDD filter out the records
with the desired application time and process them.
TD
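A fuller sketch of that suggestion (everything here is illustrative and not from the original mails: the stream is assumed to be a DStream of (timestamp, value) pairs with the timestamp in seconds relative to the start of the data):

val appWindowSize = 4.0                               // seconds of application time, made up
val zerotime = System.currentTimeMillis() / 1000.0    // wall-clock time when the job starts
val inAppWindow = stream.transform { (rdd, batchTime) =>
  // Derive the current application-time window from the batch time handed to transform.
  val windowEnd = batchTime.milliseconds / 1000.0 - zerotime
  val windowStart = windowEnd - appWindowSize
  rdd.filter(r => r._1 > windowStart && r._1 <= windowEnd)
}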
On Fri, Jul 11, 2014 at 7:44 AM, Laeeq Ahmed wrote:
Hi,
>
>
>In the spark streaming paper, "slack time" has been suggested for delaying the
>batch creation in case of external timestamps
Hi,
In the Spark Streaming paper, "slack time" has been suggested for delaying the
batch creation in case of external timestamps. I don't see any such option in
StreamingContext. Is it available in the API?
Also going through the previous posts, queueStream has been suggested for this.
I look
Hi,
For QueueRDD, have a look here.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala
Regards,
Laeeq,
PhD candidate,
KTH, Stockholm.
On Sunday, July 6, 2014 10:20 AM, alessandro finamore
wrote:
On 5 July 2014 23:0
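A minimal sketch of that queueStream approach (the path, the chunk size and the assumption that the file is small enough to collect on the driver are all illustrative): pre-split the file into per-batch RDDs and let the StreamingContext drain the queue one RDD per batch interval.

import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD

val rddQueue = new Queue[RDD[String]]()
val stream = ssc.queueStream(rddQueue, oneAtATime = true)

val lines = sc.textFile("hdfs://namenode/data.txt").collect()   // small file assumed
lines.grouped(1000).foreach(chunk => rddQueue += sc.makeRDD(chunk))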
Hi,
First use foreachRDD and then use collect, as:
DStream.foreachRDD(rdd => {
rdd.collect.foreach({
Also, it's better to use Scala; it is less verbose.
Regards,
Laeeq
On Wednesday, July 9, 2014 3:29 PM, Madabhattula Rajesh Kumar
wrote:
Hi Team,
Could you please help me to resolve bel
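A completed version of that sketch might look like the following (println stands in for whatever per-record handling is needed, and collect assumes the RDDs are small enough to bring to the driver):

dstream.foreachRDD { rdd =>
  rdd.collect().foreach { record =>
    println(record)    // handle each record on the driver
  }
}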
Hi,
For QueueRDD, have a look here.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala
Regards,
Laeeq
On Friday, May 23, 2014 10:33 AM, Mayur Rustagi wrote:
Well, it's hard to use text data as time of input.
But if
ove is that the streams just pump data in
at different rates -- first one got 7462 points in the first batch
interval, whereas stream2 saw 10493
On Tue, Jul 1, 2014 at 5:34 AM, Laeeq Ahmed wrote:
> Hi,
>
> The window size in a spark streaming is time based which means we have
> different
Hi,
The window size in Spark Streaming is time-based, which means we have a
different number of elements in each window. For example, if you have two
streams (there might be more) which are related to each other and you want to
compare them in a specific time interval, I am not clear how it will work.
s give you
all elements in the sliding window, and you can compute a mean or
variance as you like.
You should be able to do this quite efficiently without recomputing
each time by using reduceByWindow and a running mean / stdev formula.
On Wed, May 21, 2014 at 1:42 PM, Laeeq Ahmed wrote:
Hi,
I want to do a union of all RDDs in each window of a DStream. I found DStream.union
and haven't seen anything like DStream.windowRDDUnion.
Is there any way around it?
I want to find the mean and SD of all values which come under each sliding window,
for which I need to union all the RDDs in each window
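A sketch of that suggestion (assuming the values are already a DStream[Double] called values; the window sizes are illustrative): carry (sum, sum of squares, count) through reduceByWindow and derive the mean and SD from it.

val stats = values.map(x => (x, x * x, 1L))
  .reduceByWindow((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3),
                  Seconds(8), Seconds(4))
val meanAndSd = stats.map { case (sum, sumSq, n) =>
  val mean = sum / n
  val sd   = math.sqrt(math.max(0.0, sumSq / n - mean * mean))
  (mean, sd)
}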
e the
>entire data in the file for your analysis. Spark (given enough memory) can
>process large amounts of data quickly.
>
>On May 15, 2014, at 9:52 AM, Laeeq Ahmed wrote:
>
>
>Hi,
>>
>>I have data in a file. Can I read it as Stream in spark? I know it seems odd
Hi,
I have data in a file. Can I read it as a stream in Spark? I know it seems odd to
read a file as a stream, but it has practical applications in real life if I can
read it as a stream. Are there any other tools which can give this file as a stream
to Spark, or do I have to make batches manually, which is not
On Tue, May 6, 2014 at 4:10 PM, Laeeq Ahmed wrote:
> Hi all,
>
> If anyone is using Maven for building Scala classes with all dependencies
> for Spark, please provide a sample pom.xml here. I am having trouble using
> Maven for a simple Scala job, though it was working pro
Hi all,
There seems to be a problem. I have not been getting mails from the Spark user
group for two days.
Regards,
Laeeq
Hi,
I use the following code for calculating the average. The problem is that the
reduce operation returns a DStream here and not a tuple as it normally does
without streaming. So how can we get the sum and the count from the DStream?
Can we cast it to a tuple?
val numbers = ssc.textFileStream(args(
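One hedged sketch of how the sum and count can be pulled back out (the input path is illustrative): pair each number with a 1, reduce to (sum, count) per batch, and map that to the average; the result is still a DStream, so print it or use foreachRDD to act on it.

val numbers = ssc.textFileStream("hdfs://namenode/stream-dir").map(_.toDouble)  // path made up
val average = numbers.map(x => (x, 1L))
  .reduce((a, b) => (a._1 + b._1, a._2 + b._2))   // one (sum, count) pair per batch
  .map { case (sum, count) => sum / count }
average.print()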
Hi Ian,
Don't use SPARK_MEM in spark-env.sh. It will set it for all of your jobs. The
better way is to use only the second option,
sconf.setExecutorEnv("spark.executor.memory", "4g"), i.e. set it in the driver
program. In this way every job will have memory according to its requirement.
For examp
Hi all,
I want to calculate the mean and SD for each RDD. I used the following code for
the mean, and now I have to use this mean for the SD, but I am not sure how to
get these means for each RDD from the DStream so I can use them for the SD. My
sample file is as:
1
2
3
4
5
The code is as follows:
val individualpoin
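A sketch of one alternative (the path is illustrative): once the values are a DStream[Double], each RDD already offers mean() and stdev() through DoubleRDDFunctions, so foreachRDD can compute both per RDD without carrying the mean around separately.

import org.apache.spark.SparkContext._   // DoubleRDDFunctions implicits on older Spark

val points = ssc.textFileStream("hdfs://namenode/stream-dir").map(_.toDouble)   // path made up
points.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    val stat = rdd.stats()               // count, mean and stdev in one pass
    println(s"mean = ${stat.mean}, sd = ${stat.stdev}")
  }
}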
Hi all,
If anyone is using Maven for building Scala classes with all dependencies for
Spark, please provide a sample pom.xml here. I am having trouble using Maven
for a simple Scala job, though it was working properly for Java. I have added
the scala-maven-plugin but am still getting some issues.
Laeeq
Hi all,
Is it possible to read and process multiple streams with Spark? I have an
EEG (brain waves) CSV file with 23 columns. Each column is one stream (wave) and
each column has one million values.
I know one way to do it is to take the transpose of the file and then give it to
Spark, and each mapper w
Hi,
For one of my applications, I want to use Random Forests (RF) on top of Spark. I
see that currently MLlib does not have an implementation of RF. What other
open-source RF implementations would be good to use with Spark in terms of speed?
Regards,
Laeeq Ahmed,
KTH, Sweden.
VerifyError: class
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$CreateSnapshotRequestProto
overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
What can be the cause of this error?
Regards,
Laeeq Ahmed,
PhD Student,
HPCViz, KTH.
http