Re: Sliding Average over Window in Spark Streaming

2016-05-09 Thread Mich Talebzadeh
In general, working out the minimum or max of, say, prices (I do not know your use case) is pretty straightforward. For example: val maxValue = price.reduceByWindow((x: Double, y: Double) => if (x > y) x else y, Seconds(windowLength), Seconds(slidingInterval)); maxValue.print(). The average values are running
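A minimal sketch of the max example above, plus one common way to get a windowed average (the preview is cut off before the author's own average code, so the (sum, count) approach below is an assumption, not necessarily his method). It assumes a DStream[Double] named price; the window parameter values are placeholders:

    import org.apache.spark.streaming.Seconds

    val windowLength = 60L       // values assumed for illustration
    val slidingInterval = 10L

    // Max over the window, as in the post:
    val maxValue = price.reduceByWindow(
      (x: Double, y: Double) => if (x > y) x else y,
      Seconds(windowLength), Seconds(slidingInterval))
    maxValue.print()

    // Assumed average pattern: reduce a (sum, count) pair over the
    // window, then divide:
    val avgValue = price.map(p => (p, 1L))
      .reduceByWindow(
        (a, b) => (a._1 + b._1, a._2 + b._2),
        Seconds(windowLength), Seconds(slidingInterval))
      .map { case (sum, n) => sum / n }
    avgValue.print()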

Re: Sliding Average over Window in Spark Streaming

2016-05-06 Thread Mich Talebzadeh
Hi Matthias, say you have the following. "Batch interval" is the basic interval at which the system will receive the data in batches. val ssc = new StreamingContext(sparkConf, Seconds(n)) // window length - the duration of the window below; that must be a multiple of the batch interval n in = > Str
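A sketch of the relationship described above (the app name and concrete values are assumptions): both the window length and the sliding interval must be multiples of the batch interval n.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val n = 2                                        // batch interval in seconds (value assumed)
    val sparkConf = new SparkConf().setAppName("SlidingAverage")
    val ssc = new StreamingContext(sparkConf, Seconds(n))

    // Both must be multiples of the batch interval n:
    val windowLength = Seconds(3 * n)
    val slidingInterval = Seconds(2 * n)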

Re: sliding Top N window

2016-03-27 Thread Lars Albertsson
[+spark list again] (I did not want to send "commercial spam" to the list :-)) The reduce function for CMSs is element-wise addition, and the reverse reduce function is element-wise subtraction. The heavy hitters list does not have monoid addition, but you can cheat. I suggest creating a heavy hi
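An illustrative sketch (names assumed, not from the thread) of the element-wise operations described, with a CMS modeled as a 2-D table of counters:

    type Sketch = Array[Array[Long]]   // a CMS as a 2-D table of counters

    // Element-wise addition: the reduce function for merging two sketches.
    def add(a: Sketch, b: Sketch): Sketch =
      a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (x, y) => x + y } }

    // Element-wise subtraction: the inverse ("reverse reduce") function,
    // used to drop the oldest slice when the window slides.
    def subtract(a: Sketch, b: Sketch): Sketch =
      a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (x, y) => x - y } }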

Re: sliding Top N window

2016-03-24 Thread Lars Albertsson
I am not aware of any open source examples. If you search for usages of stream-lib or Algebird, you might be lucky. Twitter uses CMSs, so they might have shared some code or presentation. We created a proprietary prototype of the solution I described, but I am not at liberty to share code. We did

Re: sliding Top N window

2016-03-22 Thread Jatin Kumar
Lars, can you please point to an example? On Mar 23, 2016 2:16 AM, "Lars Albertsson" wrote: > @Jatin, I touched that case briefly in the linked presentation. > > You will have to decide on a time slot size, and then aggregate slots > to form windows. E.g. if you select a time slot of an hour, you

Re: sliding Top N window

2016-03-22 Thread Lars Albertsson
@Jatin, I touched that case briefly in the linked presentation. You will have to decide on a time slot size, and then aggregate slots to form windows. E.g. if you select a time slot of an hour, you build a CMS and a heavy hitter list for the current hour slot, and start new ones at 00 minutes. In
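A hedged sketch of the slot scheme (all names below are assumptions): a simple count map stands in for the per-slot CMS plus heavy-hitter list, and a sliding window is formed by merging the most recent slots.

    type Slot = Map[String, Long]   // stand-in for a per-slot CMS + heavy-hitter list

    def merge(a: Slot, b: Slot): Slot =
      (a.keySet ++ b.keySet)
        .map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap

    // Hourly slots: all events in the same hour land in the same slot.
    def slotOf(timestampMillis: Long): Long = timestampMillis / (60L * 60 * 1000)

    // A 24-hour sliding window is the merge of the 24 most recent hourly slots.
    def windowOf(slots: Map[Long, Slot], newestSlot: Long): Slot =
      (newestSlot - 23 to newestSlot)
        .flatMap(slots.get)
        .foldLeft(Map.empty[String, Long])(merge)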

Re: sliding Top N window

2016-03-22 Thread Jatin Kumar
I am sorry, the signature of compare() is different. It should be: implicit val order = new scala.Ordering[(String, Long)] { override def compare(a1: (String, Long), a2: (String, Long)): Int = { a1._2.compareTo(a2._2) } } -- Thanks Jatin On Tue, Mar 22, 2016 at 5:53 PM, J
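For reference, the corrected Ordering compiles as below; the top-N usage at the end is an assumed illustration (RDD.top takes an implicit Ordering), not part of the original message:

    implicit val order: Ordering[(String, Long)] = new Ordering[(String, Long)] {
      override def compare(a1: (String, Long), a2: (String, Long)): Int =
        a1._2.compareTo(a2._2)
    }

    // With this in scope, counts.top(10) on an RDD[(String, Long)]
    // returns the 10 pairs with the highest counts.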

Re: sliding Top N window

2016-03-22 Thread Jatin Kumar
Hello Yakubovich, I have been looking into a similar problem. @Lars please note that he wants to maintain the top N products over a sliding window, whereas the Count-Min Sketch algorithm is useful if we want to maintain a global top N products list. Please correct me if I am wrong here. I tried using

Re: sliding Top N window

2016-03-21 Thread Rishi Mishra
Hi Alexy, We are also trying to solve similar problems using approximation. Would like to hear more about your usage. We can discuss this offline without boring others. :) Regards, Rishitesh Mishra, SnappyData (http://www.snappydata.io/) https://in.linkedin.com/in/rishiteshmishra On Tue, Mar

Re: sliding Top N window

2016-03-21 Thread Lars Albertsson
Hi, If you can accept approximate top N results, there is a neat solution for this problem: Use an approximate Map structure called Count-Min Sketch, in combination with a list of the M top items, where M > N. When you encounter an item not in the top M, you look up its count in the Count-Min Sket
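A minimal illustrative Count-Min Sketch, sketching the structure described (d hash rows of w counters each; names and sizes assumed, not production code):

    import scala.util.hashing.MurmurHash3

    class CountMinSketch(d: Int = 4, w: Int = 1024) {
      private val table = Array.ofDim[Long](d, w)

      // One hash function per row, derived by varying the seed.
      private def idx(item: String, row: Int): Int = {
        val h = MurmurHash3.stringHash(item, row) % w
        if (h < 0) h + w else h
      }

      def add(item: String, count: Long = 1L): Unit =
        for (r <- 0 until d) table(r)(idx(item, r)) += count

      // The estimate is the minimum over rows; it can overestimate but
      // never underestimate, which is what makes the top-M trick safe.
      def estimate(item: String): Long =
        (0 until d).map(r => table(r)(idx(item, r))).min
    }

To maintain the top M list (M > N), look up a newly seen item's estimate and swap it into the list whenever it beats the list's current minimum count.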

Re: sliding

2015-07-02 Thread tog
Understood. Thanks for your great help Cheers Guillaume On 2 July 2015 at 23:23, Feynman Liang wrote: > Consider an example dataset [a, b, c, d, e, f] > > After sliding(3), you get [(a,b,c), (b,c,d), (c, d, e), (d, e, f)] > > After zipWithIndex: [((a,b,c), 0), ((b,c,d), 1), ((c, d, e), 2), ((d,

Re: sliding

2015-07-02 Thread Feynman Liang
Consider an example dataset [a, b, c, d, e, f] After sliding(3), you get [(a,b,c), (b,c,d), (c, d, e), (d, e, f)] After zipWithIndex: [((a,b,c), 0), ((b,c,d), 1), ((c, d, e), 2), ((d, e, f), 3)] After filter: [((a,b,c), 0), ((d, e, f), 3)], which is what I'm assuming you want (non-overlapping bu
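A runnable version of this walk-through, assuming a SparkContext named sc and the sliding helper from org.apache.spark.mllib.rdd.RDDFunctions:

    import org.apache.spark.mllib.rdd.RDDFunctions._

    val events = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"))
    val buckets = events.sliding(3)           // (a,b,c), (b,c,d), (c,d,e), (d,e,f)
      .zipWithIndex                           // pair each window with its index
      .filter { case (_, i) => i % 3 == 0 }   // keep indices 0 and 3
      .map(_._1)                              // -> (a,b,c) and (d,e,f)
    buckets.collect().foreach(w => println(w.mkString(",")))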

Re: sliding

2015-07-02 Thread tog
Well, it did reduce the length of my series of events. I will have to dig into what it actually did ;-) I would assume that it took one out of every 3 values, is that correct? Would it be possible to control a bit more how the value assigned to the bucket is computed, for example take the first element, the min

Re: sliding

2015-07-02 Thread Feynman Liang
How about: events.sliding(3).zipWithIndex.filter(_._2 % 3 == 0) That would group the RDD into adjacent buckets of size 3. On Thu, Jul 2, 2015 at 2:33 PM, tog wrote: > Was complaining about the Seq ... > > Moved it to > val eventsfiltered = events.sliding(3).map(s => Event(s(0).time, > (s(0).x

Re: sliding

2015-07-02 Thread tog
Was complaining about the Seq ... Moved it to val eventsfiltered = events.sliding(3).map(s => Event(s(0).time, (s(0).x+s(1).x+s(2).x)/3.0, (s(0).vztot+s(1).vztot+s(2).vztot)/3.0)) and that is working. Anyway this is not what I wanted to do; my goal was more to implement bucketing to shorten the
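A hedged, compiling version of the averaging above, combined with the zipWithIndex filter suggested earlier in the thread to get non-overlapping buckets. events is assumed to be an RDD[Event], and the Event fields are guessed from the fragments in this thread:

    import org.apache.spark.mllib.rdd.RDDFunctions._

    case class Event(time: Double, x: Double, vztot: Double)  // fields assumed

    val eventsfiltered = events.sliding(3)
      .zipWithIndex
      .filter { case (_, i) => i % 3 == 0 }   // non-overlapping buckets of 3
      .map { case (s, _) =>
        Event(s(0).time,
          (s(0).x + s(1).x + s(2).x) / 3.0,
          (s(0).vztot + s(1).vztot + s(2).vztot) / 3.0)
      }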

Re: sliding

2015-07-02 Thread Feynman Liang
What's the error you are getting? On Thu, Jul 2, 2015 at 9:37 AM, tog wrote: > Hi > > Sorry for this scala/spark newbie question. I am creating an RDD which > represents a large time series this way: > val data = sc.textFile("somefile.csv") > > case class Event( > time: Double, > x:

Re: Sliding Window operations do not work as documented

2014-03-24 Thread Tathagata Das
Hello Sanjay, Yes, your understanding of lazy semantics is correct. But ideally every batch should be read based on the batch interval provided in the StreamingContext. Can you open a JIRA on this? On Mon, Mar 24, 2014 at 7:45 AM, Sanjay Awatramani wrote: > Hi All, > > I found out why this problem

Re: Sliding Window operations do not work as documented

2014-03-24 Thread Sanjay Awatramani
Hi All, I found out why this problem exists. Consider the following scenario: - a DStream is created from any source (I've checked with file and socket) - no actions are applied to this DStream - a sliding window operation is applied to this DStream and an action is applied to the sliding window.
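A minimal sketch of the scenario described (source, port, and interval values are assumptions): only the windowed stream carries an action, so nothing forces the base DStream to be evaluated at every batch interval.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sparkConf = new SparkConf().setAppName("WindowTest")
    val ssc = new StreamingContext(sparkConf, Seconds(1))  // batch interval: 1s
    val lines = ssc.socketTextStream("localhost", 9999)    // no action on the base DStream
    val windowed = lines.window(Seconds(5), Seconds(5))    // sliding window on top of it
    windowed.print()                                       // the only action in the job
    ssc.start()
    ssc.awaitTermination()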