Re: Relation between DStream and RDDs

2014-03-21 Thread Azuryy
Thanks for sharing here. > On March 21, 2014, at 20:44, Sanjay Awatramani wrote: > > Hi, > > I searched more articles and ran few examples and have clarified my doubts. > This answer by TD in another thread ( > https://groups.google.com/d/msg/spark-users/GQoxJHAAtX4/0kiRX0n

Re: Relation between DStream and RDDs

2014-03-21 Thread Sanjay Awatramani
Hi, I searched more articles, ran a few examples, and have clarified my doubts. This answer by TD in another thread ( https://groups.google.com/d/msg/spark-users/GQoxJHAAtX4/0kiRX0nm1xsJ ) helped me a lot. Here is the summary of my findings: 1) A DStream can consist of 0 or 1 or more RDDs. 2) E

Re: Relation between DStream and RDDs

2014-03-20 Thread andy petrella
I don't see an example, but conceptually it looks like you'll need a suitable structure such as a Monoid. If the computation is not tied to a window, it is an overall computation that has to be accumulated over time (otherwise it would land in the batch world, see below), and that will be the purpose of
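To make the Monoid idea concrete, here is a minimal sketch in plain Java. It is a toy model and not the Spark API: a stream is modeled as a list of batches, and the names Monoid, LONG_SUM, and runningTotals are hypothetical, introduced only for illustration. It assumes the overall computation can be expressed as an empty value plus an associative combine, so each batch can be summarized on its own and then folded into a running state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only (NOT the Spark API): batches are plain lists.
public class MonoidSketch {

    // Hypothetical interface: an identity value plus an associative combine.
    interface Monoid<A> {
        A empty();
        A combine(A left, A right);
    }

    // Example instance: addition over Long, with 0 as the identity.
    static final Monoid<Long> LONG_SUM = new Monoid<Long>() {
        public Long empty() { return 0L; }
        public Long combine(Long left, Long right) { return left + right; }
    };

    // Summarizes each batch, then folds it into a state carried across batches.
    // Returns the running state observed after every batch.
    static <A> List<A> runningTotals(List<List<A>> batches, Monoid<A> m) {
        List<A> totals = new ArrayList<>();
        A state = m.empty();
        for (List<A> batch : batches) {
            A batchValue = m.empty();
            for (A x : batch) {
                batchValue = m.combine(batchValue, x);
            }
            state = m.combine(state, batchValue);
            totals.add(state);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<List<Long>> batches = Arrays.asList(
            Arrays.asList(1L, 2L),
            Collections.<Long>emptyList(),
            Arrays.asList(3L));
        System.out.println(runningTotals(batches, LONG_SUM)); // prints [3, 3, 6]
    }
}
```

An empty batch contributes the identity value, so the running state is simply carried forward unchanged for that batch.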

Re: Relation between DStream and RDDs

2014-03-20 Thread Pascal Voitot Dev
On Thu, Mar 20, 2014 at 11:57 AM, andy petrella wrote: > also consider creating pairs and use *byKey* operators, and then the key > will be the structure that will be used to consolidate or deduplicate your > data > my2c > > One thing I wonder: imagine I want to sub-divide RDDs in a DStream into s

Re: Relation between DStream and RDDs

2014-03-20 Thread andy petrella
Also consider creating pairs and using *byKey* operators; the key will then be the structure used to consolidate or deduplicate your data. My 2c. On Thu, Mar 20, 2014 at 11:50 AM, Pascal Voitot Dev <pascal.voitot@gmail.com> wrote: > Actually it's quite simple... > > DStream[T] i
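As a rough illustration of the pairs-plus-byKey suggestion, the sketch below models one batch as a list of key/value pairs in plain Java and merges the values that share a key. It is a toy model and not the Spark API; the class name ByKeySketch, the helper reduceByKey, and the sample keys are assumptions made up for this example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Toy model only (NOT the Spark API): one batch is a list of key/value pairs.
public class ByKeySketch {

    // Consolidates a batch by key: values with the same key are merged pairwise.
    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs, BinaryOperator<V> merge) {
        Map<K, V> out = new LinkedHashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            // Map.merge inserts the value if the key is new, else applies the merge function.
            out.merge(p.getKey(), p.getValue(), merge);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> batch = new ArrayList<>();
        batch.add(new SimpleEntry<>("GET", 1));
        batch.add(new SimpleEntry<>("POST", 1));
        batch.add(new SimpleEntry<>("GET", 1));

        // Consolidate: sum the values per key.
        System.out.println(reduceByKey(batch, Integer::sum)); // prints {GET=2, POST=1}

        // Deduplicate: keep only the first value seen for each key.
        System.out.println(reduceByKey(batch, (first, second) -> first)); // prints {GET=1, POST=1}
    }
}
```

The choice of key decides what counts as "the same" record, which is why the key ends up being the structure that drives consolidation or deduplication.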

Re: Relation between DStream and RDDs

2014-03-20 Thread Pascal Voitot Dev
Actually it's quite simple... DStream[T] is a stream of RDD[T]. So applying count on a DStream is just applying count on each RDD of this DStream. So at the end of count, you have a DStream[Long] containing the same number of RDDs as before, but each RDD contains just one element being the count resul
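The per-RDD behaviour of count described above can be sketched with a toy model in plain Java. This is not the Spark API: a stream is modeled as a list of batches, each batch standing in for one RDD, and the names BatchCountSketch and countPerBatch are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only (NOT the Spark API): a "stream" is a list of batches,
// and each batch (standing in for one RDD) is a list of elements.
public class BatchCountSketch {

    // Produces one single-element batch per input batch, holding that batch's size.
    static <T> List<List<Long>> countPerBatch(List<List<T>> stream) {
        List<List<Long>> counts = new ArrayList<>();
        for (List<T> batch : stream) {
            counts.add(Collections.singletonList(Long.valueOf(batch.size())));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> stream = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Collections.<String>emptyList(),
            Arrays.asList("d"));
        System.out.println(countPerBatch(stream)); // prints [[3], [0], [1]]
    }
}
```

The output has the same number of batches as the input, and each output batch holds exactly one element, which matches the description in the message.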

Re: Relation between DStream and RDDs

2014-03-20 Thread Sanjay Awatramani
@TD: I do not need multiple RDDs in a DStream in every batch. On the contrary my logic would work fine if there is only 1 RDD. But then the description for functions like reduce & count (Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStr

Re: Relation between DStream and RDDs

2014-03-20 Thread Pascal Voitot Dev
If I may add my contribution to this discussion, and if I understand your question correctly: DStream stands for discretized stream. It discretizes the data stream over windows of time (according to the project code and the paper I've read). So when you write: JavaStreamingContext stcObj = new JavaStreamingConte

Re: Relation between DStream and RDDs

2014-03-19 Thread Tathagata Das
That is a good question. If I understand correctly, you need multiple RDDs from a DStream in *every batch*. Can you elaborate on why you need multiple RDDs in every batch? TD On Wed, Mar 19, 2014 at 10:20 PM, Sanjay Awatramani wrote: > Hi, > > As I understand, a DStream consists of 1 or more RD

Relation between DStream and RDDs

2014-03-19 Thread Sanjay Awatramani
Hi, As I understand, a DStream consists of 1 or more RDDs. And foreachRDD will run a given func on each and every RDD inside a DStream. I created a simple program which reads log files from a folder every hour: JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(60 * 60