[Thanks a *lot* for your answers!]

That's cool. A possible example would be to simply write a for-comprehension that does this:

    val allEvents = for {
      deviceId    <- rddFromHdfsOfDeviceId
      deviceEvent <- rddFromHdfsOfDeviceEvent(deviceId)
    } yield deviceEvent
    val hist = computeHistOf(allEvents)
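For what it's worth, here is a minimal sketch of how such a for-comprehension desugars, which is why the inner function has to return the same container type as the outer one. Plain Scala `Seq`s stand in for the RDDs, and `deviceIds` and `eventsOf` are hypothetical names invented for this illustration, not anything from Spark.

```scala
// Plain Scala collections stand in for RDDs here (an assumption for illustration only).
object ForCompSketch {
  // Hypothetical stand-ins for the HDFS-backed RDDs named in the message.
  val deviceIds: Seq[String] = Seq("d1", "d2")
  def eventsOf(deviceId: String): Seq[String] =
    Seq(s"$deviceId-on", s"$deviceId-off")

  // The for-comprehension form.
  val viaFor: Seq[String] = for {
    deviceId    <- deviceIds
    deviceEvent <- eventsOf(deviceId)
  } yield deviceEvent

  // Roughly what the compiler rewrites it to: flatMap for the outer
  // generator, map for the last one.
  val viaFlatMap: Seq[String] =
    deviceIds.flatMap(deviceId => eventsOf(deviceId).map(deviceEvent => deviceEvent))
}
```

With `Seq` both forms give the same result because `eventsOf` returns a `Seq`; the RDD version would need `flatMap` to accept a function returning an RDD.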
It's just an example, huh ( :-D )

Also, here is the answer I was preparing to your previous answer:

> Yes, that's what it does.
> And it makes a lot of sense, because an RDD is "just" an abstraction over
> the whole sequence of data that distributes it in a resilient fashion.
> So RDD#flatMap would mean: distribute in a resilient fashion a sequence of
> RDDs before aggregating (or consolidating, or whatever verb ^^) them all.
>
> An example to explain myself, maybe ;-) (and maybe also to confirm my own
> understanding):
>
> * data (logical representation)
>
>     [A1, A2, A3, A4, A5, A6, A7, A8, A9, A10]
>
> * rdd-ized
>
>     rdd1[A1, A2, A3]  rdd2[A4, A5]  rdd3[A6, A7, A8, A9, A10]
>
> * map with f: A => B
>
>     rdd1'[B1, B2, B3]  rdd2'[B4, B5]  rdd3'[B6, B7, B8, B9, B10]
>
> * current flatMap with f: A => Traversable[B] (*not redistributed, I suppose*)
>
>     (intermediate step)
>     rdd1'[[B11, B12], [B21], []]
>     rdd2'[[], [B41, B42, B43]]
>     rdd3'[[B61], [B71, B72], [B81, B82, B83, B84], [], []]
>
>     (final result)
>     rdd1"[B11, B12, B21]
>     rdd2"[B41, B42, B43]
>     rdd3"[B61, B71, B72, B81, B82, B83, B84]
>
> * for-expression compliant version of flatMap with f: A => RDD[B] (*could
>   involve some redistribution, I presume*)
>
>     (intermediate step)
>     rdd1'[rdd11[B11] rdd11[B12] rdd12[B21]]
>     rdd2'[rdd21[B41] rdd22[B42, B43]]
>     rdd3'[rdd31[B61] rdd32[B71, B72], rdd33[B81] rdd34[B82, B83] rdd35[B84]]
>
>     (final result)
>     rdd1"[B11, B12, B21]
>     rdd2"[B41, B42, B43]
>     rdd3"[B61, B71, B72, B81, B82, B83, B84]
>
> As you can see in the example I tried to depict here, the final result
> that would be seen is the same, but in between there might be some
> redistribution (have a quick look at B81, B82, B83, B84).
> That's why, imho, it'd be interesting to clear this up by either renaming
> the function -- or, even better, making flatMap able to redistribute
> in between (which can have a big impact at the reconciliation
> => my first mail :-D).
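To make the two variants in the example concrete, here is a toy sketch, under the loud assumption that an "RDD" is nothing more than a vector of partitions. `ToyRdd`, `flatMapData` and `flatMapRdd` are names made up for this illustration; none of this is Spark code, and real redistribution is not modelled.

```scala
// Toy model only: an "RDD" here is just a vector of partitions, not Spark's RDD.
object PartitionSketch {
  type ToyRdd[A] = Vector[Vector[A]]

  // Like the current flatMap: f returns plain data, flattened inside each
  // partition, so the number of partitions never changes.
  def flatMapData[A, B](rdd: ToyRdd[A])(f: A => Vector[B]): ToyRdd[B] =
    rdd.map(partition => partition.flatMap(f))

  // Like the hypothetical f: A => RDD[B]: every element yields a whole toy RDD,
  // and all of their partitions are concatenated, so the layout can change.
  def flatMapRdd[A, B](rdd: ToyRdd[A])(f: A => ToyRdd[B]): ToyRdd[B] =
    rdd.flatMap(partition => partition.flatMap(f))

  val input: ToyRdd[Int] = Vector(Vector(1, 2, 3), Vector(4, 5))

  // Each n produces (n % 3) copies of itself.
  val asData: Int => Vector[Int] = n => Vector.fill(n % 3)(n)
  // Same elements, wrapped as a single-partition toy RDD.
  val asRdd: Int => ToyRdd[Int] = n => Vector(asData(n))

  val viaData: ToyRdd[Int] = flatMapData(input)(asData)
  val viaRdd: ToyRdd[Int]  = flatMapRdd(input)(asRdd)
}
```

In this sketch the flattened contents of `viaData` and `viaRdd` are identical, while the partition layout differs (2 partitions versus 5), which is the point the example above tries to depict.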
> Tell me if I'm completely wrong ;-) -- or if I'm forgetting something in my
> current understanding -- or if something similar already exists that
> I'm not aware of.
>
> Cheers,
> andy

Andy Petrella
Belgium (Liège)

On Sat, Mar 15, 2014 at 7:06 PM, Koert Kuipers <ko...@tresata.com> wrote:

> just going head first without any thinking, i changed flatMap to
> flatMapData and added a flatMap. for FlatMappedRDD my compute is:
>
>     firstParent[T].iterator(split, context).flatMap(f andThen (_.compute(split, context)))
>
>     scala> val x = sc.parallelize(1 to 100)
>     scala> x.flatMap _
>     res0: (Int => org.apache.spark.rdd.RDD[Nothing]) => org.apache.spark.rdd.RDD[Nothing] = <function1>
>
> my f for flatMap is now f: T => RDD[U], however, i am not sure how to write
> a useful function for this :)
>
> On Sat, Mar 15, 2014 at 1:17 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> > MappedRDD does:
> >     firstParent[T].iterator(split, context).map(f)
> >
> > and FlatMappedRDD:
> >     firstParent[T].iterator(split, context).flatMap(f)
> >
> > so yeah, seems like it's a map or flatMap over the iterator inside, not the
> > RDD itself, sort of...
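As a rough illustration of the observation above (map and flatMap acting on the per-partition iterator rather than on the RDD), here is a small sketch. A plain `Iterator` stands in for `firstParent[T].iterator(split, context)`; that substitution and the name `IteratorSketch` are assumptions of this example, not Spark code.

```scala
// A plain Iterator stands in for one partition's data (an assumption, not Spark's API).
object IteratorSketch {
  // A fresh iterator on every call, since iterators can be consumed only once.
  def partition: Iterator[Int] = Iterator(1, 2, 3)

  // MappedRDD-style compute: exactly one output per input element.
  val mapped: List[Int] = partition.map(_ * 10).toList

  // FlatMappedRDD-style compute: zero or more outputs per input element,
  // flattened by the iterator itself, within the same partition.
  val flatMapped: List[Int] = partition.flatMap(n => List.fill(n)(n)).toList
}
```

Here the function passed to `flatMap` returns an ordinary collection, matching the `T => TraversableOnce[U]` shape discussed in the thread; nothing at this level knows about RDDs.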
> > On Sat, Mar 15, 2014 at 9:08 AM, andy petrella <andy.petre...@gmail.com> wrote:
> >
> > > Yep,
> > > regarding flatMap, an implicit parameter might work, like in Scala's
> > > Future for instance:
> > >
> > > https://github.com/scala/scala/blob/master/src/library/scala/concurrent/Future.scala#L246
> > >
> > > Dunno, still waiting for some insights from the team ^^
> > >
> > > andy
> > >
> > > On Wed, Mar 12, 2014 at 3:23 PM, Pascal Voitot Dev <pascal.voitot....@gmail.com> wrote:
> > >
> > > > On Wed, Mar 12, 2014 at 3:06 PM, andy petrella <andy.petre...@gmail.com> wrote:
> > > >
> > > > > Folks,
> > > > >
> > > > > I just want to point something out...
> > > > > I haven't had time yet to sort it out and to think enough to give a valuable,
> > > > > strict explanation of it -- even though, intuitively, I feel there is a lot to it
> > > > > ===> need Spark people or time to move forward.
> > > > > But here is the thing regarding *flatMap*.
> > > > >
> > > > > Actually, it looks like (and again, intuitively it makes sense) RDD (and
> > > > > of course DStream) isn't monadic, and this is reflected in the implementation
> > > > > (and signature) of flatMap:
> > > > >
> > > > > > def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] =
> > > > > >   new FlatMappedRDD(this, sc.clean(f))
> > > > >
> > > > > There!? flatMap (or bind, >>=) should take a function that uses the same
> > > > > higher-level abstraction in order to be considered as such, right?
> > > >
> > > > I had remarked exactly the same thing and asked myself the same question...
> > > >
> > > > > In this case, it takes a function that returns a TraversableOnce, which is
> > > > > the type of the content of the RDD, and what represents the output is more
> > > > > the content of the RDD than the RDD itself (still right?).
> > > > >
> > > > > This actually breaks the understanding of map and flatMap:
> > > > >
> > > > > > def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))
> > > > >
> > > > > Indeed, RDD is a functor, and the underlying reason for flatMap not to take
> > > > > A => RDD[B] doesn't show up in map.
> > > > >
> > > > > This has a lot of consequences actually, because at first one might want to
> > > > > create for-comprehensions over RDDs, or even Traversable[F[_]] functions
> > > > > like sequence -- and one will get stuck since the signatures aren't compliant.
> > > > > More importantly, Scala uses a convention on the structure of a type to allow
> > > > > for-comprehensions... so where Traversable[F[_]] will fail on types, a
> > > > > for-comprehension will fail weirdly.
> > > >
> > > > +1
> > > >
> > > > > Again, this signature sounds normal, because my intuitive feeling about RDDs
> > > > > is that they *only can* be monadic, but the composition would depend on the
> > > > > use case and might have heavy consequences (unioning the RDDs, for instance
> > > > > => this happening behind the scenes can be a big pain, since it wouldn't be
> > > > > efficient at all).
> > > > >
> > > > > So yes, RDD could be monadic, but with care.
> > > >
> > > > At least we can say it is a Functor...
> > > > Actually, I had imagined studying the monadic aspect of RDDs, but as you
> > > > said, it's not so easy...
> > > > So for now, I consider them as pseudo-monadic ;)
> > > >
> > > > > So what this signature exposes is a way to flatMap over the inner value,
> > > > > as is almost the case for Map (flatMapValues).
> > > > >
> > > > > So, wouldn't it be better to rename flatMap to flatMapData (or whatever better
> > > > > name)? Or to have flatMap require a Monad instance of RDD?
> > > >
> > > > Renaming it to flatMapData or flatTraversableMap sounds good to me (even if
> > > > lots of people will hate it...).
> > > > flatMap requiring a Monad would certainly make it impossible to use with
> > > > for-comprehensions, no?
> > > >
> > > > > Sorry for the prose, I just dropped my thoughts and feelings at once :-/
> > > >
> > > > I agree with you, in case it can help you not to feel alone ;)
> > > >
> > > > Pascal
> > > >
> > > > > Cheers,
> > > > > andy
> > > > >
> > > > > PS: and my English maybe; although my name's Andy, I'm a native Belgian ^^.