Just going in head first without much thinking: I changed flatMap to flatMapData and added a new flatMap. For FlatMappedRDD my compute is:
firstParent[T].iterator(split, context).flatMap(f andThen (_.compute(split, context)))

scala> val x = sc.parallelize(1 to 100)
scala> x.flatMap _
res0: (Int => org.apache.spark.rdd.RDD[Nothing]) => org.apache.spark.rdd.RDD[Nothing] = <function1>

My f for flatMap is now f: T => RDD[U]; however, I am not sure how to write a useful function for this :)

On Sat, Mar 15, 2014 at 1:17 PM, Koert Kuipers <ko...@tresata.com> wrote:

> MappedRDD does:
>
>   firstParent[T].iterator(split, context).map(f)
>
> and FlatMappedRDD:
>
>   firstParent[T].iterator(split, context).flatMap(f)
>
> So yeah, it seems like it's a map or flatMap over the iterator inside, not
> over the RDD itself, sort of...
>
> On Sat, Mar 15, 2014 at 9:08 AM, andy petrella <andy.petre...@gmail.com> wrote:
>
>> Yep.
>> Regarding flatMap, an implicit parameter might work like in Scala's
>> Future, for instance:
>>
>> https://github.com/scala/scala/blob/master/src/library/scala/concurrent/Future.scala#L246
>>
>> Dunno, still waiting for some insights from the team ^^
>>
>> andy
>>
>> On Wed, Mar 12, 2014 at 3:23 PM, Pascal Voitot Dev <pascal.voitot....@gmail.com> wrote:
>>
>>> On Wed, Mar 12, 2014 at 3:06 PM, andy petrella <andy.petre...@gmail.com> wrote:
>>>
>>>> Folks,
>>>>
>>>> I just want to point something out...
>>>> I haven't had time yet to sort it out and to think it through enough to
>>>> give a valuable, strict explanation -- even though, intuitively, I feel
>>>> there is a lot to it ===> need Spark people or time to move forward.
>>>> But here is the thing regarding *flatMap*.
>>>>
>>>> Actually, it looks like (and, again, intuitively makes sense) that RDD
>>>> (and of course DStream) aren't monadic, and this is reflected in the
>>>> implementation (and signature) of flatMap:
>>>>
>>>>   def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] =
>>>>     new FlatMappedRDD(this, sc.clean(f))
>>>>
>>>> There!? flatMap (or bind, >>=) should take a function that uses the same
>>>> higher-level abstraction in order to be considered as such, right?
>>>>
>>> I had remarked exactly the same thing and asked myself the same question...
>>>
>>>> In this case, it takes a function that returns a TraversableOnce, which
>>>> is the type of the content of the RDD, and what the output represents is
>>>> more the content of the RDD than the RDD itself (still right?).
>>>>
>>>> This actually breaks the understanding of map and flatMap:
>>>>
>>>>   def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))
>>>>
>>>> Indeed, RDD is a functor, and the underlying reason for flatMap not to
>>>> take A => RDD[B] doesn't show up in map.
>>>>
>>>> This has a lot of consequences, actually, because at first one might want
>>>> to create for-comprehensions over RDDs, or even Traversable[F[_]]
>>>> functions like sequence -- and one will get stuck since the signatures
>>>> aren't compliant. More importantly, Scala uses conventions on the
>>>> structure of a type to allow for-comprehensions... so where
>>>> Traversable[F[_]] will fail on the type, a for-comprehension will fail
>>>> weirdly.
>>>>
>>> +1
>>>
>>>> Again, this signature sounds normal, because my intuitive feeling about
>>>> RDDs is that they *can only* be monadic, but the composition would depend
>>>> on the use case and might have heavy consequences (unioning the RDDs, for
>>>> instance => this happening behind the scenes can be a big pain, since it
>>>> wouldn't be efficient at all).
>>>>
>>>> So yes, RDD could be monadic, but with care.
>>>>
>>> At least we can say it is a functor...
>>> Actually, I had imagined studying the monadic aspect of RDDs, but as you
>>> said, it's not so easy...
>>> So for now, I consider them as pseudo-monadic ;)
>>>
>>>> So what this signature exposes is a way to flatMap over the inner values,
>>>> like it is almost the case for Map (flatMapValues).
>>>>
>>>> So, wouldn't it be better to rename flatMap to flatMapData (or whatever
>>>> better name)? Or to have flatMap require a Monad instance for RDD?
>>>>
>>> Renaming it to flatMapData or flatTraversableMap sounds good to me (even
>>> if lots of people will hate it...).
>>> flatMap requiring a Monad would make it impossible to use with
>>> for-comprehensions, certainly, no?
>>>
>>>> Sorry for the prose, just dropped my thoughts and feelings at once :-/
>>>>
>>> I agree with you, in case it can help not to feel alone ;)
>>>
>>> Pascal
>>>
>>>> Cheers,
>>>> andy
>>>>
>>>> PS: and my English maybe; although my name's Andy, I'm a native Belgian ^^.
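The experiment at the top of the thread -- a flatMap whose f returns an RDD, flattened via f andThen (_.compute(split, context)) -- can be sketched without Spark using a toy container. Everything here (MiniRDD, its compute, collect) is a hypothetical stand-in for illustration, not Spark's actual API:

```scala
// MiniRDD is a hypothetical stand-in for Spark's RDD, invented for this
// sketch: it wraps a List and exposes a compute that yields an Iterator,
// roughly the way RDD#compute yields an iterator per partition.
class MiniRDD[T](data: List[T]) {
  def compute: Iterator[T] = data.iterator

  // Spark's actual flatMap shape: f returns a TraversableOnce, so the
  // function talks about the *content* of the RDD, not the RDD itself.
  def flatMapData[U](f: T => TraversableOnce[U]): MiniRDD[U] =
    new MiniRDD(compute.flatMap(f).toList)

  // The monadic variant tried in the thread: f returns a MiniRDD, and we
  // flatten by running each inner MiniRDD's compute, mirroring
  // f andThen (_.compute(split, context)).
  def flatMap[U](f: T => MiniRDD[U]): MiniRDD[U] =
    new MiniRDD(compute.flatMap(f andThen (_.compute)).toList)

  def collect: List[T] = compute.toList
}

val x = new MiniRDD(List(1, 2, 3))

// One (not terribly useful) f: T => MiniRDD[U] -- replicate each element:
val monadic = x.flatMap(n => new MiniRDD(List.fill(n)(n)))
println(monadic.collect)  // List(1, 2, 2, 3, 3, 3)

// The same result through the TraversableOnce-based signature:
println(x.flatMapData(n => List.fill(n)(n)).collect)  // List(1, 2, 2, 3, 3, 3)
```

As the thread observes, a genuinely useful f: T => RDD[U] is hard to come by: each call would have to materialize a whole distributed dataset per element, and the flattening would effectively union all of them, which is exactly the efficiency concern raised above.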
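On the question of whether a flatMap requiring a Monad instance would break for-comprehensions: a minimal sketch suggests it need not, if the evidence rides along as an implicit parameter, much like the implicit ExecutionContext on scala.concurrent.Future's map/flatMap linked earlier in the thread. The Monad trait and the Box container below are invented for illustration and are not part of Spark:

```scala
// A minimal Monad typeclass sketch (hypothetical; not part of Spark).
trait Monad[F[_]] {
  def pure[A](a: A): F[A]
  def bind[A, B](fa: F[A])(f: A => F[B]): F[B]
}

// Toy container standing in for RDD; flatMap is only available when a
// Monad[Box] instance is in implicit scope.
case class Box[A](values: List[A]) {
  def map[B](f: A => B): Box[B] = Box(values.map(f))
  def flatMap[B](f: A => Box[B])(implicit M: Monad[Box]): Box[B] =
    M.bind(this)(f)
}

implicit val boxMonad: Monad[Box] = new Monad[Box] {
  def pure[A](a: A): Box[A] = Box(List(a))
  // "Union" semantics: concatenate the inner containers, the way a truly
  // monadic RDD flatMap would have to union the produced RDDs.
  def bind[A, B](fa: Box[A])(f: A => Box[B]): Box[B] =
    Box(fa.values.flatMap(a => f(a).values))
}

// The for-comprehension still desugars, because flatMap/map exist on Box
// itself and the implicit evidence is resolved at each call site:
val result = for {
  a <- Box(List(1, 2))
  b <- Box(List(10, 20))
} yield a + b
println(result)  // Box(List(11, 21, 12, 22))
```

So the desugared Box(...).flatMap(a => Box(...).map(b => ...)) compiles as long as the instance is in scope; whether such union-based composition is a good idea for RDDs at scale is the separate efficiency question raised above.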