yo,

First, here is the Scala version: http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=>Boolean):(Repr,Repr)
Second: an RDD is distributed, so what you'll have to do is either partition each partition (:-D) or create two RDDs by filtering twice. Hence the tasks will be scheduled distinctly, and the data read twice. Choose what's best for you!

hth,
andy

On Wed Dec 17 2014 at 5:57:56 PM Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com> wrote:

> Hi all,
>
> I would like to be able to split an RDD in two pieces according to a
> predicate. That would be equivalent to applying filter twice, with the
> predicate and its complement, which is also similar to Haskell's partition
> list function (
> http://hackage.haskell.org/package/base-4.7.0.1/docs/Data-List.html).
> Is there currently any way to do this in Spark, or does anyone have a
> suggestion about how to implement this by modifying the Spark source? I
> think this is valuable because sometimes I need to split an RDD into
> several groups that are too big to fit in the memory of a single thread,
> so pair RDDs are not a solution for those cases. A generalization to n
> parts of Haskell's partition would do the job.
>
> Thanks a lot for your help.
>
> Greetings,
>
> Juan Rodriguez
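For reference, a minimal sketch of the "filter twice" approach Andy describes. The helper name partitionRDD and the SparkContext called sc are my own, not part of the Spark API:

    import org.apache.spark.rdd.RDD

    // Hypothetical helper: split one RDD into two with a predicate,
    // like Haskell's partition. Not part of the Spark API.
    def partitionRDD[T](rdd: RDD[T])(p: T => Boolean): (RDD[T], RDD[T]) = {
      // Caching the parent keeps the second filter from rescanning the
      // input when both halves are evaluated; it trades memory for I/O.
      rdd.cache()
      (rdd.filter(p), rdd.filter(x => !p(x)))
    }

    // Usage, assuming an existing SparkContext named sc:
    // val (evens, odds) = partitionRDD(sc.parallelize(1 to 10))(_ % 2 == 0)

Note the two halves are still materialized by separate jobs, as Andy says; the cache only avoids reading the source data twice.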