Sean, I am going with your answer. So there is no equivalent of partition.
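For anyone who finds this thread later: the closest workaround seems to be
persisting the parent RDD and filtering it twice, so the parent lineage is
evaluated only once even though the two outputs stay separate RDDs. A minimal
sketch (the helper name partitionRDD and the storage level are my own choices,
not a Spark API):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Approximates Scala collections' partition for RDDs. Spark has no
// single-pass equivalent, so we persist the parent and filter it twice;
// once the first action materializes the cache, the second filter reads
// from it instead of recomputing the parent.
def partitionRDD[T](rdd: RDD[T])(p: T => Boolean): (RDD[T], RDD[T]) = {
  val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)
  (cached.filter(p), cached.filter(x => !p(x)))
}

// Usage with the example from this thread:
// val (qtSessionsWithQt, guidUidMapSessions) =
//   partitionRDD(rawQtSession)(_._2.qualifiedTreatmentId != NULL_VALUE)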
On Wed, Jun 3, 2015 at 12:34 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> I checked RDD#randomSplit; it is more like multiple one-to-one
> transformations than a single one-to-multiple transformation.
>
> I wrote the following sample, which generates 3 stages. Although we can
> use cache here to make it better, if Spark supported multiple outputs,
> only 2 stages would be needed. (This would be useful for Pig's
> multi-query execution and Hive's self joins.)
>
> val data = sc.textFile("/Users/jzhang/a.log")
>   .flatMap(line => line.split("\\s"))
>   .map(w => (w, 1))
> val parts = data.randomSplit(Array(0.2, 0.8))
> val joinResult = parts(0).join(parts(1))
> println(joinResult.toDebugString)
>
> (1) MapPartitionsRDD[8] at join at WordCount.scala:22 []
>  |  MapPartitionsRDD[7] at join at WordCount.scala:22 []
>  |  CoGroupedRDD[6] at join at WordCount.scala:22 []
>  +-(1) PartitionwiseSampledRDD[4] at randomSplit at WordCount.scala:21 []
>  |  |  MapPartitionsRDD[3] at map at WordCount.scala:20 []
>  |  |  MapPartitionsRDD[2] at flatMap at WordCount.scala:20 []
>  |  |  /Users/jzhang/a.log MapPartitionsRDD[1] at textFile at WordCount.scala:20 []
>  |  |  /Users/jzhang/a.log HadoopRDD[0] at textFile at WordCount.scala:20 []
>  +-(1) PartitionwiseSampledRDD[5] at randomSplit at WordCount.scala:21 []
>     |  MapPartitionsRDD[3] at map at WordCount.scala:20 []
>     |  MapPartitionsRDD[2] at flatMap at WordCount.scala:20 []
>     |  /Users/jzhang/a.log MapPartitionsRDD[1] at textFile at WordCount.scala:20 []
>     |  /Users/jzhang/a.log HadoopRDD[0] at textFile at WordCount.scala:20 []
>
> On Wed, Jun 3, 2015 at 2:45 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> In the sense here, Spark actually does have operations that make
>> multiple RDDs, like randomSplit. However, there is no equivalent of the
>> partition operation, which gives the elements that matched and did not
>> match at once.
>>
>> On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> As far as I know, Spark doesn't support multiple outputs.
>>>
>>> On Wed, Jun 3, 2015 at 2:15 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Why do you need to do that if the filters and the contents of the
>>>> resulting RDDs are exactly the same? You may as well declare them as
>>>> one RDD.
>>>> On 3 Jun 2015 15:28, "ÐΞ€ρ@Ҝ (๏̯͡๏)" <deepuj...@gmail.com> wrote:
>>>>
>>>>> I want to do this:
>>>>>
>>>>> val qtSessionsWithQt = rawQtSession
>>>>>   .filter(_._2.qualifiedTreatmentId != NULL_VALUE)
>>>>>
>>>>> val guidUidMapSessions = rawQtSession
>>>>>   .filter(_._2.qualifiedTreatmentId == NULL_VALUE)
>>>>>
>>>>> This will run as two different stages. Can this be done in one stage?
>>>>>
>>>>> val (qtSessionsWithQt, guidUidMapSessions) = rawQtSession
>>>>>   .*magicFilter*(_._2.qualifiedTreatmentId != NULL_VALUE)
>>>>>
>>>>> --
>>>>> Deepak
>>>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>
> --
> Best Regards
>
> Jeff Zhang

--
Deepak