Sean, I am going with your answer. So there is no equivalent of partition.
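For anyone who finds this thread later: the closest workaround seems to be
persisting the parent RDD and filtering it twice, so the parent lineage is
evaluated only once even though the two outputs stay separate RDDs. A minimal
sketch (the helper name partitionRDD and the storage level are my own choices,
not a Spark API):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Approximates Scala collections' partition for RDDs. Spark has no
// single-pass equivalent, so we persist the parent and filter it twice;
// once the first action materializes the cache, the second filter reads
// from it instead of recomputing the parent.
def partitionRDD[T](rdd: RDD[T])(p: T => Boolean): (RDD[T], RDD[T]) = {
  val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)
  (cached.filter(p), cached.filter(x => !p(x)))
}

// Usage with the example from this thread:
// val (qtSessionsWithQt, guidUidMapSessions) =
//   partitionRDD(rawQtSession)(_._2.qualifiedTreatmentId != NULL_VALUE)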
On Wed, Jun 3, 2015 at 12:34 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> I checked RDD#randomSplit; it is more like multiple one-to-one
> transformations than a single one-to-multiple transformation.
>
> I wrote the following sample, which generates 3 stages. Although we can
> use cache here to make it better, if Spark supported multiple outputs,
> only 2 stages would be needed. (This would be useful for Pig's
> multi-query execution and Hive's self joins.)
>
> val data = sc.textFile("/Users/jzhang/a.log")
>   .flatMap(line => line.split("\\s"))
>   .map(w => (w, 1))
> val parts = data.randomSplit(Array(0.2, 0.8))
> val joinResult = parts(0).join(parts(1))
> println(joinResult.toDebugString)
>
> (1) MapPartitionsRDD[8] at join at WordCount.scala:22 []
>  |  MapPartitionsRDD[7] at join at WordCount.scala:22 []
>  |  CoGroupedRDD[6] at join at WordCount.scala:22 []
>  +-(1) PartitionwiseSampledRDD[4] at randomSplit at WordCount.scala:21 []
>  |  |  MapPartitionsRDD[3] at map at WordCount.scala:20 []
>  |  |  MapPartitionsRDD[2] at flatMap at WordCount.scala:20 []
>  |  |  /Users/jzhang/a.log MapPartitionsRDD[1] at textFile at WordCount.scala:20 []
>  |  |  /Users/jzhang/a.log HadoopRDD[0] at textFile at WordCount.scala:20 []
>  +-(1) PartitionwiseSampledRDD[5] at randomSplit at WordCount.scala:21 []
>     |  MapPartitionsRDD[3] at map at WordCount.scala:20 []
>     |  MapPartitionsRDD[2] at flatMap at WordCount.scala:20 []
>     |  /Users/jzhang/a.log MapPartitionsRDD[1] at textFile at WordCount.scala:20 []
>     |  /Users/jzhang/a.log HadoopRDD[0] at textFile at WordCount.scala:20 []
>
> On Wed, Jun 3, 2015 at 2:45 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> In the sense here, Spark actually does have operations that make
>> multiple RDDs, like randomSplit. However, there is no equivalent of the
>> partition operation, which gives the elements that matched and did not
>> match at once.
>>
>> On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> As far as I know, Spark doesn't support multiple outputs.
>>>
>>> On Wed, Jun 3, 2015 at 2:15 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Why do you need to do that if the filters and the contents of the
>>>> resulting RDDs are exactly the same? You may as well declare them as
>>>> one RDD.
>>>> On 3 Jun 2015 15:28, "ÐΞ€ρ@Ҝ (๏̯͡๏)" <deepuj...@gmail.com> wrote:
>>>>
>>>>> I want to do this:
>>>>>
>>>>> val qtSessionsWithQt = rawQtSession
>>>>>   .filter(_._2.qualifiedTreatmentId != NULL_VALUE)
>>>>>
>>>>> val guidUidMapSessions = rawQtSession
>>>>>   .filter(_._2.qualifiedTreatmentId == NULL_VALUE)
>>>>>
>>>>> This will run as two different stages. Can this be done in one stage?
>>>>>
>>>>> val (qtSessionsWithQt, guidUidMapSessions) = rawQtSession
>>>>>   .*magicFilter*(_._2.qualifiedTreatmentId != NULL_VALUE)
>>>>>
>>>>> --
>>>>> Deepak
>>>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>
> --
> Best Regards
>
> Jeff Zhang

--
Deepak