Kevin (Sangwoo) Kim wrote
> If you don't have too many keys, you can do something like this:
> 
> val data = List(
>   ("A", Set(1,2,3)),
>   ("A", Set(1,2,4)),
>   ("B", Set(1,2,3))
> )
> val rdd = sc.parallelize(data)
> rdd.persist()
> 
> rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
> rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
> rdd.unpersist()
> 
> ==
> data: List[(String, scala.collection.mutable.Set[Int])] = List((A,Set(1, 2, 3)), (A,Set(1, 2, 4)), (B,Set(1, 2, 3)))
> rdd: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.Set[Int])] = ParallelCollectionRDD[6940] at parallelize at <console>:66
> res332: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
> res334: Long = 4
> res335: Long = 3
> res336: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
> 
> Regards,
> Kevin

Wow, got it! Good solution.
Fortunately, I already know which keys have large Sets, so I was able to adopt
this approach.
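
For anyone landing on this thread later, a self-contained sketch of the same
approach might look like the following (the app setup, the key list and the toy
data are just placeholders, not the real job):

import org.apache.spark.{SparkConf, SparkContext}

object DistinctPerKey {
  def main(args: Array[String]): Unit = {
    // Placeholder local setup; in a real job the context comes from the cluster config.
    val conf = new SparkConf().setAppName("DistinctPerKey").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Toy data standing in for RDD[(String, Set[...])] with large Sets.
    val data = List(
      ("A", Set(1, 2, 3)),
      ("A", Set(1, 2, 4)),
      ("B", Set(1, 2, 3))
    )
    val rdd = sc.parallelize(data)
    rdd.persist() // reused once per key below

    // Keys of interest are assumed to be known up front.
    val keys = Seq("A", "B")
    val counts = keys.map { k =>
      // Count distinct elements across all Sets that share key k.
      k -> rdd.filter(_._1 == k).flatMap(_._2).distinct().count()
    }
    counts.foreach { case (k, n) => println(s"$k -> $n") } // A -> 4, B -> 3

    rdd.unpersist()
    sc.stop()
  }
}

Persisting the RDD up front means each per-key pass reuses the cached data
instead of recomputing the source, which matters once the input is a real
dataset rather than a small in-memory list.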

thanks!

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-include-large-Set-tp21248p21275.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
