Kevin (Sangwoo) Kim wrote
> If keys are not too many,
> you can do it like this:
>
> val data = List(
>   ("A", Set(1,2,3)),
>   ("A", Set(1,2,4)),
>   ("B", Set(1,2,3))
> )
> val rdd = sc.parallelize(data)
> rdd.persist()
>
> rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
> rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
> rdd.unpersist()
>
> ==
> data: List[(String, scala.collection.mutable.Set[Int])] = List((A,Set(1, 2, 3)), (A,Set(1, 2, 4)), (B,Set(1, 2, 3)))
> rdd: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.Set[Int])] = ParallelCollectionRDD[6940] at parallelize at <console>:66
> res332: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
> res334: Long = 4
> res335: Long = 3
> res336: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
>
> Regards,
> Kevin
Wow, got it! Good solution. Fortunately, I know which keys have large sets, so I was able to adopt this approach. Thanks!
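For reference, when there are too many keys to filter one by one, the same distinct counts can be computed for all keys in a single pass with a pair RDD. This is a minimal sketch, not from the original thread; it assumes a spark-shell session where sc is in scope, as in the example above, and the variable names are illustrative:

val data = List(
  ("A", Set(1, 2, 3)),
  ("A", Set(1, 2, 4)),
  ("B", Set(1, 2, 3))
)
val rdd = sc.parallelize(data)

// Flatten each (key, set) pair into individual (key, element) pairs,
// drop duplicate pairs, then count the remaining elements per key.
val distinctCounts = rdd
  .flatMap { case (k, s) => s.map(e => (k, e)) }
  .distinct()
  .countByKey()   // Map("A" -> 4, "B" -> 3)

Note that countByKey() returns the result to the driver as a Map, so this assumes the number of keys (not the sets themselves) fits comfortably in driver memory.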