If there aren't too many keys, you can do something like this:

val data = List(
  ("A", Set(1, 2, 3)),
  ("A", Set(1, 2, 4)),
  ("B", Set(1, 2, 3))
)
val rdd = sc.parallelize(data)
rdd.persist()

rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
rdd.filter(_._1 == "B").flatMap(_._2).distinct.count

rdd.unpersist()

which gives:

data: List[(String, scala.collection.mutable.Set[Int])] = List((A,Set(1, 2, 3)), (A,Set(1, 2, 4)), (B,Set(1, 2, 3)))
rdd: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.Set[Int])] = ParallelCollectionRDD[6940] at parallelize at <console>:66
res332: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
res334: Long = 4
res335: Long = 3
res336: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
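If you have many keys, filtering once per key won't scale well, since each count is a separate pass over the data. As a rough sketch (untested on your data, and assuming the per-key result map is small enough to collect on the driver), you could instead flatten each set into (key, element) pairs, deduplicate, and count per key in a single pass:

val uniqueCounts = rdd
  .flatMap { case (k, s) => s.map(e => (k, e)) } // one (key, element) pair per set member
  .distinct()                                    // drop duplicate (key, element) pairs
  .countByKey()                                  // sketch: collects a Map[key, count] on the driver

// uniqueCounts: scala.collection.Map[String,Long] = Map(A -> 4, B -> 3)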
Regards,
Kevin

On Tue Jan 20 2015 at 2:53:22 PM jagaximo <takuya_seg...@dwango.co.jp> wrote:

> What I want is the unique count for each key, so a plain map() or
> countByKey() won't do it: duplicate strings would be counted more
> than once...