If there are not too many keys, you can do something like this:

// Build a small sample RDD of (key, set) pairs and cache it,
// since it is scanned once per key below.
val data = List(
  ("A", Set(1,2,3)),
  ("A", Set(1,2,4)),
  ("B", Set(1,2,3))
)
val rdd = sc.parallelize(data)
rdd.persist()

// For each key, keep only that key's records, flatten the sets,
// and count the distinct elements.
rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
rdd.unpersist()

==
data: List[(String, scala.collection.mutable.Set[Int])] = List((A,Set(1, 2, 3)), (A,Set(1, 2, 4)), (B,Set(1, 2, 3)))
rdd: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.Set[Int])] = ParallelCollectionRDD[6940] at parallelize at <console>:66
res332: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
res334: Long = 4
res335: Long = 3
res336: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
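
If there are too many keys to filter one at a time, a sketch of an alternative (my own suggestion, assuming you only need the distinct count per key) is to flatten each set into (key, element) pairs, deduplicate the pairs, and count per key:

// Sketch only: flatten every set into (key, element) pairs, drop duplicate
// pairs with distinct(), then count the remaining pairs per key. This does
// one pass over the data regardless of how many keys there are.
val uniqueCountsByKey = rdd
  .flatMap { case (key, set) => set.map(elem => (key, elem)) }
  .distinct()
  .countByKey()   // Map[String, Long], e.g. Map(A -> 4, B -> 3)

Note that countByKey() brings the per-key counts back to the driver, so this assumes the number of distinct keys fits in driver memory.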

Regards,
Kevin



On Tue Jan 20 2015 at 2:53:22 PM jagaximo <takuya_seg...@dwango.co.jp>
wrote:

> What I want to do is get a unique count for each key, so map() or
> countByKey() won't give the unique count (because duplicate strings are
> likely to be counted)...
