Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-20 Thread Kevin (Sangwoo) Kim
Great to hear you found a solution! Cheers!

Kevin

On Wed Jan 21 2015 at 11:13:44 AM jagaximo wrote:
> Kevin (Sangwoo) Kim wrote
> > If keys are not too many,
> > You can do like this:
> >
> > val data = List(
> > ("A", Set(1,2,3)),
> > ("A", Set(1,2,4)),
> > ("B", Set(1,2,3))
> > )
> > val r

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-20 Thread jagaximo
Kevin (Sangwoo) Kim wrote
> If keys are not too many,
> You can do like this:
>
> val data = List(
> ("A", Set(1,2,3)),
> ("A", Set(1,2,4)),
> ("B", Set(1,2,3))
> )
> val rdd = sc.parallelize(data)
> rdd.persist()
>
> rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
> rdd.filter(_._1 =

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Kevin (Sangwoo) Kim
If there are not too many keys, you can do it like this:

    val data = List(
      ("A", Set(1,2,3)),
      ("A", Set(1,2,4)),
      ("B", Set(1,2,3))
    )
    val rdd = sc.parallelize(data)
    // Cache the RDD because it is scanned once per key below.
    rdd.persist()

    // For each key: keep its rows, flatten the sets, count distinct elements.
    rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
    rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
    rdd.unpersist()
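
The per-key filter above scans the RDD once for every key. As a rough sketch of an alternative that handles all keys in one pass, the sets can be flattened into (key, element) pairs and de-duplicated before counting. The object name `DistinctCountPerKey`, the helper `distinctCounts`, and the local master setting are hypothetical choices for illustration, and the sketch assumes a Spark version where pair-RDD operations are available without extra imports (1.3 or later). It also assumes the number of keys is small enough to collect the result on the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DistinctCountPerKey {
  // Flatten each (key, set) into (key, element) pairs, drop duplicate pairs,
  // then count the surviving pairs per key. No per-key Set is ever merged.
  def distinctCounts(rdd: RDD[(String, Set[Int])]): Map[String, Long] =
    rdd
      .flatMap { case (key, values) => values.map(v => (key, v)) }
      .distinct()
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)
      .collect() // assumption: few keys, so the result fits on the driver
      .toMap

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("distinct-count-per-key").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val data = List(("A", Set(1, 2, 3)), ("A", Set(1, 2, 4)), ("B", Set(1, 2, 3)))
      println(distinctCounts(sc.parallelize(data)))
    } finally {
      sc.stop()
    }
  }
}
```

With the sample data from the message, key "A" covers the elements 1, 2, 3, 4 and key "B" covers 1, 2, 3, so the counts should be 4 and 3.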

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread jagaximo
What I want to do is get a unique count for each key. Using map() or countByKey() does not give a unique count (because duplicate strings are likely to be counted)...

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-include-la

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Kevin (Sangwoo) Kim
In your code, you are combining large sets, as in (set1 ++ set2).size, which is not a good idea. (rdd1 ++ rdd2).distinct is an equivalent implementation and will be computed in a distributed manner. I am not sure whether your computation on keyed sets can feasibly be transformed into RDDs.

Regards,
Kev
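
To make the comparison concrete, here is a minimal sketch of the distributed version of (set1 ++ set2).size. The object name `UnionDistinct`, the helper `unionDistinctCount`, the sample strings, and the local master are all hypothetical; the original poster's code is not shown in this thread, so this is only an illustration of the idea.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionDistinct {
  // Distributed counterpart of (set1 ++ set2).size:
  // ++ is RDD union (duplicates kept), distinct() removes them, count() is the action.
  def unionDistinctCount(rdd1: RDD[String], rdd2: RDD[String]): Long =
    (rdd1 ++ rdd2).distinct().count()

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("union-distinct").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq("a", "b", "c"))
      val rdd2 = sc.parallelize(Seq("b", "c", "d"))
      println(unionDistinctCount(rdd1, rdd2))
    } finally {
      sc.stop()
    }
  }
}
```

Unlike a merged in-memory Set, the union here never has to fit in a single JVM heap; only the final count is returned to the driver.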

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Kevin Jung
As far as I know, the steps before the saveAsText call are transformations, so they are computed lazily. The saveAsText action then runs all of those transformations, and that is when your Set[String] grows. If you have only a few keys, this builds a large collection per key, which easily causes an OOM when your executor m
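
The lazy behaviour described here can be observed with a small sketch. It assumes Spark 2.0 or later (for `longAccumulator`, which did not exist when this thread was written) and a local master; the object name `LazyTransformations` and the helper `observe` are hypothetical. Accumulator updates inside transformations are used only as a demonstration device.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyTransformations {
  // Returns how many elements the map function had processed
  // before and after the action was called.
  def observe(sc: SparkContext): (Long, Long) = {
    val touched = sc.longAccumulator("touched")
    // map is a transformation: this line only records the lineage.
    val mapped = sc.parallelize(1 to 100).map { x => touched.add(1); x * 2 }
    val before: Long = touched.value
    // count is an action: the map function actually runs here.
    mapped.count()
    val after: Long = touched.value
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-transformations").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = observe(sc)
      println(s"processed before action: $before, after action: $after")
    } finally {
      sc.stop()
    }
  }
}
```

The same reasoning applies to the memory growth: any large per-key collection is only materialized once an action forces the transformations to run.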

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Pankaj Narang
What happens if you call counted.collect instead of counted.saveAsText("/path/to/save/dir")? If you still face the same issue, please paste the stack trace here.