Great to hear you found a solution!
Cheers!
Kevin
On Wed Jan 21 2015 at 11:13:44 AM jagaximo
wrote:
> Kevin (Sangwoo) Kim wrote
> > If keys are not too many,
> > You can do like this:
> >
> > val data = List(
> > ("A", Set(1,2,3)),
> > ("A", Set(1,2,4)),
> > ("B", Set(1,2,3))
> > )
> > val r
Kevin (Sangwoo) Kim wrote
> If keys are not too many,
> You can do like this:
>
> val data = List(
> ("A", Set(1,2,3)),
> ("A", Set(1,2,4)),
> ("B", Set(1,2,3))
> )
> val rdd = sc.parallelize(data)
> rdd.persist()
>
> rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
> rdd.filter(_._1 =
If there are not too many keys, you can do it like this:
val data = List(
  ("A", Set(1, 2, 3)),
  ("A", Set(1, 2, 4)),
  ("B", Set(1, 2, 3))
)
val rdd = sc.parallelize(data)
rdd.persist()  // cache the RDD, since it is scanned once per key
// count the distinct elements for each key separately
rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
rdd.unpersist()
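If the number of keys grows, filtering once per key gets expensive. A possible alternative, sketched here under assumptions and not taken from the thread, is to flatten every set into (key, element) pairs, deduplicate the pairs, and count per key in a single pass. The object name DistinctCountPerKey, the helper distinctCountPerKey, and the local[*] master are hypothetical choices for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only needed before Spark 1.3
import org.apache.spark.rdd.RDD

object DistinctCountPerKey {
  // Hypothetical helper: number of distinct elements per key, computed in one job.
  def distinctCountPerKey(rdd: RDD[(String, Set[Int])]): RDD[(String, Long)] =
    rdd
      .flatMap { case (key, values) => values.map(v => (key, v)) } // one pair per element
      .distinct()                                                  // drop repeated (key, element) pairs
      .map { case (key, _) => (key, 1L) }
      .reduceByKey(_ + _)                                          // count surviving pairs per key

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, used only to make the sketch runnable on its own.
    val conf = new SparkConf().setAppName("distinct-count-per-key").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val data = List(
        ("A", Set(1, 2, 3)),
        ("A", Set(1, 2, 4)),
        ("B", Set(1, 2, 3))
      )
      val counts = distinctCountPerKey(sc.parallelize(data)).collectAsMap()
      counts.foreach { case (key, n) => println(s"$key -> $n") }
    } finally {
      sc.stop()
    }
  }
}
```

For the sample data this gives 4 for key "A" (elements 1, 2, 3, 4) and 3 for key "B". Because the deduplication happens on pairs spread across partitions, no single per-key Set has to be held in memory.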
What I want to do is get the unique count for each key. Using map() or
countByKey() does not give a unique count, because duplicate strings are
likely to be counted.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-include-la
In your code, you are combining large in-memory sets, like
(set1 ++ set2).size
which is not a good idea, because each combined set has to fit in the memory
of a single executor.
(rdd1 ++ rdd2).distinct
is an equivalent implementation and is computed in a distributed manner.
I am not sure whether your computation on keyed sets can feasibly be
transformed into RDDs.
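As a rough illustration of the RDD form mentioned above, the following sketch counts the distinct elements of two RDDs combined with ++ (an alias for union on RDDs). The object name DistinctUnion, the helper distinctUnionCount, the sample strings, and the local[*] master are all assumptions made for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DistinctUnion {
  // Hypothetical helper: size of the union of two RDDs without building a local Set.
  def distinctUnionCount(rdd1: RDD[String], rdd2: RDD[String]): Long =
    (rdd1 ++ rdd2).distinct().count()

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, used only to make the sketch runnable on its own.
    val conf = new SparkConf().setAppName("distinct-union").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq("a", "b", "c"))
      val rdd2 = sc.parallelize(Seq("b", "c", "d"))
      // The distributed counterpart of (set1 ++ set2).size
      println(distinctUnionCount(rdd1, rdd2))
    } finally {
      sc.stop()
    }
  }
}
```

The union itself only concatenates partitions; the deduplication in distinct() is what triggers a shuffle, so the work is spread over the cluster instead of one JVM heap.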
Regards,
Kev
As far as I know, the steps before the call to saveAsText are transformations,
so they are evaluated lazily. The saveAsText action then runs all of the
transformations, and your Set[String] grows at that point. This creates a
large collection if you have only a few keys, which easily causes an OOM when
your executor m
What happens if you call counted.collect instead of
counted.saveAsText("/path/to/save/dir")?
If you still face the same issue, please paste the stack trace here.