You don't need the sort... use topByKey if your top-k number is small... it
keeps a bounded priority queue on the Java heap instead of sorting the whole
RDD.
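
A minimal sketch of what I mean, assuming Spark 1.3+ (the X stub and the
topKCounts wrapper below are just illustrative names, not from your code).
While at it, map + reduceByKey replaces the groupBy, so counts are combined
map-side instead of materializing a per-key Iterable:

    import org.apache.spark.rdd.RDD

    // Stand-in for your case class; only the first field matters here.
    case class X(a: Long, b: Long, c: Long)

    def topKCounts(smallRdd: RDD[(Long, Unit)],
                   largeRdd: RDD[(Long, X)],
                   k: Int): Array[(Long, Int)] =
      smallRdd.join(largeRdd)
        // Count appearances of each `a` directly; unlike groupBy,
        // reduceByKey combines map-side, so no per-key Iterable is built.
        .map { case (_, (_, X(a, _, _))) => a -> 1 }
        .reduceByKey(_ + _)
        // RDD.top keeps a size-k bounded priority queue per partition (on
        // the Java heap) and merges them on the driver, so the full set of
        // counts is never sorted.
        .top(k)(Ordering.by[(Long, Int), Int](_._2))

topByKey itself (in MLlib's MLPairRDDFunctions, Spark 1.4+) is the per-key
variant of the same idea: it returns the top k values for each key, backed by
the same bounded priority queue.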
On May 24, 2015 10:53 AM, "Tal" <wutal...@gmail.com> wrote:

> Hi,
> I'm running this piece of code in my program:
>
>     smallRdd.join(largeRdd)
>       .groupBy { case (id, (_, X(a, _, _))) => a }
>       .map { case (a, iterable) => a -> iterable.size }
>       .sortBy({ case (_, count) => count }, ascending = false)
>       .take(k)
>
> where basically:
> smallRdd is an RDD of (Long, Unit) with thousands of entries
> (it was an RDD of Longs, but I needed a tuple for the join);
> largeRdd is an RDD of (Long, X), where X is a case class containing three
> Longs, and it has millions of entries.
> /Both RDDs have already been sorted by key./
>
> What I want is: out of the intersection of the two RDDs (by key, which is
> the first Long in the tuple), find the k values of a (the first field of X)
> that appear most often.
>
> This code works but takes far too long: up to 10 minutes on RDDs of sizes
> 20,000 and 8,000,000. I've been commenting out different lines in the
> pipeline and rerunning it, and I can't find a clear bottleneck; each line
> seems to be quite costly.
>
> I am wondering if anyone can think of a better way to do this. In
> particular:
> * Should I use IndexedRDD for the join? Would it significantly improve
> join performance?
> * I really don't like sorting thousands of entries just to get the top k,
> where usually k << smallRdd.count. Is there some kind of select-kth I can
> do on RDDs (and then just filter out the larger elements)? Or any other
> way to improve what's happening here?
>
> thanks!
>
