You don't need the sort: use topByKey if your top-k number is small. It uses a Java priority queue (a heap) under the hood, so nothing gets fully sorted.
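For the overall-top-k shape of the query below, plain RDD.top does the same trick. A minimal, untested sketch, assuming `counts` is the RDD of (a, count) pairs produced by the map step in the code below:

    import org.apache.spark.rdd.RDD

    // top() keeps a k-sized bounded priority queue (a heap) per partition
    // and merges the queues on the driver, so nothing gets fully sorted
    // and there is no shuffle just for ordering.
    def topKByCount(counts: RDD[(Long, Int)], k: Int): Array[(Long, Int)] =
      counts.top(k)(Ordering.by { case (_, count) => count })

    // topByKey itself lives in MLLib (Spark 1.4+) and gives the top values
    // per key, backed by the same BoundedPriorityQueue:
    //   import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
    //   pairs.topByKey(k)   // pairs: RDD[(K, V)], result: RDD[(K, Array[V])]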
On May 24, 2015 10:53 AM, "Tal" <wutal...@gmail.com> wrote:

> Hi,
> I'm running this piece of code in my program:
>
>   smallRdd.join(largeRdd)
>     .groupBy { case (id, (_, X(a, _, _))) => a }
>     .map { case (a, iterable) => a -> iterable.size }
>     .sortBy({ case (_, count) => count }, ascending = false)
>     .take(k)
>
> where basically:
> smallRdd is an RDD of (Long, Unit) with thousands of entries in it
> (it was an RDD of Longs, but I needed a tuple for the join), and
> largeRdd is an RDD of (Long, X), where X is a case class containing
> 3 Longs, and it has millions of entries in it.
> /Both RDDs have already been sorted by key./
>
> What I want is: out of the intersection between the RDDs (by key, which
> is the first Long in the tuple), find the top k by number of appearances
> of the same first value in X (a in this example).
>
> This code works but takes way too long; it can take up to 10 minutes on
> RDDs of sizes 20,000 and 8,000,000. I've been playing around with
> commenting out different lines in the process and running it, and I
> can't seem to find a clear bottleneck; each line seems to be quite
> costly.
>
> I am wondering if anyone can think of a better way to do this:
> * I'm especially wondering if I should use IndexedRDD for the join;
>   would it significantly improve the join performance?
> * I really don't like the fact that I'm sorting thousands of entries
>   just to get the top k, where usually k << smallRdd.count. Is there
>   some kind of select-kth I can do on RDDs (and then just filter the
>   bigger elements)? Or any other way to improve what's happening here?
>
> thanks!
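For reference, a sort-free version of the whole pipeline. This is a rough, untested sketch under the types described above: the field names on X are assumed, and swapping groupBy for reduceByKey is an addition of mine, not something suggested in the thread.

    import org.apache.spark.rdd.RDD

    // X as described in the question: a case class of 3 Longs
    // (the field names here are assumed).
    case class X(a: Long, b: Long, c: Long)

    def topKIntersection(smallRdd: RDD[(Long, Unit)],
                         largeRdd: RDD[(Long, X)],
                         k: Int): Array[(Long, Int)] =
      smallRdd.join(largeRdd)                  // keep only keys present in both
        .map { case (_, (_, x)) => x.a -> 1 }  // one record per appearance of a
        .reduceByKey(_ + _)                    // combines map-side before the shuffle,
                                               // instead of materializing whole groups
        .top(k)(Ordering.by { case (_, count) => count })  // heap-based top k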