Hi, following my previous post <http://apache-spark-user-list.1001560.n3.nabble.com/Help-optimizing-some-spark-code-tc23006.html>, I have been trying to find the best way to intersect an RDD of Longs (ids) with an RDD of (id, value) pairs, such that I end up with just the values for the ids from the first RDD.
For example, if I had

    rdd1 = [ 1L ]
    rdd2 = [ (1L, "one"), (2L, "two"), (3L, "three") ]

then I'd want to get the result

    result = [ "one" ]

Note that rdd2 is far larger than rdd1 (millions vs. thousands of elements).

What I have been doing is mapping the first RDD to tuples, rdd1.map(_ -> ()), so that rdd1 becomes [ (1L, ()) ]. Now I have two RDDs of (Long, _) and can use join operations. But the joins seemed really slow, so I tried to optimize them by repartitioning both RDDs to the same partitions in advance, and also by sorting them, but that didn't help (sorting helps a bit; repartitioning didn't seem to do much - I'm currently running locally). See sketch 1 below for what this looks like.

So then I thought I'd use broadcast variables. The first problem I encountered: a broadcast variable first has to fit in the driver's memory so that I can call sc.broadcast(...), but the map that results from rdd2.collectAsMap() is too large for my memory. Can I broadcast in parts, or better yet, straight from an RDD?

The second thing I found problematic: when I reduced the size of the RDD just for testing, sent it as a broadcast variable (as a map of id -> value), and then did rdd1.map(broadcastVariable.value) to create the new RDD I wanted (sketch 2 below), it was very significantly slower than doing the same lookups locally, even though all those operations should take place on the executor (which is local) without any shuffling or anything like that, since all partitions have the broadcast variable. So why is it so slow, and what can I do about it?

I'd love to get any suggestions here! Thanks!

For reference, here are minimal, simplified sketches of both approaches (run in spark-shell, so sc is the SparkContext; the data and partition count are just placeholders):
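Sketch 1 - the join approach, including the repartitioning attempt:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    val rdd1: RDD[Long] = sc.parallelize(Seq(1L))
    val rdd2: RDD[(Long, String)] =
      sc.parallelize(Seq((1L, "one"), (2L, "two"), (3L, "three")))

    // Key rdd1 by the id itself so both sides are (Long, _) pair RDDs.
    val keyed = rdd1.map(id => (id, ()))

    // The repartitioning attempt: send both sides through the same partitioner
    // up front, hoping the join itself then avoids a shuffle.
    val partitioner = new HashPartitioner(8) // partition count is a placeholder
    val left = keyed.partitionBy(partitioner)
    val right = rdd2.partitionBy(partitioner)
    // (I also tried sorting the keyed RDDs with sortByKey(), which helped only a bit.)

    // Join and keep only the values.
    val joined: RDD[String] = left.join(right).map { case (_, (_, value)) => value }
    joined.collect() // Array("one")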
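Sketch 2 - the broadcast-variable approach (with rdd2 shrunk for testing so that collectAsMap() fits in driver memory):

    // Pull rdd2 to the driver as a map and broadcast it to all executors.
    // With the real rdd2, this collect is what blows up my driver memory.
    val idToValue: Map[Long, String] = rdd2.collectAsMap().toMap
    val broadcastVariable = sc.broadcast(idToValue)

    // Look each id up in the broadcast map; this should need no shuffle,
    // yet it is the step that runs surprisingly slowly for me.
    val values: RDD[String] = rdd1.map(id => broadcastVariable.value(id))
    values.collect() // Array("one")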