Hi, following my previous post
<http://apache-spark-user-list.1001560.n3.nabble.com/Help-optimizing-some-spark-code-tc23006.html>
  
I have been trying to find the best way to intersect an RDD of Longs (ids)
with an RDD of (id, value) pairs, such that I end up with just the values of
the ids from the first RDD.

For example, if I had
   rdd1 = [ 1L ]
   rdd2 = [ (1L, "one"), (2L, "two"), (3L, "three") ]
then I'd want to get the result RDD
  result = [ "one" ]

Note that rdd2 is far larger than rdd1 (millions vs. thousands).

What I have been doing is mapping the first RDD to tuples, like so:
  rdd1.map(_ -> ())
so now rdd1 = [ (1L, ()) ]

Now I have two RDDs of (Long, _) and can use join operations, but the
joins seemed really slow. I tried to optimize them by repartitioning both
RDDs by key in advance, so that matching keys land in the same partitions,
and also by sorting them, but that didn't help much (sorting helps a bit,
repartitioning didn't seem to do much; I'm currently running locally).
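
For reference, the join approach described above might look roughly like the
sketch below. It assumes Spark 1.3 or later (so the pair-RDD implicits are
picked up automatically), a local master, and String values as in the example.
The names JoinSketch and valuesForIds are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object JoinSketch {
  // Keep only the values of `pairs` whose id also appears in `ids`.
  def valuesForIds(ids: RDD[Long], pairs: RDD[(Long, String)]): RDD[String] =
    ids
      .map(id => (id, ()))                    // turn each id into a (key, unit) pair
      .join(pairs)                            // RDD[(Long, (Unit, String))]
      .map { case (_, (_, value)) => value }  // drop the key and the unit

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setMaster("local[*]").setAppName("join-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq(1L))
      val rdd2 = sc.parallelize(Seq((1L, "one"), (2L, "two"), (3L, "three")))
      println(valuesForIds(rdd1, rdd2).collect().mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Note that if an id appears more than once in either RDD, the join emits one
output row per matching pair.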

-------

So then I thought I'd use broadcast variables. The first problem I
encountered: to create a broadcast variable, the value first has to be in
the driver's memory so I can call sc.broadcast(...), but the map that results
from rdd2.collectAsMap() is too large for my memory. Can I broadcast in
parts, or better yet straight from an RDD?
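
Since rdd1 is the small side (thousands of ids), one thing that might sidestep
the memory problem is to broadcast the ids instead of the map and filter rdd2
against them, so rdd2 never has to be collected to the driver. A minimal
sketch under that assumption, not benchmarked. BroadcastIdsSketch and
valuesForIds are hypothetical names, and it assumes the set of ids fits
comfortably in driver and executor memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastIdsSketch {
  // Broadcast the (small) set of ids and filter the (large) pair RDD with it.
  def valuesForIds(sc: SparkContext,
                   ids: RDD[Long],
                   pairs: RDD[(Long, String)]): RDD[String] = {
    // Assumption: `ids` is small enough to collect to the driver.
    val idSet = sc.broadcast(ids.collect().toSet)
    pairs
      .filter { case (id, _) => idSet.value.contains(id) } // no shuffle needed
      .map(_._2)                                            // keep only the value
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setMaster("local[*]").setAppName("broadcast-ids-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq(1L))
      val rdd2 = sc.parallelize(Seq((1L, "one"), (2L, "two"), (3L, "three")))
      println(valuesForIds(sc, rdd1, rdd2).collect().mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```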

The second problem: when I reduced the size of the RDD just for testing and
sent it as a broadcast variable (as a map of id -> value), I then did
  rdd1.map(broadcastVariable.value)
to create the new RDD I wanted. It was very significantly slower than doing
the same thing locally, even though all of this should take place on the
executor (which is local) without any shuffling, since all partitions have
the broadcast variable. So why is it so slow, and what can I do about it?
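
The lookup described above, written out as a minimal sketch on a small test
map. The names BroadcastLookupSketch and lookupValues are illustrative only,
and it assumes the map is small enough to broadcast. It uses flatMap over
get rather than applying the map directly, so ids missing from the map are
skipped instead of throwing NoSuchElementException.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastLookupSketch {
  // Broadcast an id -> value map and look each id up on the executors.
  def lookupValues(sc: SparkContext,
                   ids: RDD[Long],
                   table: Map[Long, String]): RDD[String] = {
    val bc = sc.broadcast(table)
    // `get` returns an Option; ids without an entry produce no output.
    ids.flatMap(id => bc.value.get(id).toList)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setMaster("local[*]").setAppName("broadcast-lookup-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq(1L))
      val table = Map(1L -> "one", 2L -> "two", 3L -> "three")
      println(lookupValues(sc, rdd1, table).collect().mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```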

I'd love to get any suggestions here!
Thanks!




