I am trying to implement a join with co-partitioned inputs. As described in
the documentation, we can avoid shuffling by partitioning elements whose
keys have the same hash code onto the same machine.
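The idea can be sketched in plain Python (a hypothetical `hash_partition` helper, not the Spark API): records whose keys hash to the same value always land in the same partition, so two datasets partitioned the same way are join-aligned.

```python
def partition_of(key, num_partitions):
    # PySpark uses its own portable_hash; plain hash() illustrates the idea.
    return hash(key) % num_partitions

def hash_partition(pairs, num_partitions):
    # Assign each (key, value) pair to the partition its key hashes to.
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts

links = [('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'a')]
parts = hash_partition(links, 3)
# All ('a', ...) pairs end up in the same sublist, and likewise for 'b' and 'c'.
```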

>>> links = sc.parallelize([('a','b'),('a','c'),('b','c'),('c','a')]).groupByKey(3)
>>> links.glom().collect()   # check partitions
[[], [('b', ['c'])], [('a', ['b', 'c']), ('c', ['a'])]]

>>> ranks = sc.parallelize([('a',1),('b',1),('c',1)]).groupByKey(3)
>>> ranks.glom().collect()   # check partitions
[[], [('b', [1])], [('a', [1]), ('c', [1])]]

>>> links.join(ranks).glom().collect()   # as you can see, it returns one partition
[[('a', (['b', 'c'], [1])), ('c', (['a'], [1])), ('b', (['c'], [1]))]]
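For reference, here is a plain-Python sketch (a hypothetical `narrow_join` helper, not the Spark API) of what a shuffle-free join over co-partitioned inputs would look like: because both sides used the same partitioner, matching keys already sit in the same partition slot, so each partition can be joined locally.

```python
def narrow_join(parts_left, parts_right):
    # Join co-partitioned inputs partition-by-partition, with no data movement:
    # keys that match are guaranteed to be in partitions at the same index.
    joined = []
    for left, right in zip(parts_left, parts_right):
        lookup = dict(right)
        joined.append([(k, (v, lookup[k])) for k, v in left if k in lookup])
    return joined

# The glom() outputs from above:
links_parts = [[], [('b', ['c'])], [('a', ['b', 'c']), ('c', ['a'])]]
ranks_parts = [[], [('b', [1])], [('a', [1]), ('c', [1])]]
narrow_join(links_parts, ranks_parts)
# -> [[], [('b', (['c'], [1]))], [('a', (['b', 'c'], [1])), ('c', (['a'], [1]))]]
```

Note the result keeps the original three partitions, unlike the single-partition output observed above.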

How can I make the join preserve the existing co-partitioning (and so avoid the shuffle)?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/join-with-inputs-co-partitioned-tp4433.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.