I am trying to implement a join with co-partitioned inputs. As described in the documentation, we can avoid a shuffle by partitioning elements with the same key hash onto the same machine.
>>> links = sc.parallelize([('a','b'),('a','c'),('b','c'),('c','a')]).groupByKey(3)
>>> links.glom().collect()   # check partitions
[[], [('b', ['c'])], [('a', ['b', 'c']), ('c', ['a'])]]
>>> ranks = sc.parallelize([('a',1),('b',1),('c',1)]).groupByKey(3)
>>> ranks.glom().collect()   # check partitions
[[], [('b', [1])], [('a', [1]), ('c', [1])]]
>>> links.join(ranks).glom().collect()   # as you can see, it returns one partition
[[('a', (['b', 'c'], [1])), ('c', (['a'], [1])), ('b', (['c'], [1]))]]

How can I make the join take advantage of the co-partitioning?
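For reference, here is a minimal sketch of the co-partitioning pattern the documentation describes, assuming plain partitionBy on both inputs (instead of groupByKey) and an explicit partition count passed to join; the data and the count of 3 are just the toy values from above, and this illustrates the pattern rather than guaranteeing how PySpark schedules the join internally:

from pyspark import SparkContext

sc = SparkContext("local", "copartition-sketch")

# Hash-partition both RDDs with the same partition count so that pairs
# with the same key hash land in the same partition index on both sides.
links = sc.parallelize([('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'a')]).partitionBy(3)
ranks = sc.parallelize([('a', 1), ('b', 1), ('c', 1)]).partitionBy(3)

# Passing the same partition count to join keeps the output layout aligned;
# with matching partitioners, keys that already share a partition can be
# joined without moving every row across the cluster again.
joined = links.join(ranks, 3)
print(joined.glom().collect())   # one list of (key, (link, rank)) pairs per partition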