g1 = pairs1.groupByKey().count()        # number of distinct keys in pairs1
pairs1 = pairs1.groupByKey(g1).cache()  # re-group into g1 partitions and cache
g2 = triples.groupByKey().count()       # number of distinct keys in triples
pairs2 = pairs2.groupByKey(g2)          # re-group into g2 partitions
pairs = pairs2.join(pairs1)             # join the two pre-partitioned RDDs

Hi, I want to implement a hash-partitioned join as shown above, but it is taking very long to run. As I understand it, the join should be performed locally, since both RDDs have already been partitioned by key; after partitioning, matching keys should reside on the same node. So shouldn't the join be fast once both sides are partitioned by key? Thank you.
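For reference, here is the co-partitioned join pattern I am trying to reproduce, sketched standalone in PySpark using partitionBy instead of groupByKey (the RDD contents, the partition count, and the sc variable below are just placeholders, not my real data):

from pyspark import SparkContext

sc = SparkContext(appName="copartitioned-join")

# Placeholder pair RDDs of (key, value); my real pairs1/pairs2 are loaded elsewhere.
pairs1 = sc.parallelize([(i % 100, i) for i in range(10000)])
pairs2 = sc.parallelize([(i % 100, -i) for i in range(10000)])

num_partitions = 16  # placeholder; fixed up front rather than derived from a count()

# partitionBy hash-partitions each RDD by key; cache() keeps the partitioned
# data in memory so the join does not have to recompute and re-shuffle it.
pairs1 = pairs1.partitionBy(num_partitions).cache()
pairs2 = pairs2.partitionBy(num_partitions).cache()

# With both sides partitioned the same way, the join should only need to
# match keys within corresponding partitions.
pairs = pairs1.join(pairs2)
print(pairs.count())

My understanding is that when both RDDs have the same number of partitions and the same hash partitioning, the join should avoid a full shuffle; that is the behavior I expected from my code above.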