Hi guys,
I want to create two RDD[(K, V)] objects and co-locate the partitions that
hold the same keys on the same node.
However, even when I use the same partitioner for both RDDs, partitions with
the same keys end up on different nodes.
Here is a small example that illustrates this:
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // pair-RDD implicits (partitionBy)

// Let's say I have 10 nodes
val partitioner = new HashPartitioner(10)
// Create an RDD of (key, value) pairs
val rdd = sc.parallelize(0 until 10).map(k => (k, computeValue(k)))
// Partition twice using the same partitioner; each println runs on an executor
rdd.partitionBy(partitioner).foreach { case (k, v) => println("Dummy1 -> k = " + k) }
rdd.partitionBy(partitioner).foreach { case (k, v) => println("Dummy2 -> k = " + k) }
The output on one node is:
Dummy1 -> k = 2
Dummy2 -> k = 7
I was expecting to see the same keys on each node. That was happening under
Spark 0.9.2, but not under Spark 1.0.x.
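To make the expectation concrete, here is a minimal sketch in plain Scala (no Spark needed) of how, as far as I understand it, HashPartitioner maps a key to a partition index: the key's hashCode taken modulo the number of partitions, kept non-negative. PartitionSketch and its helper names are hypothetical and only for illustration; this is not Spark's own code.

```scala
// Sketch only: mirrors how HashPartitioner is understood to choose a
// partition index. Names here are hypothetical, not part of Spark.
object PartitionSketch {
  // Non-negative remainder, so negative hash codes still give a valid index.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Partition index for a key: null maps to 0, otherwise hashCode mod numPartitions.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    // Same keys and partition count as in the example above.
    (0 until 10).foreach { k =>
      println("k = " + k + " -> partition " + partitionFor(k, 10))
    }
  }
}
```

If that is right, both partitionBy calls put key k into the same partition index, so the difference I am seeing would come from which node each partition's task gets scheduled on, not from the partitioning itself.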
Does anyone know what has changed in the meantime, or how to get
corresponding partitions onto the same node?
Thanks,
Milos
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Partitioning-under-Spark-1-0-x-tp12375.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.