Partitioning under Spark 1.0.x

losmi83 Tue, 19 Aug 2014 03:54:07 -0700

Hi guys,

I want to create two RDD[(K, V)] objects and then collocate partitions with
the same K on one node. 
When the same partitioner for two RDDs is used, partitions with the same K
end up being on different nodes.
Here is a small example that illustrates this:


    // Let's say I have 10 nodes
    val partitioner = new HashPartitioner(10)     
    
    // Create RDD
    val rdd = sc.parallelize(0 until 10).map(k => (k, computeValue(k)))
    
    // Partition twice using the same partitioner
    rdd.partitionBy(partitioner).foreach { case (k, v) => println("Dummy1 ->
k = " + k) } 
    rdd.partitionBy(partitioner).foreach { case (k, v) => println("Dummy2 ->
k = " + k) } 

The output on one node is: 
    Dummy1 -> k = 2 
    Dummy2 -> k = 7 

I was expecting to see the same keys on each node. That was happening under
Spark 0.9.2, but not under Spark 1.0.x.

Anyone has an idea what has changed in the meantime? Or how to get
corresponding partitions on one node?

Thanks,
Milos




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Partitioning-under-Spark-1-0-x-tp12375.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Partitioning under Spark 1.0.x

Reply via email to