This is a back-to-basics question. How do we know when Spark will clone an
object and distribute it with task closures, versus sharing a single instance
with synchronized access?

For example, the old rookie mistake of random number generation:

import scala.util.Random
val randRDD = sc.parallelize(0 until 1000).map(ii => Random.nextGaussian)

One can check that each partition contains a different set of random numbers,
so the RNG clearly was not cloned and shipped with the closure. The reason is
that `Random` is a Scala singleton object, and singletons are not serialized
with task closures: each executor JVM initializes its own instance (with its
own default seed), and within a JVM the underlying `java.util.Random` is
thread-safe, so concurrent tasks simply share it.
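If reproducible randomness is the goal, the usual pattern is to create one
seeded RNG per partition (e.g. via `mapPartitionsWithIndex`, seeding with the
partition index) instead of touching the shared singleton. A minimal sketch of
that pattern in plain Scala, simulating two partitions so it runs without a
SparkContext (the partition layout and seed are illustrative assumptions):

```scala
import scala.util.Random

// Simulates the mapPartitionsWithIndex pattern: one independently seeded RNG
// per partition, so output is reproducible no matter which executor (here:
// which loop iteration) runs the task. The Spark equivalent would be roughly:
//   rdd.mapPartitionsWithIndex { (idx, it) =>
//     val rng = new Random(seed + idx); it.map(_ => rng.nextGaussian())
//   }
def perPartitionGaussians(partitions: Seq[(Int, Range)], seed: Long): Seq[Double] =
  partitions.flatMap { case (idx, range) =>
    val rng = new Random(seed + idx)          // deterministic per-partition seed
    range.map(_ => rng.nextGaussian())
  }

val parts = Seq(0 -> (0 until 500), 1 -> (500 until 1000))
val run1 = perPartitionGaussians(parts, seed = 42L)
val run2 = perPartitionGaussians(parts, seed = 42L)
// run1 == run2: two runs produce identical numbers,
// while the two partitions still differ from each other.
```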
However:

val myMap = collection.mutable.Map.empty[Int, Int]
sc.parallelize(0 until 100).mapPartitions { it =>
  it.foreach(ii => myMap(ii) = ii)
  Iterator(myMap)
}.collect


This shows that each partition got its own deserialized copy of the empty map
and filled it in with its portion of the RDD; the original map on the driver
is never mutated.
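The copy-per-task behavior can be reproduced without a cluster: Spark ships a
task by Java-serializing the closure, including every local value it captures.
A sketch of that round trip in plain Scala (the `roundTrip` helper and the two
fake "partitions" are assumptions for illustration, not Spark API):

```scala
import java.io._
import scala.collection.mutable

// Serialize and deserialize an object, as Spark does when shipping a task
// closure from the driver to an executor.
def roundTrip[T](obj: T): T = {
  val bos = new ByteArrayOutputStream()
  new ObjectOutputStream(bos).writeObject(obj)
  new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    .readObject().asInstanceOf[T]
}

def runDemo(): (Int, Int, Int) = {
  val myMap = mutable.Map.empty[Int, Int]

  // A task closure capturing myMap, like the mapPartitions body above.
  val task: Range => mutable.Map[Int, Int] = { range =>
    range.foreach(ii => myMap(ii) = ii)
    myMap
  }

  // Each "partition" runs a separately deserialized copy of the closure,
  // and therefore mutates its own private copy of the captured map.
  val part0 = roundTrip(task)(0 until 50)
  val part1 = roundTrip(task)(50 until 100)

  (part0.size, part1.size, myMap.size)
}

val (p0, p1, driverSize) = runDemo()
// p0 and p1 each hold only their own partition's entries,
// while the driver-side original stays empty.
println(s"partition copies: $p0, $p1; driver map: $driverSize")
```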
