This is a back-to-basics question. How do we know when Spark will clone an object and ship a copy of it with each task closure, versus sharing one instance and synchronizing access to it?
For example, consider the old rookie mistake of random number generation:

```scala
import scala.util.Random

val randRDD = sc.parallelize(0 until 1000).map(ii => Random.nextGaussian)
```

One can check that each partition contains a different set of random numbers, so the RNG apparently was not cloned; it behaves as though access to a shared instance were coordinated. However:

```scala
val myMap = collection.mutable.Map.empty[Int, Int]

sc.parallelize(0 until 100).mapPartitions { it =>
  it.foreach(ii => myMap(ii) = ii)
  Iterator(myMap)
}.collect
```

This shows that each partition got its own copy of the empty map and filled it in with only its portion of the RDD.
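The map behavior can be reproduced without Spark at all, which suggests it is plain Java serialization at work: when a closure captures a local mutable object, serializing the closure serializes the captured object too, and each deserialization produces an independent copy. The sketch below simulates two "tasks" this way; `ClosureCopyDemo` and `roundTrip` are hypothetical names, and this is only a minimal model of what a Spark task launch might do, not Spark's actual code path.

```scala
import java.io._

object ClosureCopyDemo {
  // Serialize an object and read it back, yielding an independent copy --
  // a crude stand-in for shipping a task closure to an executor.
  def roundTrip[T](obj: T): T = {
    val buf = new ByteArrayOutputStream()
    new ObjectOutputStream(buf).writeObject(obj)
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    in.readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    // Local mutable state, so the lambda below actually captures it.
    val myMap = collection.mutable.Map.empty[Int, Int]
    val fill: Int => Unit = ii => myMap(ii) = ii

    // Two simulated tasks, each receiving its own deserialized closure,
    // and with it its own deserialized copy of the captured map.
    val task1 = roundTrip(fill)
    val task2 = roundTrip(fill)
    (0 until 50).foreach(task1)
    (50 until 100).foreach(task2)

    // The "driver-side" original was never mutated: each task filled
    // in only its own private copy.
    println(myMap.size) // prints 0
  }
}
```

Note that `myMap` must be a local variable for this to hold: a member of a Scala `object` is reached through a static singleton reference rather than being captured and serialized, which is one plausible reading of why the `scala.util.Random` singleton above is not copied with the closure in the first place.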