Hi Grzegorz,

From my understanding, for the cogroup operation (which is used by intersection), if spark.default.parallelism is not set by the user, Spark won't fall back to the built-in default value; instead it uses the maximum partition count among all the RDDs in the cogroup to build a partitioner (provided none of the RDDs already has a partitioner). This is intended to avoid OOM when processing a single task. A simplified sketch of that logic is below.
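The decision is made in Partitioner.defaultPartitioner, which cogroup (and therefore intersection) calls when no partitioner is passed in explicitly. A simplified sketch, paraphrased from the Spark 1.x source rather than copied verbatim (getConf stands in for the package-private conf field used internally):

    import org.apache.spark.{HashPartitioner, Partitioner}
    import org.apache.spark.rdd.RDD

    // Simplified sketch of org.apache.spark.Partitioner.defaultPartitioner
    // (paraphrased; the real method lives inside the org.apache.spark package).
    def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
      // Order all RDDs involved by partition count, largest first.
      val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
      // If any input RDD already has a partitioner, reuse it.
      for (r <- bySize if r.partitioner.isDefined) {
        return r.partitioner.get
      }
      if (rdd.context.getConf.contains("spark.default.parallelism")) {
        // The user set spark.default.parallelism explicitly, so honor it.
        new HashPartitioner(rdd.context.defaultParallelism)
      } else {
        // Otherwise fall back to the largest partition count among the inputs.
        new HashPartitioner(bySize.head.partitions.size)
      }
    }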
So that explains most of your observations: a textFile-generated RDD uses the number of file splits as its partition count, while the parallelize operation uses spark.default.parallelism as its default partition count.

But it does not explain your local[4] case: with textFile as input and spark.default.parallelism set to "7", you get a partition count of 4 for rdd2. That should not happen, as far as I can see.

Best Regards,
Raymond Liu

From: Grzegorz Białek [mailto:grzegorz.bia...@codilime.com]
Sent: Tuesday, August 26, 2014 7:52 PM
To: u...@spark.incubator.apache.org
Subject: spark.default.parallelism bug?

Hi,

consider the following code:

    import org.apache.spark.{SparkContext, SparkConf}

    object ParallelismBug extends App {
      val sConf = new SparkConf()
        .setMaster("spark://hostName:7077") // .setMaster("local[4]")
        .set("spark.default.parallelism", "7") // or without it
      val sc = new SparkContext(sConf)
      val rdd = sc.textFile("input/100")
      // val rdd = sc.parallelize(Array.range(1, 100))
      val rdd2 = rdd.intersection(rdd)
      println("rdd: " + rdd.partitions.size + " rdd2: " + rdd2.partitions.size)
    }

Suppose that input/100 contains 100 files. With the above configuration the output is "rdd: 100 rdd2: 7", which seems OK.

When we don't set spark.default.parallelism, the output is "rdd: 100 rdd2: 100", but according to https://spark.apache.org/docs/latest/configuration.html#execution-behavior it should be "rdd: 100 rdd2: 2" (on my 1-core machine). Yet when rdd is defined using sc.parallelize, the results seem OK: "rdd: 2 rdd2: 2".

Moreover, when the master is local[4] and we set spark.default.parallelism, the result is "rdd: 100 rdd2: 4" instead of "rdd: 100 rdd2: 7". And when we don't set it, it behaves the same as with master spark://hostName:7077.

Am I misunderstanding something, or is this a bug?

Thanks,
Grzegorz
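For reference, the two defaults described in the reply above can be isolated in a few lines. A minimal sketch, assuming a Spark 1.x setup and an input/100 directory containing 100 small files (DefaultsCheck and the app name are illustrative, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    object DefaultsCheck extends App {
      val sc = new SparkContext(
        new SparkConf().setMaster("local[4]").setAppName("defaults-check"))

      // textFile: partition count comes from the number of input splits
      // (here, one split per small file), not from spark.default.parallelism.
      val fromFiles = sc.textFile("input/100")

      // parallelize: partition count comes from sc.defaultParallelism, i.e.
      // spark.default.parallelism if set, otherwise a scheduler-dependent
      // default (the total number of cores in local mode).
      val fromRange = sc.parallelize(1 to 100)

      println("fromFiles: " + fromFiles.partitions.size) // expect 100
      println("fromRange: " + fromRange.partitions.size) // expect sc.defaultParallelism

      sc.stop()
    }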