Hi Grzegorz

        From my understanding, for the cogroup operation (which is used by 
intersection), if spark.default.parallelism is not set by the user, Spark won't 
fall back to the built-in default value; instead it uses the partition count 
(the max one among all the RDDs in the cogroup operation) to build a 
partitioner (assuming none of the RDDs already has a partitioner). This is 
intended to avoid OOM when processing a single task.

        So that explains most of your observations: an RDD generated by 
textFile uses the number of file splits as its partition count, while 
parallelize uses spark.default.parallelism as its default partition count.
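To make the rule above concrete, here is a minimal sketch (plain Scala, not Spark's actual source) of the partition-count choice it describes; `choosePartitions` and its parameters are hypothetical names, with `defaultParallelism` standing in for whether spark.default.parallelism was set:

```scala
// Sketch of the described behavior: if spark.default.parallelism is set,
// cogroup uses it; otherwise it falls back to the max partition count
// among the input RDDs (assuming none has a partitioner already).
def choosePartitions(rddPartitionCounts: Seq[Int],
                     defaultParallelism: Option[Int]): Int =
  defaultParallelism match {
    case Some(n) => n                      // user-set value wins
    case None    => rddPartitionCounts.max // avoid one huge task per split
  }

// intersection(rdd, rdd) where rdd came from textFile with 100 splits:
assert(choosePartitions(Seq(100, 100), Some(7)) == 7)  // parallelism set
assert(choosePartitions(Seq(100, 100), None) == 100)   // parallelism unset
```

Under this reading, rdd2 having 100 partitions when parallelism is unset is the intended fallback, not the documented cluster default.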

  But that does not explain your local[4] case: with textFile input and 
spark.default.parallelism set to "7", rdd2's partition count comes out as 4. 
That should not happen, as far as I can tell.
  
Best Regards,
Raymond Liu

From: Grzegorz Białek [mailto:grzegorz.bia...@codilime.com] 
Sent: Tuesday, August 26, 2014 7:52 PM
To: u...@spark.incubator.apache.org
Subject: spark.default.parallelism bug?

Hi, 

consider the following code:

import org.apache.spark.{SparkContext, SparkConf}
object ParallelismBug extends App {
  var sConf = new SparkConf()
    .setMaster("spark://hostName:7077") // .setMaster("local[4]")
    .set("spark.default.parallelism", "7") // or without it
  val sc = new SparkContext(sConf)
  val rdd = sc.textFile("input/100") // or: sc.parallelize(Array.range(1, 100))
  val rdd2 = rdd.intersection(rdd)
  println("rdd: " + rdd.partitions.size + " rdd2: " + rdd2.partitions.size)
}

Suppose that input/100 contains 100 files. With the above configuration the 
output is rdd: 100 rdd2: 7, which seems OK. When we don't set parallelism, the 
output is rdd: 100 rdd2: 100, but according to 
https://spark.apache.org/docs/latest/configuration.html#execution-behavior 
it should be rdd: 100 rdd2: 2 (on my 1-core machine).
But when rdd is defined using sc.parallelize, the results seem OK: rdd: 2 rdd2: 2.
Moreover, when the master is local[4] and we set parallelism, the result is 
rdd: 100 rdd2: 4 instead of rdd: 100 rdd2: 7. And when we don't set 
parallelism, it behaves like with master spark://hostName:7077.

Am I misunderstanding something, or is it a bug?

Thanks,
Grzegorz
