Hi All,

The code in RangePartitioner:

    // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
    val sampleSize = math.min(20.0 * partitions, 1e6)
    // Assume the input partitions are roughly balanced and over-sample a little bit.
    val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt

The constants 20.0 and 3.0 are hardcoded. Why are they fixed? Do they come from a white paper or other research?

Regards
-Raintung Li
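
P.S. To make the arithmetic concrete, here is a minimal standalone sketch of the two lines quoted above. The object name SampleSizeSketch, the helper sampleSizes, and its parameter names are hypothetical and only for illustration; they are not part of Spark. It only reproduces the calculation and says nothing about where the constants come from.

```scala
// Hypothetical helper that mirrors the two quoted RangePartitioner lines.
// It is NOT Spark's API; it only reproduces the arithmetic.
object SampleSizeSketch {
  def sampleSizes(outputPartitions: Int, inputPartitions: Int): (Double, Int) = {
    // 20 samples per output partition, capped at 1M in total.
    val sampleSize = math.min(20.0 * outputPartitions, 1e6)
    // Over-sample by a factor of 3, spread evenly across the input partitions.
    val sampleSizePerPartition = math.ceil(3.0 * sampleSize / inputPartitions).toInt
    (sampleSize, sampleSizePerPartition)
  }

  def main(args: Array[String]): Unit = {
    val (total, perPartition) = sampleSizes(outputPartitions = 200, inputPartitions = 50)
    println(s"sampleSize = $total, sampleSizePerPartition = $perPartition")
  }
}
```

With 200 output partitions and 50 input partitions, this gives a sampleSize of 4000 and a sampleSizePerPartition of 240, so each input partition is asked for three times its even share of 80 samples.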