Hi,

I am using Spark 1.6.2. Is there a known bug where the number of partitions is always 2 when minPartitions is not specified, as in the call below?
    images = sc.binaryFiles("s3n://<ACCESS_KEY>:<SECRET_KEY>@imagefiles-gok/locofiles-data/")

I was looking at the source code of PortableDataStream.scala, which I believe is what gets used when we invoke the binaryFiles interface, and I see the code below:

    def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
      val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
      val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
      val defaultParallelism = sc.defaultParallelism
      val files = listStatus(context).asScala
      val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
      val bytesPerCore = totalBytes / defaultParallelism
      val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
      super.setMaxSplitSize(maxSplitSize)
    }

Does this mean that minPartitions is no longer used when calculating the number of partitions? Kindly advise.

Thanks,
Jayadeep
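
P.S. To check my reading of setMinPartitions, here is its arithmetic spelled out in plain Python for a hypothetical cluster and dataset. The 128 MB and 4 MB defaults for spark.files.maxPartitionBytes and spark.files.openCostInBytes are my assumption, and the file sizes are made up:

    # Sketch of the maxSplitSize formula from the Scala above, with assumed
    # config defaults and hypothetical inputs.
    default_max_split_bytes = 128 * 1024 * 1024  # assumed spark.files.maxPartitionBytes
    open_cost_in_bytes = 4 * 1024 * 1024         # assumed spark.files.openCostInBytes
    default_parallelism = 8                      # hypothetical 8-core cluster
    file_sizes = [10 * 1024 * 1024] * 100        # hypothetical: 100 files of 10 MB each

    # Each file is padded by the open cost before summing, as in the Scala.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    max_split_size = min(default_max_split_bytes,
                         max(open_cost_in_bytes, bytes_per_core))
    print(max_split_size)  # 134217728 (128 MB) for these inputs

As far as I can tell, minPartitions never enters this formula; only the two config values and defaultParallelism do.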
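
P.P.S. In case it helps, these are the two workarounds I am considering; this is only a sketch, the bucket path is a placeholder, and 64 is an arbitrary target partition count:

    # Workaround 1: pass minPartitions explicitly, using the PySpark 1.6
    # signature binaryFiles(path, minPartitions=None), assuming this version
    # still honors it.
    images = sc.binaryFiles("s3n://<bucket>/locofiles-data/", minPartitions=64)
    print(images.getNumPartitions())

    # Workaround 2: repartition after the read; this guarantees 64 partitions
    # but shuffles the file bytes across the cluster.
    images = sc.binaryFiles("s3n://<bucket>/locofiles-data/").repartition(64)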