If the underlying file system returns files in a non-alphabetical order to java.io.File.listFiles(), Spark reads the partitions out of order. Here's an example.
    val sc = new SparkContext("local[3]", "test")
    val rdd1 = sc.parallelize(Seq(1, 2, 3, 4, 5))
    rdd1.saveAsTextFile("file://path/to/file")
    val rdd2 = sc.textFile("file://path/to/file")
    rdd2.collect()

rdd1 is saved to file://path/to/file in three partitions (i.e. /path/to/file/part-00000, /path/to/file/part-00001, /path/to/file/part-00002). Since File.listFiles(), which is used by org.apache.hadoop.fs.RawLocalFileSystem.listStatus(), returns the partitions out of order, rdd2 ends up with its rows in a different order from rdd1. Note that the File.listFiles() Javadoc explicitly states that it does not guarantee any order.

The behavior of RawLocalFileSystem is fine for MapReduce jobs, which don't care about order, but for Spark, which has a notion of row order, this looks like a bug. The correct fix would be to sort the files after calling File.listFiles(). It may also be possible to fix this externally by creating a wrapper org.apache.hadoop.fs.FileSystem class that sorts the file list before returning it.

Was this considered in the original design? Is this a bug?

Mingyu
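For reference, the sort-after-listing fix described above can be sketched with plain java.io.File; the helper name sortedPartFiles and the demo file names are hypothetical, and a real fix would live inside (or wrap) RawLocalFileSystem.listStatus():

```scala
import java.io.File
import java.nio.file.Files

// File.listFiles() makes no ordering guarantee, so sort explicitly by name.
// Part files are zero-padded (part-00000, ...), so a lexicographic sort
// matches the numeric partition order.
def sortedPartFiles(dir: File): Array[File] =
  dir.listFiles().sortBy(_.getName)

// Demo: create part files out of order in a temp directory, then list them sorted.
val dir = Files.createTempDirectory("parts").toFile
Seq("part-00002", "part-00000", "part-00001").foreach { n =>
  new File(dir, n).createNewFile()
}
val names = sortedPartFiles(dir).map(_.getName)
println(names.mkString(","))  // part-00000,part-00001,part-00002
```

A wrapper FileSystem would apply the same sortBy to the FileStatus array returned by the underlying listStatus() before handing it to Spark.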