If the underlying file system returns files in a non-alphabetical order to
java.io.File.listFiles(), Spark reads the partitions out of order. Here's an
example.

val sc = new SparkContext("local[3]", "test")
val rdd1 = sc.parallelize(Seq(1, 2, 3, 4, 5))
rdd1.saveAsTextFile("file:///path/to/file")
val rdd2 = sc.textFile("file:///path/to/file")
rdd2.collect()

rdd1 is saved to file:///path/to/file in three partitions
(/path/to/file/part-00000, /path/to/file/part-00001,
/path/to/file/part-00002). Since File.listFiles(), which
org.apache.hadoop.fs.RawLocalFileSystem.listStatus() uses internally, returns
the partition files in no particular order, rdd2 can end up with its rows in a
different order from rdd1. Note that the File.listFiles() Javadoc explicitly
states that it doesn't guarantee any ordering.

The behavior of RawLocalFileSystem is fine for MapReduce jobs, which don't
depend on file ordering, but for Spark, which has a notion of row order, this
looks like a bug. The correct fix would be to sort the files after calling
File.listFiles().
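For illustration, here is a minimal, standalone Java sketch of that fix (the
class and helper names are hypothetical, not from Hadoop): since
File.listFiles() makes no ordering guarantee, sorting the returned array by
name before reading restores a deterministic part-file order.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Comparator;

public class SortedPartitionListing {
    // Hypothetical helper: list a directory's files in lexicographic order.
    // This is the sort step that listStatus() would need before returning,
    // because File.listFiles() returns entries in file-system order.
    static File[] listPartsSorted(File dir) {
        File[] parts = dir.listFiles();   // order is file-system dependent
        if (parts == null) {
            return new File[0];           // missing dir or not a directory
        }
        Arrays.sort(parts, Comparator.comparing(File::getName));
        return parts;                     // part-00000, part-00001, ...
    }

    public static void main(String[] args) throws IOException {
        // Simulate Spark's saveAsTextFile output with three part files
        // created deliberately out of order.
        File dir = Files.createTempDirectory("parts").toFile();
        for (String name : new String[] {"part-00002", "part-00000", "part-00001"}) {
            new File(dir, name).createNewFile();
        }
        for (File f : listPartsSorted(dir)) {
            System.out.println(f.getName());  // prints part-00000, 00001, 00002
        }
    }
}
```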

It may also be possible to fix this by creating a wrapper
org.apache.hadoop.fs.FileSystem class that sorts the file list before
returning it. Was this considered in the original design? Is this a bug?

Mingyu

