Hi,

Sorry to dig up this thread, but this bug is still present.
The fix proposed in this thread (creating a new FileSystem implementation which sorts listed files) was rejected, with the suggestion that it is the FileInputFormat's responsibility to sort the file names if preserving partition order is desired:
https://github.com/apache/spark/pull/4204

Given that Spark RDDs are supposed to preserve the order of the collections they represent, I think this still deserves to be fixed in Spark. As a user, I expect that if I use saveAsTextFile and then load the resulting file with sparkContext.textFile, I obtain a dataset in the same order.

Because Spark uses the FileInputFormats exposed by Hadoop, that would mean either patching Hadoop so that it sorts file names directly (which is unlikely to succeed, since Hadoop might not care about the ordering in general), or creating subclasses of all Hadoop formats used in Spark and adding the required sorting to the listStatus method. This strikes me as less elegant than implementing a new FileSystem as suggested by Reynold, though.

Another way to "fix" this would be to mention in the docs that order is not preserved in this scenario, which would hopefully spare others unpleasant surprises (just like we already have a caveat about nondeterminism of order after shuffles).

I would be happy to try submitting a fix for this, if there is a consensus around the correct course of action.

Cheers,
Antonin
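
P.S. To make the sorting step concrete, here is a minimal sketch in plain Python. It is not Spark or Hadoop code: `sort_part_files` is a hypothetical helper, and it assumes the output files follow the `part-NNNNN` naming that saveAsTextFile produces. It only illustrates what a sorted listStatus would have to do, namely return the part files in partition order regardless of the order in which the file system lists them.

```python
import re

# Assumption: part files are named "part-<index>", optionally followed by
# an extension such as ".gz". Anything else (e.g. "_SUCCESS") is skipped.
_PART_NAME = re.compile(r"part-(\d+)")


def sort_part_files(listed):
    """Return the part files from `listed`, ordered by partition index.

    `listed` is a list of path strings in whatever order the file system
    happened to return them.
    """
    indexed = []
    for path in listed:
        name = path.rsplit("/", 1)[-1]  # keep only the file name
        match = _PART_NAME.match(name)
        if match:
            indexed.append((int(match.group(1)), path))
    # Sort numerically on the index rather than lexicographically on the
    # name, so the order does not depend on the zero-padding width.
    return [path for _, path in sorted(indexed)]
```

Sorting on the parsed index rather than on the raw name is a design choice of this sketch; a plain sort by file name gives the same result as long as all indices are zero-padded to the same width.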