Hi,

Sorry to dig up this old thread, but this bug is still present.

The fix proposed in this thread (creating a new FileSystem implementation
which sorts listed files) was rejected, with the suggestion that it is the
FileInputFormat's responsibility to sort the file names if preserving
partition order is desired:
https://github.com/apache/spark/pull/4204

Given that Spark RDDs are supposed to preserve the order of the collections
they represent, I think this still deserves to be fixed in Spark. As a user,
I expect that if I write a dataset with saveAsTextFile and then load the
resulting files with sparkContext.textFile, I get the dataset back in the
same order.

Because Spark uses the FileInputFormats exposed by Hadoop, that would mean
either patching Hadoop so that it sorts file names directly (which is likely
to be rejected, since Hadoop may not care about ordering in general), or
creating subclasses of all Hadoop formats used in Spark and adding the
required sorting to their listStatus methods. This strikes me as less elegant
than implementing a new FileSystem as suggested by Reynold, though.
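
To make the sorting step concrete, here is a minimal sketch in plain Java. It
is only a stand-in: it uses no Hadoop or Spark classes, the class and method
names (SortedListing, sortByName) are made up for illustration, and it assumes
the part files have zero-padded names of equal width (part-00000, part-00001,
...), so that lexicographic order matches partition order. A real fix would
apply the same kind of ordering to whatever listStatus returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustration of the sorting step only; not a Hadoop or Spark API.
class SortedListing {

    // Returns a copy of the listed file names in lexicographic order.
    // Assumption: names are zero-padded to the same width, so this order
    // is also the partition order.
    static List<String> sortByName(List<String> listedNames) {
        List<String> sorted = new ArrayList<>(listedNames);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical order in which a file system might list the files.
        List<String> listed =
            Arrays.asList("part-00002", "part-00000", "part-00001");
        System.out.println(sortByName(listed));
    }
}
```

Whether the sort lives in a FileSystem wrapper or in a FileInputFormat
subclass, the operation itself is the same; the two options differ only in
where it is applied.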

Another way to "fix" this would be to mention in the docs that order is not
preserved in this scenario, which would hopefully spare others a bad surprise
(just as we already have a caveat about the nondeterminism of order after
shuffles).

I would be happy to try submitting a fix for this, if there is a consensus
around the correct course of action.

Cheers,
Antonin


