Internally, saveAsTextFile uses saveAsHadoopFile: https://github.com/apache/spark/blob/d5911d1173fe0872f21cae6c47abf8ff479345a4/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala

The final bit of that method first creates the output path and then saves the data set. However, if the saveAsHadoopDataset call fails, the path remains. Technically, we could add an exception-handling section that removes the path in case of problems; that would be a nice way of making sure we don't litter the file system with empty files and directories when exceptions occur.
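A minimal sketch of that cleanup idea, written as a caller-side wrapper rather than a change to PairRDDFunctions; saveAsTextFileOrCleanUp is a hypothetical helper, not Spark API:

import scala.util.control.NonFatal

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of Spark): save an RDD as text and,
// if the save or the computation it triggers fails, delete the output
// directory that saveAsTextFile may already have created.
def saveAsTextFileOrCleanUp(sc: SparkContext, rdd: RDD[String], output: String): Unit = {
  try {
    rdd.saveAsTextFile(output)
  } catch {
    case NonFatal(e) =>
      val fs = FileSystem.get(sc.hadoopConfiguration)
      // Recursive delete, so any part files written before the failure
      // are removed along with the directory itself.
      fs.delete(new Path(output), true)
      throw e
  }
}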
So, to your question: the parameter to saveAsTextFile is a path (not a file), and it has to be empty. Spark automatically names the output files part-NNNNN (e.g. part-00000), where the number is the partition index; this follows directly from the partitioning scheme of the RDD itself.

The real problem, though, is that the calculation itself is failing. You might want to fix that first. Just post the relevant bits from the log.

Hi all,

I've tried to execute something like this:

result.map(transform).saveAsTextFile(hdfsAddress)

result is an RDD computed by an MLlib algorithm. I submitted this to YARN, and after two attempts the application failed, but the exception in the log is very misleading: it says hdfsAddress already exists. Actually, the log of the first attempt shows that the exception came from the calculation of result. Although that attempt failed, it still created the output directory, so attempt 2 began with the exception "file already exists". Why was the output directory created when the RDD calculation had already failed? That's not so good, I think.
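To make the failure mode in the question concrete: a leftover directory from attempt 1 can be removed before retrying. A minimal sketch, assuming the sc, result, transform, and hdfsAddress names from the snippet above; the explicit pre-delete is a workaround, not built-in Spark behavior:

import org.apache.hadoop.fs.{FileSystem, Path}

// Workaround sketch: remove a leftover output directory from a failed
// earlier attempt, then save. hdfsAddress names a directory, not a file.
val outputPath = new Path(hdfsAddress)
val fs = FileSystem.get(sc.hadoopConfiguration)
if (fs.exists(outputPath)) {
  fs.delete(outputPath, true) // recursive: removes stale part files too
}
result.map(transform).saveAsTextFile(hdfsAddress)
// On success, the directory holds one part-NNNNN file per partition
// (part-00000, part-00001, ...) plus a _SUCCESS marker.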