The function saveAsTextFile
<https://github.com/apache/spark/blob/7d9cc9214bd06495f6838e355331dd2b5f1f7407/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1066>
is a wrapper around saveAsHadoopFile
<https://github.com/apache/spark/blob/21570b463388194877003318317aafd842800cac/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L592>
and, from looking at the source, I don't see any flag to overwrite existing
files.  It is, however, trivial to do this using the HDFS API directly from Scala.

val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(
  new java.net.URI("hdfs://localhost:9000"), hadoopConf)


You can now use hdfs to do all sorts of useful things, such as listing
directories or recursively deleting output directories, e.g.

// Delete the existing path; ignore any exception thrown if it doesn't exist
val output = "hdfs://localhost:9000/tmp/wimbledon_top_mentions"
try { hdfs.delete(new org.apache.hadoop.fs.Path(output), true) } catch { case _: Throwable => }
top_mentions.saveAsTextFile(output)
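If you would rather not swallow every Throwable, the same delete-then-write pattern can be made explicit with an existence check first. The sketch below is a hypothetical local-filesystem analogue using java.nio.file (overwriteOutput is my own name, not a Spark or Hadoop API); against a real cluster you would call hdfs.exists(path) followed by hdfs.delete(path, true) on the FileSystem handle from the snippet above.

```scala
import java.nio.file.{Files, Path}
import java.util.Comparator

// Hypothetical helper mirroring the HDFS delete-then-save pattern on the
// local filesystem; on HDFS, replace the existence check and recursive walk
// with hdfs.exists(path) and hdfs.delete(path, true).
def overwriteOutput(output: Path, lines: Seq[String]): Unit = {
  if (Files.exists(output)) {
    // Recursive delete: visit deepest entries first so each directory is
    // empty by the time we delete it.
    Files.walk(output)
      .sorted(Comparator.reverseOrder[Path]())
      .forEach(p => Files.delete(p))
  }
  Files.createDirectories(output)
  // Write a single "part" file, as saveAsTextFile does per partition.
  Files.write(output.resolve("part-00000"),
    lines.mkString("\n").getBytes("UTF-8"))
}
```

Running it twice against the same output path succeeds, where the Spark 1.0 default would throw FileAlreadyExistsException on the second run.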


For an illustrated example of how I do this, see HDFSDeleteExample.scala
<https://gist.github.com/cotdp/b3512dd1328f10ee9257>



*Michael Cutler*
Founder, CTO


*Mobile: +44 789 990 7847*
*Email:   mich...@tumra.com <mich...@tumra.com>*
*Web:     tumra.com <http://tumra.com/?utm_source=signature&utm_medium=email>*
*Visit us at our offices in Chiswick Park <http://goo.gl/maps/abBxq>*
*Registered in England & Wales, 07916412. VAT No. 130595328*




On 2 June 2014 09:26, Pierre Borckmans <
pierre.borckm...@realimpactanalytics.com> wrote:

> +1 Same question here...
>
> Message sent from a mobile device - excuse typos and abbreviations
>
> On 2 June 2014 at 10:08, Kexin Xie <kexin....@bigcommerce.com> wrote:
>
> Hi,
>
> Spark 1.0 changes the default behaviour of RDD.saveAsTextFile to
> throw org.apache.hadoop.mapred.FileAlreadyExistsException when the file
> already exists.
>
> Is there a way I can allow Spark to overwrite the existing file?
>
> Cheers,
> Kexin
>
>
