Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

Patrick Wendell Mon, 02 Jun 2014 12:03:58 -0700

Hey There,

The issue was that the old behavior could cause users to silently
overwrite data, which is pretty bad, so to be conservative we decided
to enforce the same checks that Hadoop does.
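
To make the failure mode concrete, here is a minimal sketch that simulates it on a local filesystem with plain Python. It does not call Spark or Hadoop; the helper names (write_parts, read_tags, checked_write_parts) are hypothetical and exist only for this illustration. It shows how writing 15 part files over an existing set of 20 leaves 5 stale files behind, and how an "output directory already exists" check refuses the second write instead.

```python
import os
import tempfile


def write_parts(out_dir, num_parts, tag):
    """Simulated unchecked save: write part-00000.. over whatever is there."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(num_parts):
        with open(os.path.join(out_dir, "part-%05d" % i), "w") as f:
            f.write(tag + "\n")


def read_tags(out_dir):
    """Map each part file name to the tag (run label) stored in it."""
    tags = {}
    for name in sorted(os.listdir(out_dir)):
        with open(os.path.join(out_dir, name)) as f:
            tags[name] = f.read().strip()
    return tags


def checked_write_parts(out_dir, num_parts, tag):
    """Simulated conservative save: refuse to write if out_dir exists."""
    if os.path.exists(out_dir):
        raise FileExistsError("Output directory %s already exists" % out_dir)
    write_parts(out_dir, num_parts, tag)


if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "output")
    write_parts(out, 20, "run1")   # first job: 20 part files
    write_parts(out, 15, "run2")   # second job: only 15 part files
    tags = read_tags(out)
    stale = [name for name, tag in tags.items() if tag == "run1"]
    print(len(tags), "files in output,", len(stale), "left over from run1")
    try:
        checked_write_parts(out, 15, "run3")
    except FileExistsError as err:
        print("refused:", err)
```

The unchecked variant never raises, which is what makes the mixed output easy to miss.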


This was documented by this JIRA:
https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1

However, it would be very easy to add an option that allows preserving
the old behavior. Is anyone here interested in contributing that? I
created a JIRA for it:

https://issues.apache.org/jira/browse/SPARK-1993
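
For anyone weighing what such an option could look like, here is a rough sketch in plain Python against the local filesystem. It is an illustration only, not Spark's API and not the design proposed in the JIRA: the function save_parts and its overwrite flag are hypothetical names. Note that it deliberately differs from the old behavior: when overwrite is requested it deletes the existing directory first, so no part files from an earlier run can survive.

```python
import os
import shutil
import tempfile


def save_parts(out_dir, num_parts, tag, overwrite=False):
    """Write part files to out_dir.

    Hypothetical helper: by default it refuses to touch an existing
    directory; with overwrite=True it removes the directory first.
    """
    if os.path.exists(out_dir):
        if not overwrite:
            raise FileExistsError(
                "Output directory %s already exists" % out_dir)
        # Remove the whole previous output so nothing stale is left behind.
        shutil.rmtree(out_dir)
    os.makedirs(out_dir)
    for i in range(num_parts):
        with open(os.path.join(out_dir, "part-%05d" % i), "w") as f:
            f.write(tag + "\n")


if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "output")
    save_parts(out, 20, "run1")
    save_parts(out, 15, "run2", overwrite=True)
    print(sorted(os.listdir(out)))  # only the 15 files from run2 remain
```

Deleting first is not atomic: a job that fails after the delete leaves no output at all, which is a trade-off to keep in mind.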

- Patrick

On Mon, Jun 2, 2014 at 9:22 AM, Pierre Borckmans
<pierre.borckm...@realimpactanalytics.com> wrote:
> Indeed, the behavior has changed, for better or worse. I agree with the
> danger you mention, but I'm not sure it actually happens like that. Isn't
> there an overwrite mechanism in Hadoop that automatically removes the old
> part files, then writes a _temporary folder, and only then writes the part
> files along with the _SUCCESS file?
>
> In any case this change of behavior should be documented IMO.
>
> Cheers
> Pierre
>
> Message sent from a mobile device - excuse typos and abbreviations
>
> On 2 June 2014 at 17:42, Nicholas Chammas <nicholas.cham...@gmail.com>
> wrote:
>
> What I've found using saveAsTextFile() against S3 (prior to Spark 1.0.0) is
> that files get overwritten automatically. There is one danger to this, though.
> If I save to a directory that already has 20 part- files, but this time
> around I'm only saving 15 part- files, then there will be 5 leftover part-
> files from the previous set mixed in with the 15 newer files. This is
> potentially dangerous.
>
> I haven't checked to see if this behavior has changed in 1.0.0. Are you
> saying it has, Pierre?
>
> On Mon, Jun 2, 2014 at 9:41 AM, Pierre B
> <pierre.borckm...@realimpactanalytics.com>
> wrote:
>>
>> Hi Michaël,
>>
>> Thanks for this. We could indeed do that.
>>
>> But I guess the question is more about the change of behaviour from 0.9.1
>> to 1.0.0.
>> We never had to care about that in previous versions.
>>
>> Does that mean we have to manually remove existing files or is there a way
>> to "automatically" overwrite when using saveAsTextFile?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-tp6696p6700.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
