Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-07-01 Thread Steve Loughran
https://issues.apache.org/jira/browse/MAPREDUCE-7282 "MR v2 commit algorithm is dangerous, should be deprecated and not the default". Someone do a PR to change the default, and if it doesn't break too much I'll merge it. On Mon, 29 Jun 2020 at 13:20, Steve Loughran wrote: > v2 does a file-by-file

Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-06-29 Thread Steve Loughran
v2 does a file-by-file copy to the dest dir in task commit; v1 promotes task attempts to the job attempt dir by dir rename, and job commit lists those and moves the contents. If the worker fails during task commit, the next task attempt has to replace every file, so it had better use the same filenames. T
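
The difference described in the message above can be illustrated with a small local-filesystem simulation. This is only a sketch under stated assumptions, not Hadoop's FileOutputCommitter code: the function names, the directory layout, the file names, and the `fail_after` crash switch are all hypothetical, and plain `os.rename`/`os.replace` calls stand in for the real filesystem operations.

```python
import os
import shutil
import tempfile


class TaskCommitFailure(Exception):
    """Simulated worker crash partway through task commit (hypothetical)."""


def write_task_output(attempt_dir, filenames):
    # Create a task attempt directory holding one small file per name.
    os.makedirs(attempt_dir)
    for name in filenames:
        with open(os.path.join(attempt_dir, name), "w") as f:
            f.write(name)


def v2_task_commit(attempt_dir, dest_dir, fail_after=None):
    # v2-style: each file goes straight into the destination during task commit.
    for i, name in enumerate(sorted(os.listdir(attempt_dir))):
        if fail_after is not None and i >= fail_after:
            raise TaskCommitFailure(name)
        os.replace(os.path.join(attempt_dir, name), os.path.join(dest_dir, name))


def v1_task_commit(attempt_dir, job_attempt_dir, task_id):
    # v1-style: promote the whole task attempt with one directory rename.
    os.rename(attempt_dir, os.path.join(job_attempt_dir, task_id))


def v1_job_commit(job_attempt_dir, dest_dir):
    # v1-style: job commit lists the promoted task dirs and moves their contents.
    for task_id in sorted(os.listdir(job_attempt_dir)):
        task_dir = os.path.join(job_attempt_dir, task_id)
        for name in sorted(os.listdir(task_dir)):
            os.replace(os.path.join(task_dir, name), os.path.join(dest_dir, name))


def demo():
    files = ["part-0000-a", "part-0000-b"]
    root = tempfile.mkdtemp()
    try:
        # v2: the worker "crashes" after moving one of its two files.
        v2_dest = os.path.join(root, "v2_dest")
        os.makedirs(v2_dest)
        v2_attempt = os.path.join(root, "v2_attempt_0")
        write_task_output(v2_attempt, files)
        try:
            v2_task_commit(v2_attempt, v2_dest, fail_after=1)
        except TaskCommitFailure:
            pass
        v2_visible = sorted(os.listdir(v2_dest))

        # v1: nothing reaches the destination until job commit runs.
        v1_dest = os.path.join(root, "v1_dest")
        job_attempt = os.path.join(root, "v1_job_attempt")
        os.makedirs(v1_dest)
        os.makedirs(job_attempt)
        v1_attempt = os.path.join(root, "v1_attempt_0")
        write_task_output(v1_attempt, files)
        v1_task_commit(v1_attempt, job_attempt, "task_0000")
        v1_before_job_commit = sorted(os.listdir(v1_dest))
        v1_job_commit(job_attempt, v1_dest)
        v1_after_job_commit = sorted(os.listdir(v1_dest))
        return v2_visible, v1_before_job_commit, v1_after_job_commit
    finally:
        shutil.rmtree(root)
```

In this sketch the interrupted v2-style commit leaves a partial result visible in the destination, which a retry can only clean up by overwriting the same filenames, while the v1-style destination stays empty until job commit.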

Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-06-25 Thread Waleed Fateem
I was trying to make my email short and concise, but the rationale for setting that to 1 by default is that it's safer. With algorithm version 2 you run the risk of having bad data in cases where tasks fail, or even duplicate data if a task fails and succeeds on a reattempt (I don't know if th

Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-06-25 Thread Sean Owen
I think this is a Hadoop property that is just passed through? If the default is different in Hadoop 3 we could mention that in the docs. I don't know if we want to always set it to 1 as a Spark default, even in Hadoop 3, right? On Thu, Jun 25, 2020 at 2:43 PM Waleed Fateem wrote: > > Hello! > > I noti
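
On the "passed through" point: Spark forwards any `spark.hadoop.*` setting to the Hadoop configuration with the prefix removed, so the property in the subject line reaches Hadoop as `mapreduce.fileoutputcommitter.algorithm.version`. The snippet below is a minimal illustration of that mapping only; `hadoop_conf_from_spark` is a hypothetical helper written for this example, not Spark's implementation.

```python
SPARK_HADOOP_PREFIX = "spark.hadoop."


def hadoop_conf_from_spark(spark_conf):
    # Hypothetical helper: keep only spark.hadoop.* keys and strip the prefix,
    # mirroring how such settings end up in the Hadoop configuration.
    return {
        key[len(SPARK_HADOOP_PREFIX):]: value
        for key, value in spark_conf.items()
        if key.startswith(SPARK_HADOOP_PREFIX)
    }


spark_conf = {
    "spark.app.name": "example",  # assumed value, not forwarded to Hadoop
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "1",
}
hadoop_conf = hadoop_conf_from_spark(spark_conf)
```

With a real job the same key would typically be supplied through `--conf` on `spark-submit` or in `spark-defaults.conf`.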