Re: Missing data in spark output

2022-10-25 Thread Steve Loughran
v1 on GCS isn't safe either, as promotion from task attempt to successful task is a dir rename: fast and atomic on HDFS, O(files) and non-atomic on GCS. If I can get that Hadoop 3.3.5 RC out soon, the manifest committer will be there to test https://issues.apache.org/jira/browse/MAPREDUCE-7341 unt

Re: Missing data in spark output

2022-10-21 Thread Chris Nauroth
Some users have observed issues like what you're describing related to the job commit algorithm, which is controlled by configuration property spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version. Hadoop's default value for this setting is 2. You can find a description of the algorithms in
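
For reference, a minimal sketch of how the property named above could be pinned explicitly on the `spark-submit` command line rather than left at its default. The property name comes from the message; the helper name `committer_conf_args`, the script name `my_job.py`, and the choice of value 1 are illustrative assumptions, not a recommendation from the thread (the later reply notes that v1 is not safe on GCS either).

```python
# Property name as quoted in the thread.
COMMITTER_VERSION_KEY = (
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"
)


def committer_conf_args(version: int) -> list[str]:
    """Build the spark-submit arguments that pin the commit algorithm version.

    Hypothetical helper for illustration; only versions 1 and 2 are accepted.
    """
    if version not in (1, 2):
        raise ValueError("FileOutputCommitter algorithm version must be 1 or 2")
    return ["--conf", f"{COMMITTER_VERSION_KEY}={version}"]


if __name__ == "__main__":
    # "my_job.py" is a placeholder script name, assumed for this example.
    print(" ".join(["spark-submit", *committer_conf_args(1), "my_job.py"]))
```

Setting the value explicitly at least makes it visible in the job's submitted configuration which algorithm a given run used, which may help when comparing runs that did and did not lose data.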

Re: Missing data in spark output

2022-10-19 Thread Martin Andersson
Is your spark job batch or streaming? From: Sandeep Vinayak Sent: Tuesday, October 18, 2022 19:48 To: dev@spark.apache.org Subject: Missing data in spark output

Re: Missing data in spark output

2022-10-18 Thread Emil Ejbyfeldt
Hi, We have observed similar behavior in older versions of Spark. But we are currently using 3.3.0, where we have not seen such issues. Which version of Spark and Hadoop are you using? On 18/10/2022 19:48, Sandeep Vinayak wrote: Hello Everyone, We are recently observing an intermittent