Thanks, guys. But the issue seems orthogonal to which output committer is used, no?
When writing out a DataFrame as Parquet, does the job recover if one task crashes mid-way, leaving a half-written file? What we observe is that when the task is retried, it tries to open a "new" file of the same name and fails because the half-written file already exists.

Thanks
Vinoth

On Fri, Mar 25, 2016 at 1:16 PM, Surendra Manchikanti <
surendra.manchika...@gmail.com> wrote:

> Hi Vinoth,
>
> As per the documentation, DirectParquetOutputCommitter is better suited for S3:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/DirectParquetOutputCommitter.scala
>
> Regards,
> Surendra M
>
> -- Surendra Manchikanti
>
> On Fri, Mar 25, 2016 at 4:03 AM, Vinoth Chandar <vin...@uber.com> wrote:
>
>> Hi,
>>
>> We are saving a DataFrame as Parquet (using
>> DirectParquetOutputCommitter) as follows:
>>
>> dfWriter.format("parquet")
>>   .mode(SaveMode.Overwrite)
>>   .save(outputPath)
>>
>> The problem is that even if an executor fails once while writing a file (say,
>> some transient HDFS issue), when it is re-spawned it fails again because the
>> file already exists, eventually failing the entire job.
>>
>> Is this a known issue? Any workarounds?
>>
>> Thanks
>> Vinoth
>>
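
[Editor's sketch] For readers hitting the same collision, below is a minimal, unverified sketch (assuming Spark 1.6-era APIs, since the thread is from March 2016) of one possible workaround: reverting to the default ParquetOutputCommitter, which stages each task attempt under a _temporary attempt directory and only renames on commit, so a retried attempt does not collide with a half-written file the way the direct committer can. The config key "spark.sql.parquet.output.committer.class" and the committer class names come from the Spark 1.6 configuration; the object name, method name, DataFrame, and paths are made up for illustration and should be adapted and verified against your Spark version.

    // Sketch only, assuming Spark 1.6-era APIs; names and paths are placeholders.
    import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}

    object ParquetWriteSketch {
      def write(sqlContext: SQLContext, df: DataFrame, outputPath: String): Unit = {
        // Use the default committer instead of DirectParquetOutputCommitter.
        // The default writes each task attempt to its own _temporary directory
        // and renames on commit, so a retry starts from a fresh attempt path.
        sqlContext.setConf(
          "spark.sql.parquet.output.committer.class",
          "org.apache.parquet.hadoop.ParquetOutputCommitter")

        df.write
          .format("parquet")
          .mode(SaveMode.Overwrite)
          .save(outputPath)
      }
    }

If that reading of the two committers is right, the trade-off is that the direct committer avoids the slow rename-on-commit step (which is why it is suggested for S3) but gives up the per-attempt staging that makes task retries safe, which would match the failure described above.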