Thanks guys. But the issue seems orthogonal to what output committer is
used, no?

When writing out a dataframe as parquet, does the job recover if one task
crashes mid-way, leaving a half-written file? What we observe is that when
the task is retried, it tries to open a "new" file of the same name and
fails because the half-written file already exists.
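
For completeness, here is a simplified sketch of how the write is wired up
on our end. The conf key is how I believe the direct committer is
configured in Spark 1.6, and the helper below is just an illustration, not
our exact code:

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}

// Simplified sketch: point Spark SQL at the direct committer (conf key as
// we understand it for Spark 1.6), then write the dataframe out as
// parquet, overwriting anything at the target path.
def writeParquet(sqlContext: SQLContext, df: DataFrame,
                 outputPath: String): Unit = {
  sqlContext.setConf(
    "spark.sql.parquet.output.committer.class",
    "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")

  df.write
    .format("parquet")
    .mode(SaveMode.Overwrite)
    .save(outputPath)
}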

Thanks
Vinoth

On Fri, Mar 25, 2016 at 1:16 PM, Surendra , Manchikanti <
surendra.manchika...@gmail.com> wrote:

> Hi Vinoth,
>
> As per the documentation, DirectParquetOutputCommitter is better suited
> for S3.
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/DirectParquetOutputCommitter.scala
>
> Regards,
> Surendra M
>
> -- Surendra Manchikanti
>
> On Fri, Mar 25, 2016 at 4:03 AM, Vinoth Chandar <vin...@uber.com> wrote:
>
>> Hi,
>>
>> We are doing the following to save a dataframe as parquet (using
>> DirectParquetOutputCommitter):
>>
>> dfWriter.format("parquet")
>>   .mode(SaveMode.Overwrite)
>>   .save(outputPath)
>>
>> The problem is that even if an executor fails once while writing a file
>> (say, due to some transient HDFS issue), the re-spawned attempt fails
>> again because the file already exists, eventually failing the entire job.
>>
>> Is this a known issue? Any workarounds?
>>
>> Thanks
>> Vinoth
>>
>
>
