Thanks, guys. But the issue seems orthogonal to which output committer is
used, no?
When writing out a dataframe as parquet, does the job recover if one task
crashes mid-way, leaving a half-written file? What we observe is that when
the task is retried, it tries to open a "new" file of the same name.
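For context, a minimal sketch of the committer setting involved, assuming a
spark-shell style sqlContext and Spark 1.6-era configuration keys (the key
name may differ in other versions); this is not code from the thread:

// With the default rename-based committer, each task attempt stages its output
// under a unique _temporary/<appAttempt>/_temporary/<taskAttempt>/ directory and
// files are only moved into place on commit, so a retried attempt does not
// collide with a half-written file left by the failed one. A direct committer
// writes final file names straight into the output directory instead, which is
// consistent with the "same file name on retry" behaviour described above.
val committerKey = "spark.sql.parquet.output.committer.class"

// Default, rename-based Parquet committer (retry-safe on HDFS):
sqlContext.setConf(committerKey, "org.apache.parquet.hadoop.ParquetOutputCommitter")

println(sqlContext.getConf(committerKey))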
Hi Vinoth,
As per the documentation, DirectParquetOutputCommitter is better suited for S3.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/DirectParquetOutputCommitter.scala
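A sketch of wiring that in, assuming Spark 1.6-era config keys; the S3 path is
a placeholder and df stands for whatever dataframe is being saved:

// Sketch only; the config key and class location are as of Spark 1.6 and may
// differ in other versions.
sqlContext.setConf(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")

// A direct committer has no task-level commit to arbitrate duplicate attempts,
// so speculative execution should stay off.
require(!sc.getConf.getBoolean("spark.speculation", false),
  "direct committers are unsafe with speculative execution")

df.write.parquet("s3n://example-bucket/example/path")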
Regards,
Surendra M
-- Surendra Manchikanti
I would not recommend using the direct output committer with HDFS. It's
intended only as an optimization for S3.
On Fri, Mar 25, 2016 at 4:03 AM, Vinoth Chandar wrote:
> Hi,
>
> We are saving a dataframe as parquet (using
> DirectParquetOutputCommitter) as follows.
>
> dfWri
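The quoted snippet is cut off at "dfWri". Purely as an illustration (a
hypothetical reconstruction, not the code from the original mail), a save
along those lines, with the committer configured as shown earlier in the
thread, might look like:

// Hypothetical reconstruction for illustration only; the save mode and output
// path are examples and do not come from the original mail.
import org.apache.spark.sql.SaveMode

df.write
  .mode(SaveMode.Append)                   // example mode
  .parquet("hdfs:///tmp/example_output")   // example path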