Hi All,

Lately I have been running into an issue where I have to overwrite a large number of partitions of a Hive table through Spark. Writing to the hive_staging_directory takes about 25% of the total time, while 75% or more goes into moving the ORC files from the staging directory to the final partitioned directory structure.
I found some references suggesting this config for the Spark write: *mapreduce.fileoutputcommitter.algorithm.version = 2*. However, it is also mentioned that this is not safe, since a partial job failure might cause data loss. Are there any suggestions on the pros and cons of using this version? Or is there any ongoing Spark feature development to address this issue? For reference, the sketch below shows roughly how I would enable it.
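This is a minimal sketch, assuming Spark 2.x with Hive support and that the write goes through the Hadoop FileOutputCommitter path; the app name, the table my_db.events, and the toy DataFrame are placeholders, not from my actual job:

```scala
import org.apache.spark.sql.SparkSession

object CommitterV2Example {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CommitterV2Example") // illustrative name
      .enableHiveSupport()
      .getOrCreate()

    // With algorithm version 2, each task commit moves its output straight
    // into the final location, so the serial rename pass during job commit
    // (the 75% in my case) largely disappears. The trade-off: if the job
    // fails after some tasks have committed, their files are already
    // visible in the destination -- i.e., partial output.
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

    import spark.implicits._
    // Toy frame; "dt" is the partition column. A real job would read
    // this from an upstream source instead.
    val df = Seq((1, "a", "2023-10-01"), (2, "b", "2023-10-02"))
      .toDF("id", "value", "dt")

    // "my_db.events" is a placeholder ORC-backed Hive table partitioned
    // by dt; overwrite mode with insertInto rewrites the touched partitions.
    df.write
      .mode("overwrite")
      .insertInto("my_db.events")
  }
}
```

With Best Regards,
Dipayan Dev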