Hi All,

Lately I have been running into an issue where I have to overwrite a large number of partitions of a Hive table through Spark. Writing to the hive_staging_directory takes about 25% of the total time, while 75% or more goes into moving the ORC files from the staging directory to the final partitioned directory structure.
I found some references suggesting this config for the Spark write: *mapreduce.fileoutputcommitter.algorithm.version = 2*. However, it is also mentioned that this is not safe, since a partial job failure might cause data loss. Are there any suggestions on the pros and cons of using this version? Or is there any ongoing Spark feature development to address this issue? For reference, the sketch below shows roughly how I would enable it.
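This is a minimal sketch, assuming Spark 2.x with Hive support and that the write goes through the Hadoop FileOutputCommitter path; the app name, the table my_db.events, and the toy DataFrame are placeholders, not from my actual job:

```scala
import org.apache.spark.sql.SparkSession

object CommitterV2Example {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CommitterV2Example") // illustrative name
      .enableHiveSupport()
      .getOrCreate()

    // With algorithm version 2, each task commit moves its output straight
    // into the final location, so the serial rename pass during job commit
    // (the 75% in my case) largely disappears. The trade-off: if the job
    // fails after some tasks have committed, their files are already
    // visible in the destination -- i.e., partial output.
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

    import spark.implicits._
    // Toy frame; "dt" is the partition column. A real job would read
    // this from an upstream source instead.
    val df = Seq((1, "a", "2023-10-01"), (2, "b", "2023-10-02"))
      .toDF("id", "value", "dt")

    // "my_db.events" is a placeholder ORC-backed Hive table partitioned
    // by dt; overwrite mode with insertInto rewrites the touched partitions.
    df.write
      .mode("overwrite")
      .insertInto("my_db.events")
  }
}
```

With Best Regards,
Dipayan Dev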