No, I am using Spark 2.4 to update the GCS partitions, and I have a managed Hive table on top of this.

[image: image.png]

When I do a dynamic partition overwrite from Spark, it first writes the new files to a staging area, as shown in the attached screenshot, but the subsequent GCS blob renaming takes a lot of time. The table is partitioned by date and I need to update around 3 years of data, which usually takes about 3 hours to finish. Is there any way to speed this up?
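Roughly, the write looks like the sketch below; the database, table, column, and bucket names are just placeholders, not the real ones.

    // Minimal sketch of the dynamic partition overwrite in question (names are illustrative).
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("dynamic-partition-overwrite")
      // For Spark datasource tables: overwrite only the partitions present in the incoming data.
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      // For Hive-serde tables: allow dynamic partition inserts.
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical input path on GCS.
    val updates = spark.read.orc("gs://my-bucket/incoming/")

    // insertInto expects the partition column (the date column here) to be the
    // last column of the DataFrame; only the touched date partitions get overwritten.
    updates.write
      .mode("overwrite")
      .insertInto("my_db.events_by_date")
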
With Best Regards,

Dipayan Dev


On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> So you are using GCP and your Hive is installed on Dataproc, which happens
> to run your Spark as well. Is that correct?
>
> What version of Hive are you using?
>
> HTH
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>
>> Hi All,
>>
>> Of late, I have encountered an issue where I have to overwrite a lot of
>> partitions of a Hive table through Spark. It looks like writing to the
>> hive_staging_directory takes 25% of the total time, whereas 75% or more
>> of the time goes into moving the ORC files from the staging directory to
>> the final partitioned directory structure.
>>
>> I found some references suggesting the following config during the Spark
>> write:
>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>
>> However, it's also mentioned that it's not safe, as a partial job failure
>> might cause data loss.
>>
>> Are there any suggestions on the pros and cons of using this version? Or
>> is there any ongoing Spark feature development to address this issue?
>>
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>>
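P.S. For reference, a minimal sketch of how the committer setting mentioned above is typically passed through Spark's Hadoop configuration (app name is illustrative; the same property can also go on the spark-submit command line as --conf):

    // Enable the v2 FileOutputCommitter algorithm discussed in the thread.
    // Caveat from the thread: v2 moves task output into place as each task commits,
    // so a job that fails partway through can leave partial output behind.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("v2-committer-example")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .enableHiveSupport()
      .getOrCreate()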