Thanks, Jay. I will try that option.
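For reference, a minimal sketch of how those connector flags can be passed to a Spark session on Dataproc; the values below are illustrative assumptions, not tuned recommendations (see the CONFIGURATION.md link in Jay's reply for the semantics):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gcs-batch-tuning")
  // Hadoop/GCS-connector options take the spark.hadoop. prefix when they
  // are set through Spark configuration
  .config("spark.hadoop.fs.gs.batch.threads", "32")          // illustrative value
  .config("spark.hadoop.fs.gs.max.requests.per.batch", "30") // illustrative value
  .getOrCreate()

The same two flags can also be set per job with --properties on gcloud dataproc jobs submit spark, or cluster-wide at cluster creation time.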
Any insight on the file committer algorithms? I tried the v2 algorithm, but it's not improving the runtime (the write pattern and committer setting I am using are sketched at the end of this mail). What's the best practice on Dataproc for dynamic partition updates in Spark?

On Mon, 17 Jul 2023 at 7:05 PM, Jay <jayadeep.jayara...@gmail.com> wrote:

> You can try increasing fs.gs.batch.threads and
> fs.gs.max.requests.per.batch.
>
> The definitions for these flags are available here -
> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>
> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>
>> No, I am using Spark 2.4 to update the GCS partitions. I have a managed
>> Hive table on top of this.
>> [image: image.png]
>> When I do a dynamic partition update from Spark, it creates the new files
>> in a staging area, as shown here, but the GCS blob renaming takes a lot
>> of time. I have partitions based on dates and need to update around 3
>> years of data; it usually takes 3 hours to finish the process. Is there
>> any way to speed this up?
>>
>> With Best Regards,
>>
>> Dipayan Dev
>>
>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> So you are using GCP, and your Hive is installed on Dataproc, which
>>> happens to run your Spark as well. Is that correct?
>>>
>>> What version of Hive are you using?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>> view my LinkedIn profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Of late, I have encountered an issue where I have to overwrite a lot
>>>> of partitions of a Hive table through Spark. It looks like writing to the
>>>> hive_staging_directory takes 25% of the total time, while 75% or more of
>>>> the time goes into moving the ORC files from the staging directory to the
>>>> final partitioned directory structure.
>>>>
>>>> I found some references suggesting the use of this config during the
>>>> Spark write:
>>>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>>>
>>>> However, it's also mentioned that it's not safe, as a partial job
>>>> failure might cause data loss.
>>>>
>>>> Is there any suggestion on the pros and cons of using this version? Or
>>>> is there any ongoing Spark feature development to address this issue?
>>>>
>>>> With Best Regards,
>>>>
>>>> Dipayan Dev
>>>>
>>>
--
With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop
<https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
M.Tech (AI), IISc, Bangalore
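A minimal sketch along the lines of what I am doing for the dynamic partition update discussed above, assuming Spark 2.4 with Hive support; the database, table, and source names are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-partition-overwrite")
  .enableHiveSupport()
  // Overwrite only the partitions present in the incoming data instead of
  // truncating the whole table
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .getOrCreate()

// Hypothetical source of the updated rows
val updates = spark.table("staging_db.daily_updates")

// The write still goes through a staging directory; the per-partition
// files are then moved into the final layout, which is the rename step
// that is slow on GCS
updates.write
  .mode("overwrite")
  .insertInto("prod_db.events_by_date") // hypothetical table partitioned by date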
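And a minimal sketch of how the v2 commit algorithm from the original question can be enabled; as noted in the thread, v2 moves task output into place as each task commits, so a job that fails midway can leave partial data behind:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("committer-v2")
  // Hadoop MapReduce committer setting, passed through with the
  // spark.hadoop. prefix; v2 commits task output directly to the final
  // location instead of renaming it a second time at job commit
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()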