It does help performance, but not significantly. I am just wondering: once Spark creates the staging directory along with the _SUCCESS file, can we just run a gsutil rsync command to move those files to the original directory? Has anyone tried this approach, or does anyone foresee any concerns?
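Roughly what I have in mind, sketched with made-up bucket and staging paths (illustrative only):

  # Placeholder paths; the staging directory name is hypothetical.
  # Sync the committed staging output into the final table directory.
  # Note: rsync is not atomic, so a partial failure could leave the
  # destination half-written.
  gsutil -m rsync -r \
      gs://my-bucket/warehouse/my_table/.spark-staging-example/ \
      gs://my-bucket/warehouse/my_table/

  # Remove the staging directory only after verifying the sync.
  gsutil -m rm -r gs://my-bucket/warehouse/my_table/.spark-staging-example/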
On Mon, 17 Jul 2023 at 9:47 PM, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

> Thanks Jay. Is there any suggestion on how much I can increase those
> parameters?
>
> On Mon, 17 Jul 2023 at 8:25 PM, Jay <jayadeep.jayara...@gmail.com> wrote:
>
>> FileOutputCommitter v2 is supported on GCS, but the rename is a
>> metadata copy-and-delete operation in GCS, so if there are many files
>> this step will take a long time. One workaround is to create a smaller
>> number of larger files, if that is possible from Spark; if it is not,
>> those configurations allow you to tune the thread pool that performs
>> the metadata copy.
>>
>> You can go through this table
>> <https://spark.apache.org/docs/latest/cloud-integration.html#recommended-settings-for-writing-to-object-stores>
>> to understand the GCS performance implications.
>>
>> On Mon, 17 Jul 2023 at 20:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> You said this Hive table was a managed table partitioned by date
>>> --> ${TODAY}
>>>
>>> How do you define your Hive managed table?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London, United Kingdom
>>>
>>> On Mon, 17 Jul 2023 at 15:29, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>>
>>>> It does support it; at least it doesn't error out for me. But it took
>>>> around 4 hours to finish the job.
>>>>
>>>> Interestingly, it took only 10 minutes to write the output to the
>>>> staging directory; the rest of the time went into renaming the
>>>> objects. That's the concern.
>>>>
>>>> It looks like a known issue with how Spark behaves on GCS, but I am
>>>> not finding any workaround for it.
>>>>
>>>> On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park <yeachan...@gmail.com> wrote:
>>>>
>>>>> Did you check whether mapreduce.fileoutputcommitter.algorithm.version
>>>>> 2 is supported on GCS? IIRC it wasn't, but you could check with GCP
>>>>> support.
>>>>>
>>>>> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Jay,
>>>>>>
>>>>>> I will try that option.
>>>>>>
>>>>>> Any insight on the file committer algorithms? I tried the v2
>>>>>> algorithm, but it is not improving the runtime. What is the best
>>>>>> practice on Dataproc for dynamic partition updates in Spark?
>>>>>>
>>>>>> On Mon, 17 Jul 2023 at 7:05 PM, Jay <jayadeep.jayara...@gmail.com> wrote:
>>>>>>
>>>>>>> You can try increasing fs.gs.batch.threads and
>>>>>>> fs.gs.max.requests.per.batch.
>>>>>>>
>>>>>>> The definitions of these flags are available here:
>>>>>>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
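>>>>>>>
>>>>>>> These are Hadoop properties, so from Spark they can be passed with
>>>>>>> the spark.hadoop. prefix. For illustration only (the values below
>>>>>>> are just a starting point to experiment with, not a recommendation):
>>>>>>>
>>>>>>>   spark-submit \
>>>>>>>     --conf spark.hadoop.fs.gs.batch.threads=30 \
>>>>>>>     --conf spark.hadoop.fs.gs.max.requests.per.batch=30 \
>>>>>>>     ...
>>>>>>>
>>>>>>> Increase them gradually and watch for GCS request throttling.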
>>>>>>>
>>>>>>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>>>>>>
>>>>>>>> No, I am using Spark 2.4 to update the GCS partitions, and I have a
>>>>>>>> managed Hive table on top of it.
>>>>>>>>
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>> When I do a dynamic partition update from Spark, it creates the new
>>>>>>>> files in a staging area, as shown here, but the GCS blob renaming
>>>>>>>> takes a lot of time. The table is partitioned by date, and I need
>>>>>>>> to update around 3 years of data; it usually takes 3 hours to
>>>>>>>> finish the process. Is there any way to speed this up?
>>>>>>>>
>>>>>>>> With Best Regards,
>>>>>>>> Dipayan Dev
>>>>>>>>
>>>>>>>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> So you are using GCP, and your Hive is installed on Dataproc,
>>>>>>>>> which happens to run your Spark as well. Is that correct?
>>>>>>>>>
>>>>>>>>> What version of Hive are you using?
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Mich Talebzadeh
>>>>>>>>>
>>>>>>>>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> Of late, I have encountered an issue where I have to overwrite a
>>>>>>>>>> lot of partitions of a Hive table through Spark. It looks like
>>>>>>>>>> writing to the hive_staging_directory takes 25% of the total
>>>>>>>>>> time, whereas 75% or more of the time goes into moving the ORC
>>>>>>>>>> files from the staging directory to the final partitioned
>>>>>>>>>> directory structure.
>>>>>>>>>>
>>>>>>>>>> I found some references suggesting this config for the Spark
>>>>>>>>>> write:
>>>>>>>>>>
>>>>>>>>>>   *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>>>>>>>>>
>>>>>>>>>> However, it is also mentioned that it is not safe, as a partial
>>>>>>>>>> job failure might cause data loss.
>>>>>>>>>>
>>>>>>>>>> Is there any suggestion on the pros and cons of using this
>>>>>>>>>> version? Or is there any ongoing Spark feature development to
>>>>>>>>>> address this issue?
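>>>>>>>>>>
>>>>>>>>>> For reference, this is a Hadoop property, so it can be passed
>>>>>>>>>> through Spark with the spark.hadoop. prefix, e.g. (submit command
>>>>>>>>>> abridged):
>>>>>>>>>>
>>>>>>>>>>   spark-submit \
>>>>>>>>>>     --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
>>>>>>>>>>     ...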
>>>>>>>>>>
>>>>>>>>>> With Best Regards,
>>>>>>>>>> Dipayan Dev

--
With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop
<https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
M.Tech (AI), IISc, Bangalore