Thanks Jay. Is there any suggestion on how much I can increase those parameters?
On Mon, 17 Jul 2023 at 8:25 PM, Jay <jayadeep.jayara...@gmail.com> wrote:

FileOutputCommitter v2 is supported on GCS, but a rename in GCS is a metadata copy-and-delete operation, so if there are a large number of files this step will take a long time. One workaround is to produce a smaller number of larger files, if that is possible from Spark; if that is not possible, those configurations let you tune the thread pool that performs the metadata copy.

You can go through this table
<https://spark.apache.org/docs/latest/cloud-integration.html#recommended-settings-for-writing-to-object-stores>
to understand the GCS performance implications.

On Mon, 17 Jul 2023 at 20:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

You said this Hive table was a managed table partitioned by date --> ${TODAY}.

How do you define your Hive managed table?

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London, United Kingdom
https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Mon, 17 Jul 2023 at 15:29, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

It does support it; at least it doesn't error out for me. But the job took around 4 hours to finish.

Interestingly, it took only 10 minutes to write the output to the staging directory; the rest of the time went into renaming the objects. That's the concern.

It looks like a known issue with how Spark behaves on GCS, but I haven't found a workaround for it.

On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park <yeachan...@gmail.com> wrote:

Did you check whether mapreduce.fileoutputcommitter.algorithm.version 2 is supported on GCS? IIRC it wasn't, but you could check with GCP support.

On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev <dev.dipaya...@gmail.com> wrote:

Thanks Jay, I will try that option.

Any insight on the file committer algorithms? I tried the v2 algorithm, but it is not improving the runtime. What is the best practice in Dataproc for dynamic partition updates in Spark?

On Mon, 17 Jul 2023 at 7:05 PM, Jay <jayadeep.jayara...@gmail.com> wrote:

You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch.

The definitions for these flags are available here:
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
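[For illustration, a minimal sketch of how these connector properties could be set on a Spark session. The property names come from the GCS connector configuration page linked above; the values are arbitrary placeholders, not recommendations, and safe limits should be checked against that page and your GCS quota.]

    // Sketch only: tune the GCS connector's batched metadata operations,
    // which back the copy-and-delete work behind each "rename".
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("gcs-commit-tuning-sketch")
      .getOrCreate()

    val hadoopConf = spark.sparkContext.hadoopConfiguration
    // Threads used for batched metadata requests (placeholder value).
    hadoopConf.setInt("fs.gs.batch.threads", 32)
    // Requests packed into each batch sent to the GCS API (placeholder value).
    hadoopConf.setInt("fs.gs.max.requests.per.batch", 32)

[The same settings could also be supplied at submit time as Spark properties with the standard spark.hadoop. prefix, e.g. spark.hadoop.fs.gs.batch.threads=32.]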
On Mon, 17 Jul 2023 at 14:59, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

No, I am using Spark 2.4 to update the GCS partitions. I have a managed Hive table on top of this.

[image: image.png]

When I do a dynamic partition update from Spark, it creates the new files in a staging area, as shown here. But the GCS blob renaming takes a lot of time. I have partitions based on dates and I need to update around 3 years of data. It usually takes 3 hours to finish the process. Is there any way to speed this up?

With Best Regards,
Dipayan Dev

On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

So you are using GCP, and your Hive is installed on Dataproc, which happens to run your Spark as well. Is that correct?

What version of Hive are you using?

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited

On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

Hi All,

Of late, I have encountered an issue where I have to overwrite a lot of partitions of a Hive table through Spark. It looks like writing to the hive_staging_directory takes 25% of the total time, whereas 75% or more of the time goes into moving the ORC files from the staging directory to the final partitioned directory structure.

I found some references that suggest using this config during the Spark write:
mapreduce.fileoutputcommitter.algorithm.version = 2

However, it is also mentioned that this is not safe, as a partial job failure might cause data loss.

Is there any suggestion on the pros and cons of using this version? Or any ongoing Spark feature development to address this issue?

With Best Regards,
Dipayan Dev

--
With Best Regards,

Dipayan Dev
Author of Deep Learning with Hadoop
<https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>
M.Tech (AI), IISc, Bangalore
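[For reference, a minimal sketch of where the settings discussed in this thread would go in a Spark 2.4 job doing a dynamic partition overwrite. The bucket, database, and table names are made up; spark.sql.sources.partitionOverwriteMode applies to datasource tables, while a Hive SerDe table relies on Hive's dynamic partition settings instead. As noted in the thread, committer algorithm version 2 is faster but weaker on failure, so a partially failed job can leave incomplete output behind.]

    // Sketch only: dynamic partition overwrite into an existing partitioned
    // table, with the v2 file output committer discussed in the thread.
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("dynamic-partition-overwrite-sketch")
      .enableHiveSupport()
      // Overwrite only the partitions present in the incoming data.
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      // v2 committer: faster commit, weaker guarantees if the job fails midway.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    // Hypothetical input that already carries the date partition column.
    val df = spark.read.orc("gs://some-bucket/incoming/")

    // insertInto matches columns by position against the existing table schema,
    // with the partition column(s) last.
    df.write
      .mode(SaveMode.Overwrite)
      .insertInto("my_db.my_partitioned_table")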