Thanks Jay. Is there any suggestion on how much I can increase those parameters?
On Mon, 17 Jul 2023 at 8:25 PM, Jay <jayadeep.jayara...@gmail.com> wrote:

FileOutputCommitter v2 is supported on GCS, but a rename in GCS is a metadata copy-and-delete operation, so if there are a large number of files this step will take a long time. One workaround is to produce a smaller number of larger files, if that is possible from Spark; if that is not possible, those configurations let you tune the thread pool that performs the metadata copy.

You can go through this table
<https://spark.apache.org/docs/latest/cloud-integration.html#recommended-settings-for-writing-to-object-stores>
to understand the GCS performance implications.

On Mon, 17 Jul 2023 at 20:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

You said this Hive table was a managed table partitioned by date --> ${TODAY}.

How do you define your Hive managed table?

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London, United Kingdom
https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Mon, 17 Jul 2023 at 15:29, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

It does support it; at least it doesn't error out for me. But the job took around 4 hours to finish.

Interestingly, it took only 10 minutes to write the output to the staging directory; the rest of the time went into renaming the objects. That's the concern.

It looks like a known issue with how Spark behaves on GCS, but I haven't found a workaround for it.

On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park <yeachan...@gmail.com> wrote:

Did you check whether mapreduce.fileoutputcommitter.algorithm.version 2 is supported on GCS? IIRC it wasn't, but you could check with GCP support.

On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev <dev.dipaya...@gmail.com> wrote:

Thanks Jay, I will try that option.

Any insight on the file committer algorithms? I tried the v2 algorithm, but it is not improving the runtime. What is the best practice in Dataproc for dynamic partition updates in Spark?

On Mon, 17 Jul 2023 at 7:05 PM, Jay <jayadeep.jayara...@gmail.com> wrote:

You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch.

The definitions for these flags are available here:
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
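[For illustration, a minimal sketch of how these connector properties could be set on a Spark session. The property names come from the GCS connector configuration page linked above; the values are arbitrary placeholders, not recommendations, and safe limits should be checked against that page and your GCS quota.]

    // Sketch only: tune the GCS connector's batched metadata operations,
    // which back the copy-and-delete work behind each "rename".
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("gcs-commit-tuning-sketch")
      .getOrCreate()

    val hadoopConf = spark.sparkContext.hadoopConfiguration
    // Threads used for batched metadata requests (placeholder value).
    hadoopConf.setInt("fs.gs.batch.threads", 32)
    // Requests packed into each batch sent to the GCS API (placeholder value).
    hadoopConf.setInt("fs.gs.max.requests.per.batch", 32)

[The same settings could also be supplied at submit time as Spark properties with the standard spark.hadoop. prefix, e.g. spark.hadoop.fs.gs.batch.threads=32.]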
On Mon, 17 Jul 2023 at 14:59, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

No, I am using Spark 2.4 to update the GCS partitions. I have a managed Hive table on top of this.

[image: image.png]

When I do a dynamic partition update from Spark, it creates the new files in a staging area, as shown here. But the GCS blob renaming takes a lot of time. I have partitions based on dates and I need to update around 3 years of data. It usually takes 3 hours to finish the process. Is there any way to speed this up?

With Best Regards,
Dipayan Dev

On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

So you are using GCP, and your Hive is installed on Dataproc, which happens to run your Spark as well. Is that correct?

What version of Hive are you using?

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited

On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

Hi All,

Of late, I have encountered an issue where I have to overwrite a lot of partitions of a Hive table through Spark. It looks like writing to the hive_staging_directory takes 25% of the total time, whereas 75% or more of the time goes into moving the ORC files from the staging directory to the final partitioned directory structure.

I found some references that suggest using this config during the Spark write:
mapreduce.fileoutputcommitter.algorithm.version = 2

However, it is also mentioned that this is not safe, as a partial job failure might cause data loss.

Is there any suggestion on the pros and cons of using this version? Or any ongoing Spark feature development to address this issue?

With Best Regards,
Dipayan Dev

--
With Best Regards,

Dipayan Dev
Author of Deep Learning with Hadoop
<https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>
M.Tech (AI), IISc, Bangalore
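[For reference, a minimal sketch of where the settings discussed in this thread would go in a Spark 2.4 job doing a dynamic partition overwrite. The bucket, database, and table names are made up; spark.sql.sources.partitionOverwriteMode applies to datasource tables, while a Hive SerDe table relies on Hive's dynamic partition settings instead. As noted in the thread, committer algorithm version 2 is faster but weaker on failure, so a partially failed job can leave incomplete output behind.]

    // Sketch only: dynamic partition overwrite into an existing partitioned
    // table, with the v2 file output committer discussed in the thread.
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("dynamic-partition-overwrite-sketch")
      .enableHiveSupport()
      // Overwrite only the partitions present in the incoming data.
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      // v2 committer: faster commit, weaker guarantees if the job fails midway.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    // Hypothetical input that already carries the date partition column.
    val df = spark.read.orc("gs://some-bucket/incoming/")

    // insertInto matches columns by position against the existing table schema,
    // with the partition column(s) last.
    df.write
      .mode(SaveMode.Overwrite)
      .insertInto("my_db.my_partitioned_table")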