It does help performance, but not significantly. I am just wondering: once Spark creates the staging directory along with the _SUCCESS file, can we just run a gsutil rsync command to move those files to the original directory? Has anyone tried this approach, or does anyone foresee any concerns?
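Roughly what I have in mind, sketched with made-up bucket and staging paths (illustrative only):

  # Placeholder paths; the staging directory name is hypothetical.
  # Sync the committed staging output into the final table directory.
  # Note: rsync is not atomic, so a partial failure could leave the
  # destination half-written.
  gsutil -m rsync -r \
      gs://my-bucket/warehouse/my_table/.spark-staging-example/ \
      gs://my-bucket/warehouse/my_table/

  # Remove the staging directory only after verifying the sync.
  gsutil -m rm -r gs://my-bucket/warehouse/my_table/.spark-staging-example/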
On Mon, 17 Jul 2023 at 9:47 PM, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

> Thanks Jay. Is there any suggestion on how much I can increase those
> parameters?
>
> On Mon, 17 Jul 2023 at 8:25 PM, Jay <jayadeep.jayara...@gmail.com> wrote:
>
>> FileOutputCommitter v2 is supported on GCS, but the rename is a
>> metadata copy-and-delete operation in GCS, so if there are many files
>> this step will take a long time. One workaround is to create a smaller
>> number of larger files, if that is possible from Spark; if it is not,
>> those configurations allow you to tune the thread pool that performs
>> the metadata copy.
>>
>> You can go through this table
>> <https://spark.apache.org/docs/latest/cloud-integration.html#recommended-settings-for-writing-to-object-stores>
>> to understand the GCS performance implications.
>>
>> On Mon, 17 Jul 2023 at 20:12, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> You said this Hive table was a managed table partitioned by date
>>> --> ${TODAY}
>>>
>>> How do you define your Hive managed table?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London, United Kingdom
>>>
>>> On Mon, 17 Jul 2023 at 15:29, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>>
>>>> It does support it; at least it doesn't error out for me. But it took
>>>> around 4 hours to finish the job.
>>>>
>>>> Interestingly, it took only 10 minutes to write the output to the
>>>> staging directory; the rest of the time went into renaming the
>>>> objects. That's the concern.
>>>>
>>>> It looks like a known issue with how Spark behaves on GCS, but I am
>>>> not finding any workaround for it.
>>>>
>>>> On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park <yeachan...@gmail.com> wrote:
>>>>
>>>>> Did you check whether mapreduce.fileoutputcommitter.algorithm.version
>>>>> 2 is supported on GCS? IIRC it wasn't, but you could check with GCP
>>>>> support.
>>>>>
>>>>> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Jay,
>>>>>>
>>>>>> I will try that option.
>>>>>>
>>>>>> Any insight on the file committer algorithms? I tried the v2
>>>>>> algorithm, but it is not improving the runtime. What is the best
>>>>>> practice on Dataproc for dynamic partition updates in Spark?
>>>>>>
>>>>>> On Mon, 17 Jul 2023 at 7:05 PM, Jay <jayadeep.jayara...@gmail.com> wrote:
>>>>>>
>>>>>>> You can try increasing fs.gs.batch.threads and
>>>>>>> fs.gs.max.requests.per.batch.
>>>>>>>
>>>>>>> The definitions of these flags are available here:
>>>>>>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
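>>>>>>>
>>>>>>> These are Hadoop properties, so from Spark they can be passed with
>>>>>>> the spark.hadoop. prefix. For illustration only (the values below
>>>>>>> are just a starting point to experiment with, not a recommendation):
>>>>>>>
>>>>>>>   spark-submit \
>>>>>>>     --conf spark.hadoop.fs.gs.batch.threads=30 \
>>>>>>>     --conf spark.hadoop.fs.gs.max.requests.per.batch=30 \
>>>>>>>     ...
>>>>>>>
>>>>>>> Increase them gradually and watch for GCS request throttling.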
>>>>>>>
>>>>>>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>>>>>>
>>>>>>>> No, I am using Spark 2.4 to update the GCS partitions, and I have a
>>>>>>>> managed Hive table on top of it.
>>>>>>>>
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>> When I do a dynamic partition update from Spark, it creates the new
>>>>>>>> files in a staging area, as shown here, but the GCS blob renaming
>>>>>>>> takes a lot of time. The table is partitioned by date, and I need
>>>>>>>> to update around 3 years of data; it usually takes 3 hours to
>>>>>>>> finish the process. Is there any way to speed this up?
>>>>>>>>
>>>>>>>> With Best Regards,
>>>>>>>> Dipayan Dev
>>>>>>>>
>>>>>>>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> So you are using GCP, and your Hive is installed on Dataproc,
>>>>>>>>> which happens to run your Spark as well. Is that correct?
>>>>>>>>>
>>>>>>>>> What version of Hive are you using?
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Mich Talebzadeh
>>>>>>>>>
>>>>>>>>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> Of late, I have encountered an issue where I have to overwrite a
>>>>>>>>>> lot of partitions of a Hive table through Spark. It looks like
>>>>>>>>>> writing to the hive_staging_directory takes 25% of the total
>>>>>>>>>> time, whereas 75% or more of the time goes into moving the ORC
>>>>>>>>>> files from the staging directory to the final partitioned
>>>>>>>>>> directory structure.
>>>>>>>>>>
>>>>>>>>>> I found some references suggesting this config for the Spark
>>>>>>>>>> write:
>>>>>>>>>>
>>>>>>>>>>   *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>>>>>>>>>
>>>>>>>>>> However, it is also mentioned that it is not safe, as a partial
>>>>>>>>>> job failure might cause data loss.
>>>>>>>>>>
>>>>>>>>>> Is there any suggestion on the pros and cons of using this
>>>>>>>>>> version? Or is there any ongoing Spark feature development to
>>>>>>>>>> address this issue?
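>>>>>>>>>>
>>>>>>>>>> For reference, this is a Hadoop property, so it can be passed
>>>>>>>>>> through Spark with the spark.hadoop. prefix, e.g. (submit command
>>>>>>>>>> abridged):
>>>>>>>>>>
>>>>>>>>>>   spark-submit \
>>>>>>>>>>     --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
>>>>>>>>>>     ...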
>>>>>>>>>>
>>>>>>>>>> With Best Regards,
>>>>>>>>>> Dipayan Dev

--
With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop
<https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
M.Tech (AI), IISc, Bangalore