Thanks Jay,

I will try that option.

Any insight on the file committer algorithms?

I tried the v2 committer algorithm, but it's not improving the runtime. What's
the best practice on Dataproc for dynamic partition updates in Spark?
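For context, here is a minimal sketch of the settings this thread touches on, combined in one place. The property names come from the Spark documentation and the GCS connector's CONFIGURATION.md linked below; the numeric values are illustrative assumptions, not Dataproc recommendations:

```python
# Sketch of Spark configuration for dynamic partition overwrite on GCS.
# All numeric values below are illustrative assumptions; tune per cluster.
gcs_overwrite_conf = {
    # Rewrite only the partitions present in the incoming DataFrame
    # instead of truncating the whole table (available from Spark 2.3).
    "spark.sql.sources.partitionOverwriteMode": "dynamic",
    # v2 moves task output to the final directory at task commit,
    # skipping the slow job-commit rename pass. Faster on GCS, but a
    # partial job failure can leave incomplete output visible.
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
    # GCS connector batching knobs from CONFIGURATION.md in
    # GoogleCloudDataproc/hadoop-connectors (values are guesses).
    "spark.hadoop.fs.gs.batch.threads": "16",
    "spark.hadoop.fs.gs.max.requests.per.batch": "30",
}

# These would normally be passed as `--conf key=value` on spark-submit
# or via SparkSession.builder.config(key, value).
for key, value in sorted(gcs_overwrite_conf.items()):
    print(f"--conf {key}={value}")
```

Note the trade-off the thread discusses: the v2 committer avoids the final rename pass that is so expensive on an object store, at the cost of weaker failure atomicity.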


On Mon, 17 Jul 2023 at 7:05 PM, Jay <jayadeep.jayara...@gmail.com> wrote:

> You can try increasing fs.gs.batch.threads and
> fs.gs.max.requests.per.batch.
>
> The definitions for these flags are available here -
> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>
> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>
>> No, I am using Spark 2.4 to update the GCS partitions. I have a managed
>> Hive table on top of this.
>> [image: image.png]
>> When I do a dynamic partition update from Spark, it creates the new files
>> in a staging area, as shown here.
>> But the GCS blob renaming takes a lot of time. I have partitions based on
>> dates, and I need to update around 3 years of data; it usually takes 3
>> hours to finish the process. Is there any way to speed this up?
>> With Best Regards,
>>
>> Dipayan Dev
>>
>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> So you are using GCP and your Hive is installed on Dataproc which
>>> happens to run your Spark as well. Is that correct?
>>>
>>> What version of Hive are you using?
>>>
>>> HTH
>>>
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>> London
>>> United Kingdom
>>>
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Of late, I have encountered the issue where I have to overwrite a lot
>>>> of partitions of the Hive table through Spark. It looks like writing to
>>>> hive_staging_directory takes 25% of the total time, whereas 75% or more
>>>> time goes in moving the ORC files from staging directory to the final
>>>> partitioned directory structure.
>>>>
>>>> I came across references suggesting the use of this config during the
>>>> Spark write:
>>>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>>>
>>>> However, it's also mentioned that it's not safe, as a partial job
>>>> failure might cause data loss.
>>>>
>>>> Is there any suggestion on the pros and cons of using this version? Or
>>>> any ongoing Spark feature development to address this issue?
>>>>
>>>>
>>>>
>>>> With Best Regards,
>>>>
>>>> Dipayan Dev
>>>>



With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop
<https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
M.Tech (AI), IISc, Bangalore
