No, I am using Spark 2.4 to update the partitions on GCS. I have a managed
Hive table on top of this data.
[image: image.png]
When I do a dynamic partition overwrite from Spark, it first creates the new
files in a staging area, as shown here. The subsequent GCS blob renaming,
however, takes a lot of time. The table is partitioned by date, and I need to
update around 3 years of data; the whole process usually takes about 3 hours
to finish. Is there any way to speed this up?
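
For context, the write looks roughly like this (a minimal sketch in PySpark;
the database, table, and column names below are just placeholders):

# Dynamic partition overwrite of a date-partitioned Hive table from Spark 2.4:
# only the partitions present in the incoming DataFrame are replaced, the rest
# of the table is left untouched.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-partition-overwrite")
    .enableHiveSupport()
    # Overwrite only the partitions being written, not the whole table
    # (applies when Spark uses its native datasource write path).
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    # Needed when the write goes through the Hive path with no static partitions.
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .getOrCreate()
)

# 'updates_df' stands for the recomputed rows; the date partition column must
# come last so insertInto can map it onto the table's partition column.
updates_df = spark.table("my_db.events_updates")   # placeholder source

updates_df.write.mode("overwrite").insertInto("my_db.events")  # placeholder target table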
With Best Regards,

Dipayan Dev

On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> So you are using GCP, and your Hive is installed on Dataproc, which happens
> to run your Spark as well. Is that correct?
>
> What version of Hive are you using?
>
> HTH
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>
>> Hi All,
>>
>> Of late, I have run into an issue where I have to overwrite a lot of
>> partitions of a Hive table through Spark. It looks like writing to the
>> hive_staging_directory takes 25% of the total time, while 75% or more of
>> the time goes into moving the ORC files from the staging directory to the
>> final partitioned directory structure.
>>
>> I came across some references suggesting the use of this config during
>> the Spark write:
>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>
>> However, it is also mentioned that this is not safe, as a partial job
>> failure might cause data loss.
>>
>> Are there any suggestions on the pros and cons of using this version? Or
>> is there any ongoing Spark feature development to address this issue?
>>
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>>
>
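
For reference, the version-2 committer mentioned in the quoted message is
normally passed to Spark as a Hadoop configuration, along these lines (a
minimal sketch; the data-loss caveat above still applies, and whether it
actually shortens the staging-to-final move on GCS would need to be verified):

# Version 2 of the FileOutputCommitter moves task output straight into the
# final directory at task commit and skips the second rename pass at job
# commit, which is cheaper on object stores such as GCS, but a job that
# fails midway can leave partial output behind.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("v2-committer")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .enableHiveSupport()
    .getOrCreate()
)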
