Re: Spark hive overwrite is very very slow

Jörn Franke Sun, 20 Aug 2017 07:10:28 -0700

Hive performance can also be improved by switching to Tez + LLAP as the 
execution engine.
Aside from this, you need to check which format is used by default when writing 
to Hive. One reason for slow writes into a Hive table could be that it 
writes CSV with gzip or bzip2 compression by default.
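
As a rough sketch of that format check, something like the following could be run as a small job (or adapted for spark-shell). The table name mydb.target_table and the helper looksLikeTextFormat are placeholders made up for illustration, not anything from the original job. The row labels printed by DESCRIBE FORMATTED differ between Spark versions, so the filter matches them loosely.

```scala
import org.apache.spark.sql.SparkSession

object CheckHiveTableFormat {
  // Heuristic (an assumption, not an official API): text-based Hive output
  // formats carry "Text" in the class name, e.g.
  // org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.
  def looksLikeTextFormat(outputFormat: String): Boolean =
    outputFormat.toLowerCase.contains("text")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("check-hive-table-format")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical table name; replace with the table being overwritten.
    val table = "mydb.target_table"

    // DESCRIBE FORMATTED returns rows of (col_name, data_type, comment);
    // the storage section includes the input/output format and the SerDe.
    val storageRows = spark.sql(s"DESCRIBE FORMATTED $table")
      .filter("lower(col_name) LIKE '%format%' OR lower(col_name) LIKE '%serde%'")

    storageRows.show(100, truncate = false)

    // Warn if the output format looks like plain text (CSV-style) storage.
    storageRows.collect().foreach { row =>
      val name = Option(row.getString(0)).getOrElse("")
      val value = Option(row.getString(1)).getOrElse("")
      if (name.toLowerCase.contains("outputformat") && looksLikeTextFormat(value)) {
        println(s"Table $table appears to use a text output format: $value")
      }
    }

    spark.stop()
  }
}
```

If the output format turns out to be a text format rather than Parquet or ORC, that would be the first thing to change before tuning anything else.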


> On 20. Aug 2017, at 15:52, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> 
> wrote:
> 
> Yes, we tried Hive and want to migrate to Spark for better performance. I am 
> using Parquet tables. Still no better performance while loading. 
> 
> Sent from my iPhone
> 
>> On Aug 20, 2017, at 2:24 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>> 
>> Have you checked how the performance is when running directly in Hive? 
>> 
>> In which format do you expect Hive to write? Have you made sure it is in 
>> this format? It could be that you are using an inefficient format (e.g. CSV + 
>> bzip2).
>> 
>>> On 20. Aug 2017, at 03:18, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> 
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> I have written a Spark SQL job on Spark 2.0 using Scala. It just 
>>> pulls data from a Hive table, adds extra columns, removes duplicates, 
>>> and then writes it back to Hive.
>>> 
>>> In the Spark UI, it takes almost 40 minutes to write 400 GB of data. Is 
>>> there anything I can do to improve performance?
>>> 
>>> spark.sql.shuffle.partitions is 2000 in my case, with executor memory of 
>>> 16 GB and dynamic allocation enabled.
>>> 
>>> I am doing an insert overwrite on a partition using:
>>> df.write.mode("overwrite").insertInto(table)
>>> 
>>> Any suggestions, please?
>>> 
>>> Sent from my iPhone
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> 

