Ah, I see. Then I would also check directly in Hive whether you have issues 
inserting data into the Hive table. Alternatively, you can register the 
DataFrame as a temp table and do an insert into the Hive table from the temp 
table using Spark SQL ("insert into table hivetable select * from temptable").


You seem to be using Cloudera, so you probably have a very outdated Hive 
version. You could switch to a distribution that ships a recent Hive 2 with 
Tez+LLAP - these are much more performant and have many more features.


> On 20. Aug 2017, at 18:47, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> 
> wrote:
> 
> Hi,
> 
> I have created the Hive table in Impala first with parquet as the storage 
> format. With a dataframe from Spark I am trying to insert into the same 
> table with the below syntax.
> 
> Table is partitioned by year,month,day 
> ds.write.mode(SaveMode.Overwrite).insertInto("db.parqut_table")
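> 
> For completeness, a sketch of the surrounding calls (assuming a 
> SparkSession named spark; the dynamic-partition settings are the usual 
> Hive requirements, not something specific to my job):
> 
>     import org.apache.spark.sql.SaveMode
>     // dynamic partition inserts into a Hive table usually need these
>     spark.sql("SET hive.exec.dynamic.partition=true")
>     spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
>     // insertInto matches columns by position, partition columns last
>     ds.write.mode(SaveMode.Overwrite).insertInto("db.parqut_table")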
> 
> https://issues.apache.org/jira/browse/SPARK-20049
> 
> I saw something in the above link not sure if that is same thing in my case.
> 
> Thanks,
> Asmath
> 
>> On Sun, Aug 20, 2017 at 11:42 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>> Have you made sure that saveAsTable stores them as parquet?
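>> 
>> For example (a sketch; the table name is a placeholder):
>> 
>>     ds.write.format("parquet").mode("overwrite").saveAsTable("db.parqut_table")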
>> 
>>> On 20. Aug 2017, at 18:07, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> 
>>> wrote:
>>> 
>>> we are using parquet tables, is it causing any performance issue?
>>> 
>>>> On Sun, Aug 20, 2017 at 9:09 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> Improving the performance of Hive can also be done by switching to 
>>>> Tez+LLAP as the engine.
>>>> Aside from this: you need to check what the default format is that it 
>>>> writes to Hive. One cause of slow stores into a Hive table could be that 
>>>> it writes by default to csv/gzip or csv/bzip2.
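>>>> 
>>>> One way to check what format a table actually uses (the table name is a 
>>>> placeholder):
>>>> 
>>>>     // prints, among other things, the InputFormat/OutputFormat and SerDe
>>>>     spark.sql("DESCRIBE FORMATTED db.parqut_table").show(100, false)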
>>>> 
>>>> > On 20. Aug 2017, at 15:52, KhajaAsmath Mohammed 
>>>> > <mdkhajaasm...@gmail.com> wrote:
>>>> >
>>>> > Yes, we tried Hive and want to migrate to Spark for better performance. 
>>>> > I am using parquet tables. Still no better performance while loading.
>>>> >
>>>> > Sent from my iPhone
>>>> >
>>>> >> On Aug 20, 2017, at 2:24 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> >>
>>>> >> Have you tried directly in Hive how the performance is?
>>>> >>
>>>> >> In which format do you expect Hive to write? Have you made sure it is 
>>>> >> in this format? It could be that you use an inefficient format (e.g. 
>>>> >> CSV + bzip2).
>>>> >>
>>>> >>> On 20. Aug 2017, at 03:18, KhajaAsmath Mohammed 
>>>> >>> <mdkhajaasm...@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I have written a Spark SQL job on Spark 2.0 using Scala. It just 
>>>> >>> pulls the data from a Hive table, adds extra columns, removes 
>>>> >>> duplicates, and then writes it back to Hive again.
>>>> >>>
>>>> >>> In the Spark UI it is taking almost 40 minutes to write 400 GB of 
>>>> >>> data. Is there anything I can do to improve the performance?
>>>> >>>
>>>> >>> spark.sql.shuffle.partitions is 2000 in my case, with executor memory 
>>>> >>> of 16 GB and dynamic allocation enabled.
>>>> >>>
>>>> >>> I am doing an insert overwrite on the partitioned table:
>>>> >>> ds.write.mode(SaveMode.Overwrite).insertInto(table)
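>>>> >>>
>>>> >>> Spelled out, roughly (assuming a SparkSession named spark and that 
>>>> >>> the setting above is spark.sql.shuffle.partitions):
>>>> >>>
>>>> >>>     import org.apache.spark.sql.SaveMode
>>>> >>>     // shuffle parallelism for the dedup/transform stages
>>>> >>>     spark.conf.set("spark.sql.shuffle.partitions", "2000")
>>>> >>>     // `table` is the name of the target partitioned Hive table
>>>> >>>     ds.write.mode(SaveMode.Overwrite).insertInto(table)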
>>>> >>>
>>>> >>> Any suggestions please?
>>>> >>>
>>>> >>> Sent from my iPhone
>>>> >>>
>>> 
> 
