We are on Cloudera CDH 5.10 and we are using the Spark 2 that comes with
Cloudera.

Coming to the second solution, I created a temporary view on the
dataframe, but it did not improve my performance either. I do remember
that performance was very fast when overwriting the whole table without
partitions; the problem started only after I began using partitions.
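For reference, a minimal sketch of the temp-view approach mentioned above
(and suggested by Jörn below), assuming a Hive-enabled SparkSession named
spark, a dataframe df, and placeholder names temptable and db.hivetable:

    // Register the dataframe so Spark SQL can read it as a view.
    df.createOrReplaceTempView("temptable")

    // For a partitioned target table, Hive-style dynamic partitioning
    // must be enabled before a plain INSERT ... SELECT will work.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // Column order must match the target table, with the partition
    // columns (year, month, day) last.
    spark.sql("INSERT INTO TABLE db.hivetable SELECT * FROM temptable")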
On Sun, Aug 20, 2017 at 12:46 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Ah I see. Then I would also check directly in Hive whether you have
> issues inserting data into the Hive table. Alternatively you can try to
> register the df as a temp table and do an insert into the Hive table
> from the temp table using Spark SQL ("insert into table hivetable
> select * from temptable").
>
> You seem to use Cloudera, so you probably have a very outdated Hive
> version. You could switch to a distribution that has a recent version
> of Hive 2 with Tez+LLAP - these are much more performant and have many
> more features.
>
> On 20. Aug 2017, at 18:47, KhajaAsmath Mohammed
> <mdkhajaasm...@gmail.com> wrote:
>
> Hi,
>
> I have created the Hive table in Impala first, with Parquet as the
> storage format. With the dataframe from Spark I am trying to insert
> into the same table with the syntax below.
>
> The table is partitioned by year, month, day:
>
> ds.write.mode(SaveMode.Overwrite).insertInto("db.parqut_table")
>
> https://issues.apache.org/jira/browse/SPARK-20049
>
> I saw something in the above link; not sure if it is the same thing in
> my case.
>
> Thanks,
> Asmath
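A hedged aside on the insertInto call quoted above:
DataFrameWriter.insertInto resolves columns by position rather than by
name, so the dataframe has to carry the same column order as the table
definition, with the partition columns last. A sketch, using placeholder
column names:

    import org.apache.spark.sql.SaveMode

    // insertInto matches columns by POSITION, not by name. Reorder the
    // dataframe so the partition columns (year, month, day) come last,
    // exactly as in the table definition.
    val ordered = ds.select("col1", "col2", "year", "month", "day")

    ordered.write
      .mode(SaveMode.Overwrite)
      .insertInto("db.parqut_table")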
> On Sun, Aug 20, 2017 at 11:42 AM, Jörn Franke <jornfra...@gmail.com>
> wrote:
>
>> Have you made sure that the saveAsTable stores them as Parquet?
>>
>> On 20. Aug 2017, at 18:07, KhajaAsmath Mohammed
>> <mdkhajaasm...@gmail.com> wrote:
>>
>> We are using Parquet tables - is that causing any performance issue?
>>
>> On Sun, Aug 20, 2017 at 9:09 AM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>>> Improving the performance of Hive can also be done by switching to
>>> Tez+LLAP as an engine.
>>> Aside from this: you need to check what the default format is that it
>>> writes to Hive. One reason for slow storing into a Hive table could
>>> be that it writes by default to csv/gzip or csv/bzip2.
>>>
>>> > On 20. Aug 2017, at 15:52, KhajaAsmath Mohammed
>>> > <mdkhajaasm...@gmail.com> wrote:
>>> >
>>> > Yes, we tried Hive and want to migrate to Spark for better
>>> > performance. I am using Parquet tables. Still no better performance
>>> > while loading.
>>> >
>>> > Sent from my iPhone
>>> >
>>> >> On Aug 20, 2017, at 2:24 AM, Jörn Franke <jornfra...@gmail.com>
>>> >> wrote:
>>> >>
>>> >> Have you tried directly in Hive how the performance is?
>>> >>
>>> >> In which format do you expect Hive to write? Have you made sure it
>>> >> is in this format? It could be that you use an inefficient format
>>> >> (e.g. CSV + bzip2).
>>> >>
>>> >>> On 20. Aug 2017, at 03:18, KhajaAsmath Mohammed
>>> >>> <mdkhajaasm...@gmail.com> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> I have written a Spark SQL job on Spark 2.0 using Scala. It just
>>> >>> pulls the data from a Hive table, adds extra columns, removes
>>> >>> duplicates, and then writes it back to Hive again.
>>> >>>
>>> >>> In the Spark UI it is taking almost 40 minutes to write 400 GB of
>>> >>> data. Is there anything I need to do to improve performance?
>>> >>>
>>> >>> spark.sql.shuffle.partitions is 2000 in my case, with executor
>>> >>> memory of 16 GB and dynamic allocation enabled.
>>> >>>
>>> >>> I am doing insert overwrite on partitions:
>>> >>> da.write.mode("overwrite").insertInto(table)
>>> >>>
>>> >>> Any suggestions please?
>>> >>>
>>> >>> Sent from my iPhone
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
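A closing sketch on the partitioned write itself - a common tuning for
this symptom, not something established in the thread: with
spark.sql.shuffle.partitions at 2000, each of the 2000 write tasks can
open a Parquet writer for every year/month/day combination it happens to
hold, producing a flood of small files. Clustering the data by the
partition columns first avoids that; da and table are the placeholder
names from the quoted message:

    import org.apache.spark.sql.functions.col

    // Shuffle rows so each year/month/day combination lands in a small
    // number of tasks instead of being scattered across all 2000
    // shuffle partitions; each task then writes fewer, larger files.
    val clustered = da.repartition(col("year"), col("month"), col("day"))

    clustered.write
      .mode("overwrite")
      .insertInto(table)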