Re: HIVE Query 25x faster than SPARK Query

Stephen Boesch Thu, 09 Jun 2016 09:11:51 -0700

ooc are the tables partitioned on a.pk and b.fk?  Hive might be using
copartitioning in that case: it is one of hive's strengths.


2016-06-09 7:28 GMT-07:00 Gourav Sengupta <gourav.sengu...@gmail.com>:

> Hi Mich,
>
> does not Hive use map-reduce? I thought it to be so. And since I am
> running the queries in EMR 4.6 therefore HIVE is not using TEZ.
>
>
> Regards,
> Gourav
>
> On Thu, Jun 9, 2016 at 3:25 PM, Mich Talebzadeh <mich.talebza...@gmail.com
> > wrote:
>
>> are you using map-reduce with Hive?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 9 June 2016 at 15:14, Gourav Sengupta <gourav.sengu...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>> here and is there a way we can optimize the queries in SPARK without the
>>> obvious hack in Query2.
>>>
>>>
>>> -----------------------
>>> ENVIRONMENT:
>>> -----------------------
>>>
>>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>>> million rows. Both the files are single gzipped csv file.
>>> > Both table A and B are external tables in AWS S3 and created in HIVE
>>> accessed through SPARK using HiveContext
>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>> allowMaximumResource allocation and node types are c3.4xlarge).
>>>
>>> --------------
>>> QUERY1:
>>> --------------
>>> select A.PK, B.FK
>>> from A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>>
>>>
>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>>
>>>
>>> --------------
>>> QUERY 2:
>>> --------------
>>>
>>> select A.PK, B.FK
>>> from (select PK from A) A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>> This query takes 4.5 mins in SPARK
>>>
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>>
>>>
>>>
>>
>

Re: HIVE Query 25x faster than SPARK Query

Reply via email to