Hi,

Query1 is roughly 16x faster in Hive than in Spark (4 minutes versus 1.1
hours). What is happening here, and is there a way to optimize the query in
Spark without the obvious hack in Query2?


-----------------------
ENVIRONMENT:
-----------------------

> Table A has 533 columns x 24 million rows and Table B has 2 columns x 3
million rows. Each table is stored as a single gzipped CSV file.
> Both tables A and B are external tables in AWS S3, created in Hive and
accessed from Spark through HiveContext.
> EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started with
maximizeResourceAllocation; node types are c3.4xlarge).

--------------
QUERY1:
--------------
select A.PK, B.FK
from A
left outer join B on (A.PK = B.FK)
where B.FK is not null;



This query takes 4 minutes in Hive and 1.1 hours in Spark.
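
As a side note, the filter on B.FK discards every row that the left outer join pads with NULLs, so Query1 should return the same rows as a plain inner join. Below is a minimal sketch that checks this on toy data. It uses Python's built-in sqlite3 as a stand-in, not Hive or Spark, and the column OTHER and the sample rows are made up for illustration. It says nothing about how either engine plans the query.

```python
import sqlite3

# Toy stand-ins for tables A and B (hypothetical columns and rows).
conn = sqlite3.connect(":memory:")
conn.execute("create table A (PK integer, OTHER text)")
conn.execute("create table B (FK integer, OTHER text)")
conn.executemany("insert into A values (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.executemany("insert into B values (?, ?)", [(2, "x"), (3, "y"), (9, "z")])

# Query1 as written in this mail.
left_join_filtered = sorted(conn.execute("""
    select A.PK, B.FK
    from A
    left outer join B on (A.PK = B.FK)
    where B.FK is not null
""").fetchall())

# The same request expressed as an inner join.
inner_join = sorted(conn.execute("""
    select A.PK, B.FK
    from A
    join B on (A.PK = B.FK)
""").fetchall())

print(left_join_filtered == inner_join)
```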


--------------
QUERY2:
--------------

select A.PK, B.FK
from (select PK from A) A
left outer join B on (A.PK = B.FK)
where B.FK is not null;

This query takes 4.5 minutes in Spark.
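
To be clear that the hack only changes how the query is written and not what it returns, here is a small sketch that runs both queries against toy data and compares the results. It uses Python's built-in sqlite3 as a stand-in for the real engines. The extra columns C1, C2 and V and the sample rows are assumptions for illustration only.

```python
import sqlite3

QUERY1 = """
select A.PK, B.FK
from A
left outer join B on (A.PK = B.FK)
where B.FK is not null
"""

QUERY2 = """
select A.PK, B.FK
from (select PK from A) A
left outer join B on (A.PK = B.FK)
where B.FK is not null
"""

def build_toy_db():
    """Create small in-memory versions of A and B (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table A (PK integer, C1 text, C2 text)")
    conn.execute("create table B (FK integer, V text)")
    conn.executemany(
        "insert into A values (?, ?, ?)",
        [(1, "a", "x"), (2, "b", "y"), (3, "c", "z"), (4, "d", "w")],
    )
    conn.executemany(
        "insert into B values (?, ?)",
        [(2, "p"), (4, "q"), (5, "r")],
    )
    return conn

def run_sorted(conn, sql):
    """Run a query and return its rows in a stable order for comparison."""
    return sorted(conn.execute(sql).fetchall())

conn = build_toy_db()
rows1 = run_sorted(conn, QUERY1)
rows2 = run_sorted(conn, QUERY2)
print(rows1 == rows2)
```

The only difference between the two queries is that Query2 selects PK from A in a subquery before the join, so any speedup has to come from how the engine handles the other 532 columns of A.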



Regards,
Gourav Sengupta
