Hi,

Query1 is almost 25x faster in HIVE than in SPARK. What is happening here, and is there a way to optimize the queries in SPARK without the obvious hack in Query2?
-----------------------
ENVIRONMENT:
-----------------------
> Table A has 533 columns x 24 million rows and Table B has 2 columns x 3 million rows. Each table is stored as a single gzipped CSV file.
> Both tables A and B are external tables in AWS S3, created in HIVE and accessed through SPARK using HiveContext.
> EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started with maximizeResourceAllocation; node types are c3.4xlarge).

--------------
QUERY1:
--------------
select A.PK, B.FK
from A
left outer join B on (A.PK = B.FK)
where B.FK is not null;

This query takes 4 mins in HIVE and 1.1 hours in SPARK.

--------------
QUERY2:
--------------
select A.PK, B.FK
from (select PK from A) A
left outer join B on (A.PK = B.FK)
where B.FK is not null;

This query takes 4.5 mins in SPARK.

Regards,
Gourav Sengupta