Yes. because in the second query, you did a (select PK from A) A . I guess it could the the subquery makes the results much smaller and make the broadcastJoin, so it is much faster.
you could use sql.describe() to check the execution plan. On Fri, Jun 10, 2016 at 1:41 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: > Hi, > > I think if we try to see why is Query 2 faster than Query 1 then all the > answers will be given without beating around the bush. That is the right > way to find out what is happening and why. > > > Regards, > Gourav > > On Thu, Jun 9, 2016 at 11:19 PM, Gavin Yue <yue.yuany...@gmail.com> wrote: > >> Could you print out the sql execution plan? My guess is about broadcast >> join. >> >> >> >> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> >> wrote: >> >> Hi, >> >> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here >> and is there a way we can optimize the queries in SPARK without the obvious >> hack in Query2. >> >> >> ----------------------- >> ENVIRONMENT: >> ----------------------- >> >> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3 >> million rows. Both the files are single gzipped csv file. >> > Both table A and B are external tables in AWS S3 and created in HIVE >> accessed through SPARK using HiveContext >> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using >> allowMaximumResource allocation and node types are c3.4xlarge). >> >> -------------- >> QUERY1: >> -------------- >> select A.PK, B.FK >> from A >> left outer join B on (A.PK = B.FK) >> where B.FK is not null; >> >> >> >> This query takes 4 mins in HIVE and 1.1 hours in SPARK >> >> >> -------------- >> QUERY 2: >> -------------- >> >> select A.PK, B.FK >> from (select PK from A) A >> left outer join B on (A.PK = B.FK) >> where B.FK is not null; >> >> This query takes 4.5 mins in SPARK >> >> >> >> Regards, >> Gourav Sengupta >> >> >> >> >