ooc are the tables partitioned on a.pk and b.fk? Hive might be using copartitioning in that case: it is one of hive's strengths.
2016-06-09 7:28 GMT-07:00 Gourav Sengupta <gourav.sengu...@gmail.com>: > Hi Mich, > > does not Hive use map-reduce? I thought it to be so. And since I am > running the queries in EMR 4.6 therefore HIVE is not using TEZ. > > > Regards, > Gourav > > On Thu, Jun 9, 2016 at 3:25 PM, Mich Talebzadeh <mich.talebza...@gmail.com > > wrote: > >> are you using map-reduce with Hive? >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> >> On 9 June 2016 at 15:14, Gourav Sengupta <gourav.sengu...@gmail.com> >> wrote: >> >>> Hi, >>> >>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening >>> here and is there a way we can optimize the queries in SPARK without the >>> obvious hack in Query2. >>> >>> >>> ----------------------- >>> ENVIRONMENT: >>> ----------------------- >>> >>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3 >>> million rows. Both the files are single gzipped csv file. >>> > Both table A and B are external tables in AWS S3 and created in HIVE >>> accessed through SPARK using HiveContext >>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using >>> allowMaximumResource allocation and node types are c3.4xlarge). >>> >>> -------------- >>> QUERY1: >>> -------------- >>> select A.PK, B.FK >>> from A >>> left outer join B on (A.PK = B.FK) >>> where B.FK is not null; >>> >>> >>> >>> This query takes 4 mins in HIVE and 1.1 hours in SPARK >>> >>> >>> -------------- >>> QUERY 2: >>> -------------- >>> >>> select A.PK, B.FK >>> from (select PK from A) A >>> left outer join B on (A.PK = B.FK) >>> where B.FK is not null; >>> >>> This query takes 4.5 mins in SPARK >>> >>> >>> >>> Regards, >>> Gourav Sengupta >>> >>> >>> >>> >> >