Hi, I ran the following query in Spark (latest master codebase) and it took a long time to complete even though it was a broadcast hash join.
It appears that the limit is computed only after the complete join has been evaluated. Shouldn't the limit be pushed down into the BroadcastHashJoin, so that it stops processing after producing 10 rows? Please let me know if my understanding of this is wrong.

select l_partkey from lineitem, partsupp where ps_partkey=l_partkey limit 10;

== Physical Plan ==
CollectLimit 10
+- WholeStageCodegen
   :  +- Project [l_partkey#893]
   :     +- BroadcastHashJoin [l_partkey#893], [ps_partkey#908], Inner, BuildRight, None
   :        :- Project [l_partkey#893]
   :        :  +- Filter isnotnull(l_partkey#893)
   :        :     +- Scan HadoopFiles[l_partkey#893] Format: ORC, PushedFilters: [IsNotNull(l_partkey)], ReadSchema: struct<l_partkey:int>
   :        +- INPUT
   +- BroadcastExchange HashedRelationBroadcastMode(true,List(cast(ps_partkey#908 as bigint)),List(ps_partkey#908))
      +- WholeStageCodegen
         :  +- Project [ps_partkey#908]
         :     +- Filter isnotnull(ps_partkey#908)
         :        +- Scan HadoopFiles[ps_partkey#908] Format: ORC, PushedFilters: [IsNotNull(ps_partkey)], ReadSchema: struct<ps_partkey:int>

-- ~Rajesh.B
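To illustrate the idea behind the question, here is a toy Python sketch (not Spark internals, and all names in it are made up for the example): a broadcast hash join builds a hash table from the small side and streams the probe side through it. If the downstream limit consumes the join output lazily, the probe side stops being scanned as soon as enough rows have been produced.

```python
from itertools import islice

def broadcast_hash_join(probe_rows, build_rows):
    """Toy model: build a hash table on the small side, stream the probe side."""
    build_table = {}
    for key in build_rows:
        build_table.setdefault(key, []).append(key)
    for key in probe_rows:              # probe rows are pulled lazily, one at a time
        for _match in build_table.get(key, []):
            yield key                   # emit one joined row per match

probed = []                             # records every probe row actually consumed

def probe_side():
    """Instrumented probe iterator so we can count how many rows the join read."""
    for i in range(1_000_000):          # a "large" table
        probed.append(i)
        yield i % 100

build_side = range(100)                 # the "small" broadcast side; every key matches

# A lazy "limit 10" over the join: islice stops pulling after 10 output rows,
# so only 10 probe rows are ever scanned instead of 1,000,000.
first_10 = list(islice(broadcast_hash_join(probe_side(), build_side), 10))
print(len(first_10), len(probed))       # 10 output rows, 10 probe rows consumed
```

This is only a sketch of why early termination would help; whether Spark's whole-stage codegen can actually short-circuit the join like this is exactly the question above.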