> Yes both of these are valid ways of filtering data before join in Hive.

This has several implementation specifics attached to it. If you're looking at 
Hive 1.1 or before, it might not work the same way as Vineet mentioned.

In older versions Calcite rewrites aren't triggered, which prevented some of 
the PPD rewrites.

This became massively better after the pre-join transforms in Calcite were made 
as a default, even without statistics.

I think this actually had a number of fixes in Hive 1.2.1, which also did 
column pruning through the join (i.e the condition column is no longer part of 
the join values).

Hive3+ the column pruning is replaced by substitution, so you won't send the 
whole column through, but evaluate it to a boolean before joining (see TPC-DS 
query2).

> As long as the join is not outer and the ON condition is not on non-null 
> generating side of join Hive planner will try to push the predicate down to 
> table scan.

Since I learned it the hard  way a few years back, here's an example from 
tpc-ds where the where clause does something different from the on clause.

https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/perf/query93.q

if the reason condition was an on-clause, then the join would produce different 
results. Because it is in the where clause, the 'Did not like the warranty' 
condition changes the join type on optimization.

Cheers,
Gopal


Reply via email to