> Yes both of these are valid ways of filtering data before join in Hive.
This has several implementation specifics attached to it. If you're looking at Hive 1.1 or before, it might not work the same way as Vineet mentioned. In older versions Calcite rewrites aren't triggered, which prevented some of the PPD rewrites. This became massively better after the pre-join transforms in Calcite were made as a default, even without statistics. I think this actually had a number of fixes in Hive 1.2.1, which also did column pruning through the join (i.e the condition column is no longer part of the join values). Hive3+ the column pruning is replaced by substitution, so you won't send the whole column through, but evaluate it to a boolean before joining (see TPC-DS query2). > As long as the join is not outer and the ON condition is not on non-null > generating side of join Hive planner will try to push the predicate down to > table scan. Since I learned it the hard way a few years back, here's an example from tpc-ds where the where clause does something different from the on clause. https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/perf/query93.q if the reason condition was an on-clause, then the join would produce different results. Because it is in the where clause, the 'Did not like the warranty' condition changes the join type on optimization. Cheers, Gopal