Konstantin Bereznyakov created HIVE-29473:
---------------------------------------------

             Summary: LateralViewJoinStatsRule combines stats of unrelated 
columns on 2+ LV queries, corrupting CBO estimates
                 Key: HIVE-29473
                 URL: https://issues.apache.org/jira/browse/HIVE-29473
             Project: Hive
          Issue Type: Bug
          Components: CBO
            Reporter: Konstantin Bereznyakov
            Assignee: Konstantin Bereznyakov


*Symptom:* When a query contains multiple {{LATERAL VIEW}} clauses (such as 
nested {{posexplode}} calls), the CBO can severely underestimate cardinality 
for downstream operators (such as {{Group By}}). The inaccurate estimates lead 
to suboptimal execution plans, poor join choices, and potential resource 
starvation during execution.

*Root Cause:* The bug lies in {{LateralViewJoinStatsRule.process()}}. When 
merging statistics, the rule passes the global {{columnExprMap}} to 
{{StatsUtils.getColStatisticsFromExprMap}} to evaluate the UDTF branch.

Because the UDTF branch is built in isolation, its internal column generator 
restarts at 0, producing names like {{_col0}} and {{_col1}}. These names 
collide with the base table's internal columns (which also use {{_col0}}, 
etc.). The utility method matches these keys by name alone, which 
*causes the CBO to combine the statistics of completely unrelated columns* 
(e.g., merging the base table's {{id}} column with the UDTF's exploded array 
column). As a result, the UDTF's empty or zeroed statistics silently overwrite 
the base table's healthy statistics.
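
The collision can be illustrated with a minimal sketch that uses plain maps 
instead of Hive's classes. Everything in it is an assumption made for 
illustration: {{resolveByName}} is a hypothetical stand-in for a name-only 
lookup (not the actual {{StatsUtils}} code), and the NDV values are made up.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of the collision; not Hive code.
class ColumnNameCollisionSketch {

    // Hypothetical name-only lookup: for each output column, fetch the
    // statistic of the parent column it references, from one branch's stats.
    static Map<String, Long> resolveByName(Map<String, String> exprMap,
                                           Map<String, Long> branchNdv) {
        Map<String, Long> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : exprMap.entrySet()) {
            Long ndv = branchNdv.get(e.getValue());
            if (ndv != null) {
                resolved.put(e.getKey(), ndv);
            }
        }
        return resolved;
    }

    static Map<String, Long> demo() {
        // Output column -> parent internal name. Both branches number their
        // columns from 0, so both entries reference "_col0".
        Map<String, String> globalExprMap = new LinkedHashMap<>();
        globalExprMap.put("_col0", "_col0"); // id, from the SELECT branch
        globalExprMap.put("_col1", "_col0"); // exploded value, from the UDTF branch

        // Made-up distinct-value counts per branch.
        Map<String, Long> selectNdv = Collections.singletonMap("_col0", 1000L);
        Map<String, Long> udtfNdv = Collections.singletonMap("_col0", 0L);

        // Passing the global map to both branches lets the UDTF pass
        // overwrite the entry that belongs to the base table's id column.
        Map<String, Long> merged =
            new LinkedHashMap<>(resolveByName(globalExprMap, selectNdv));
        merged.putAll(resolveByName(globalExprMap, udtfNdv));
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this model the {{id}} column's entry ends up carrying the UDTF's zeroed 
value, which mirrors the overwrite described above.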

*Proposed Fix:* Enforce strict parent operator boundaries before mapping 
statistics. Slicing both the {{RowSchema}} and the {{columnExprMap}} into 
isolated collections based on the {{SELECT_TAG}} and {{UDTF_TAG}} boundaries 
gives each branch its own namespace. {{StatsUtils}} would then only evaluate 
expressions that actually belong to that specific branch, preventing the 
namespace collision entirely.
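
A rough sketch of the slicing idea, again with plain maps rather than Hive's 
classes. It assumes the output signature lists the SELECT branch's columns 
first and the UDTF branch's columns after them; {{slice}}, {{resolveByName}}, 
{{numSelectCols}} and the NDV values are hypothetical names and data, not the 
actual patch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of per-branch slicing; not Hive code.
class BranchSlicingSketch {

    // Keep only the exprMap entries whose output column lies in
    // signature[from, to), i.e. the columns owned by one branch.
    static Map<String, String> slice(List<String> signature,
                                     Map<String, String> exprMap,
                                     int from, int to) {
        Map<String, String> sliced = new LinkedHashMap<>();
        for (String col : signature.subList(from, to)) {
            String parentCol = exprMap.get(col);
            if (parentCol != null) {
                sliced.put(col, parentCol);
            }
        }
        return sliced;
    }

    // Hypothetical name-only lookup against a single branch's stats.
    static Map<String, Long> resolveByName(Map<String, String> exprMap,
                                           Map<String, Long> branchNdv) {
        Map<String, Long> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : exprMap.entrySet()) {
            Long ndv = branchNdv.get(e.getValue());
            if (ndv != null) {
                resolved.put(e.getKey(), ndv);
            }
        }
        return resolved;
    }

    static Map<String, Long> demo() {
        List<String> signature = Arrays.asList("_col0", "_col1");
        int numSelectCols = 1; // assumed boundary between the two branches

        Map<String, String> globalExprMap = new LinkedHashMap<>();
        globalExprMap.put("_col0", "_col0"); // id, from the SELECT branch
        globalExprMap.put("_col1", "_col0"); // exploded value, from the UDTF branch

        Map<String, Long> selectNdv = Collections.singletonMap("_col0", 1000L);
        Map<String, Long> udtfNdv = Collections.singletonMap("_col0", 0L);

        // Each branch only sees the entries for its own output columns.
        Map<String, Long> merged = new LinkedHashMap<>();
        merged.putAll(resolveByName(
            slice(signature, globalExprMap, 0, numSelectCols), selectNdv));
        merged.putAll(resolveByName(
            slice(signature, globalExprMap, numSelectCols, signature.size()), udtfNdv));
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With the map sliced per branch, the {{id}} column keeps the SELECT branch's 
value and only the exploded column receives the UDTF's value.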



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
