[ 
https://issues.apache.org/jira/browse/HIVE-29473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Bereznyakov updated HIVE-29473:
------------------------------------------
    Summary: CBO: LateralViewJoinStatsRule unintentionally combines base table 
and UDTF column stats  (was: CBO: t is perfect.  "LateralViewJoinStatsRule 
unintentionally combines base table and UDTF column stats)

> CBO: LateralViewJoinStatsRule unintentionally combines base table and UDTF 
> column stats
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-29473
>                 URL: https://issues.apache.org/jira/browse/HIVE-29473
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>            Reporter: Konstantin Bereznyakov
>            Assignee: Konstantin Bereznyakov
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: FIXED lateral_view_nested_stats_bug.q.out, 
> lateral_view_nested_stats_bug.q, lateral_view_nested_stats_bug.q.out
>
>
> *Symptom:* When a query contains a {{LATERAL VIEW}} operation, the Cost-Based 
> Optimizer (CBO) can generate inaccurate cardinality and data size estimates 
> for downstream operators (such as {{Group By}}). The inaccuracy typically 
> shows up as inflated row counts and data sizes, and it can lead to suboptimal 
> execution plans, poor join strategy selection, and inefficient resource 
> allocation during query execution.
> *Root Cause:* The bug lies in {{LateralViewJoinStatsRule.process()}}. 
> Specifically, the rule passes the same {{columnExprMap}} and the full 
> {{RowSchema}} to both parent branches.
> Because the UDTF branch is compiled in isolation, its internal column 
> generator restarts at 0, so its column names (e.g., {{_col0}}, {{_col1}}) 
> are identical to those of the {{SELECT}} branch. During UDTF branch 
> evaluation, the utility method {{StatsUtils.getColStatisticsFromExprMap}} 
> therefore matches the UDTF's statistics against the {{SELECT}} branch's 
> identical column names. This direct namespace collision *causes the CBO to 
> combine the statistics of completely unrelated columns* (e.g., combining a 
> base table's string key with a UDTF's exploded array column). Because the 
> underlying merge algorithm applies maximum-value semantics to overlapping 
> keys, a generated UDTF column with a larger NDV or {{avgColLen}} silently 
> overwrites the base table's true metrics, inflating the downstream 
> cardinality and data size estimates.
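
To make the collision concrete, here is a minimal Java sketch of a name-keyed
merge with maximum-value semantics. It is not Hive code: ColStat and
mergeByName are hypothetical stand-ins invented for illustration, and the
numbers are made up. It only shows how two unrelated columns that share the
name _col0 end up as one entry carrying the larger NDV and avgColLen.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StatsCollisionSketch {
    // Hypothetical stand-in for per-column statistics (not Hive's own class).
    public static final class ColStat {
        public final long ndv;
        public final double avgColLen;

        public ColStat(long ndv, double avgColLen) {
            this.ndv = ndv;
            this.avgColLen = avgColLen;
        }
    }

    // Merge two stat maps by column name. On a name clash, keep the larger
    // NDV and the larger avgColLen (assumed maximum-value semantics).
    public static Map<String, ColStat> mergeByName(Map<String, ColStat> a,
                                                   Map<String, ColStat> b) {
        Map<String, ColStat> out = new LinkedHashMap<>(a);
        for (Map.Entry<String, ColStat> e : b.entrySet()) {
            out.merge(e.getKey(), e.getValue(), (x, y) -> new ColStat(
                Math.max(x.ndv, y.ndv),
                Math.max(x.avgColLen, y.avgColLen)));
        }
        return out;
    }

    public static void main(String[] args) {
        // SELECT branch: _col0 is a base table string key (illustrative values).
        Map<String, ColStat> selectBranch = new LinkedHashMap<>();
        selectBranch.put("_col0", new ColStat(10, 4.0));

        // UDTF branch: _col0 is an exploded array element (illustrative values).
        Map<String, ColStat> udtfBranch = new LinkedHashMap<>();
        udtfBranch.put("_col0", new ColStat(5000, 32.0));

        ColStat merged = mergeByName(selectBranch, udtfBranch).get("_col0");
        // prints "ndv=5000 avgColLen=32.0": the base table's metrics are gone
        System.out.println("ndv=" + merged.ndv + " avgColLen=" + merged.avgColLen);
    }
}
```
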
> *Proposed Fix:* Enforce strict parent operator boundaries before mapping 
> statistics. Slicing both the {{RowSchema}} and the {{columnExprMap}} into 
> separate collections at the {{SELECT_TAG}} and {{UDTF_TAG}} boundaries 
> isolates each branch's namespace. {{StatsUtils}} then evaluates only the 
> expressions that belong to the branch being processed, which prevents the 
> cross-branch namespace collision entirely.
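
The slicing idea can be sketched in plain Java as follows. This is an
illustration under stated assumptions, not the actual patch: sliceSchema and
sliceExprMap are hypothetical helpers, the schema is modelled as a list of
output column names, the expression map as output name to parent-side column
name, and the join's output is assumed to list the SELECT branch's columns
first, followed by the UDTF's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BranchSliceSketch {
    // Return the output columns in positions [from, to) of the join schema.
    public static List<String> sliceSchema(List<String> outputCols, int from, int to) {
        return new ArrayList<>(outputCols.subList(from, to));
    }

    // Keep only the expression-map entries whose output column is in the slice.
    public static Map<String, String> sliceExprMap(Map<String, String> exprMap,
                                                   List<String> cols) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String col : cols) {
            if (exprMap.containsKey(col)) {
                out.put(col, exprMap.get(col));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Join output: two SELECT columns, then one UDTF column (assumed layout).
        List<String> outputCols = Arrays.asList("_col0", "_col1", "_col2");
        int numSelectCols = 2; // hypothetical boundary between the branches

        // Output column -> column name inside the parent branch. Note that
        // _col0 (SELECT) and _col2 (UDTF) both point at a parent "_col0".
        Map<String, String> exprMap = new LinkedHashMap<>();
        exprMap.put("_col0", "_col0");
        exprMap.put("_col1", "_col1");
        exprMap.put("_col2", "_col0");

        List<String> udtfCols = sliceSchema(outputCols, numSelectCols, outputCols.size());
        // prints "{_col2=_col0}": only the UDTF's own mapping remains
        System.out.println(sliceExprMap(exprMap, udtfCols));
    }
}
```

With the full map, a lookup against the UDTF branch's statistics would also
resolve the SELECT entry for _col0; with the sliced map, that entry is never
consulted.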



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
