[ 
https://issues.apache.org/jira/browse/HIVE-29473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Bereznyakov updated HIVE-29473:
------------------------------------------
    Attachment: HIVE-29473.patch
        Status: Patch Available  (was: In Progress)

> CBO: LateralViewJoinStatsRule unintentionally combines base table and UDTF 
> column stats
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-29473
>                 URL: https://issues.apache.org/jira/browse/HIVE-29473
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>            Reporter: Konstantin Bereznyakov
>            Assignee: Konstantin Bereznyakov
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: FIXED lateral_view_nested_stats_bug.q.out, 
> HIVE-29473.patch, lateral_view_nested_stats_bug.q, 
> lateral_view_nested_stats_bug.q.out
>
>
> *Symptom:* When a query contains a {{LATERAL VIEW}} operation, the Cost-Based 
> Optimizer (CBO) can generate inaccurate cardinality and data size estimates 
> for downstream operators (such as {{Group By}}). The inaccuracy typically 
> shows up as artificially inflated row counts and data sizes, and it can 
> lead to suboptimal execution plans, poor join strategy selections, and 
> inefficient resource allocation during query execution.
> *Root Cause:* The bug lies in {{LateralViewJoinStatsRule.process()}}. 
> Specifically, the rule passes the same {{columnExprMap}} and full 
> {{RowSchema}} to both parent branches ({{SELECT}} and UDTF).
> Because the UDTF branch is compiled in isolation, its internal column 
> numbering restarts at 0, so both branches use identical internal column 
> names (e.g., {{_col0}}, {{_col1}}, etc.). During UDTF branch evaluation, 
> the utility method ({{StatsUtils.getColStatisticsFromExprMap}}) therefore 
> incorrectly matches the UDTF's statistics against the {{SELECT}} branch's 
> identically named columns. This 
> direct namespace collision *causes the CBO to combine the statistics of 
> completely unrelated columns* (e.g., combining a base table's string key with 
> a UDTF's exploded array column via *joinedStats.addToColumnStats()*). 
> Because the underlying merge algorithm applies maximum-value semantics to 
> overlapping keys, a generated UDTF column with a larger NDV or {{avgColLen}} 
> will silently overwrite the base table's true metrics, artificially inflating 
> the downstream cardinality and data size estimates.
> *Proposed Fix:* Enforce strict parent operator boundaries before mapping 
> statistics. Slicing both the {{RowSchema}} and the {{columnExprMap}} into 
> separate collections at the {{SELECT_TAG}} and {{UDTF_TAG}} boundaries 
> isolates each branch's namespace. {{StatsUtils}} will then only 
> evaluate expressions that belong to the branch being processed, 
> preventing the cross-branch namespace collision entirely.
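
The collision and the proposed slicing can be illustrated with a small standalone Java sketch. It does not use Hive's classes: ColStat, resolve, join, and the sample statistics are hypothetical stand-ins, and treating the first N output columns as belonging to the SELECT branch is an assumption made for illustration, not a description of the attached patch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LateralViewStatsSketch {

    /** Hypothetical stand-in for per-column statistics (not Hive's ColStatistics). */
    static final class ColStat {
        final long ndv;
        final double avgColLen;

        ColStat(long ndv, double avgColLen) {
            this.ndv = ndv;
            this.avgColLen = avgColLen;
        }

        /** Maximum-value merge, mirroring the semantics described in the issue. */
        ColStat mergeMax(ColStat other) {
            return new ColStat(Math.max(ndv, other.ndv), Math.max(avgColLen, other.avgColLen));
        }

        @Override
        public String toString() {
            return "ndv=" + ndv + ", avgColLen=" + avgColLen;
        }
    }

    // Output columns of the lateral view join: "key" comes from the SELECT
    // branch, "item" from the UDTF branch (assumed layout: SELECT columns first).
    static final List<String> OUTPUT_COLS = Arrays.asList("key", "item");
    static final int NUM_SELECT_COLS = 1;

    // Output column -> internal name in its parent branch. Both parents
    // number their columns from 0, so both entries point at "_col0".
    static final Map<String, String> EXPR_MAP = new LinkedHashMap<>();
    static final Map<String, ColStat> SELECT_STATS = new LinkedHashMap<>();
    static final Map<String, ColStat> UDTF_STATS = new LinkedHashMap<>();
    static {
        EXPR_MAP.put("key", "_col0");
        EXPR_MAP.put("item", "_col0");
        SELECT_STATS.put("_col0", new ColStat(10, 4.0));   // base table string key
        UDTF_STATS.put("_col0", new ColStat(500, 32.0));   // exploded array column
    }

    /** Looks up stats for the given output columns in one branch's stats by internal name. */
    static Map<String, ColStat> resolve(List<String> outputCols, Map<String, ColStat> branchStats) {
        Map<String, ColStat> resolved = new LinkedHashMap<>();
        for (String out : outputCols) {
            String internal = EXPR_MAP.get(out);
            ColStat stat = internal == null ? null : branchStats.get(internal);
            if (stat != null) {
                resolved.put(out, stat);
            }
        }
        return resolved;
    }

    /** Combines both branches; overlapping output columns are merged with max semantics. */
    static Map<String, ColStat> join(Map<String, ColStat> select, Map<String, ColStat> udtf) {
        Map<String, ColStat> joined = new LinkedHashMap<>(select);
        udtf.forEach((col, stat) -> joined.merge(col, stat, ColStat::mergeMax));
        return joined;
    }

    /** Problem case: every output column is resolved against both branches. */
    static Map<String, ColStat> unsliced() {
        return join(resolve(OUTPUT_COLS, SELECT_STATS), resolve(OUTPUT_COLS, UDTF_STATS));
    }

    /** Sliced case: each branch only resolves the output columns it produced. */
    static Map<String, ColStat> sliced() {
        List<String> selectCols = OUTPUT_COLS.subList(0, NUM_SELECT_COLS);
        List<String> udtfCols = OUTPUT_COLS.subList(NUM_SELECT_COLS, OUTPUT_COLS.size());
        return join(resolve(selectCols, SELECT_STATS), resolve(udtfCols, UDTF_STATS));
    }

    public static void main(String[] args) {
        System.out.println("unsliced: " + unsliced()); // "key" picks up the UDTF column's larger stats
        System.out.println("sliced:   " + sliced());   // "key" keeps the base table's stats
    }
}
```

In the unsliced case the base table's key ends up with the UDTF column's NDV and average length, which is the inflation described under Root Cause. Restricting each branch to its own slice of the output columns leaves the key's statistics untouched.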



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
