[ https://issues.apache.org/jira/browse/HIVE-29473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Konstantin Bereznyakov updated HIVE-29473:
------------------------------------------
Attachment: HIVE-29473.patch
Status: Patch Available (was: In Progress)
> CBO: LateralViewJoinStatsRule unintentionally combines base table and UDTF
> column stats
> ---------------------------------------------------------------------------------------
>
> Key: HIVE-29473
> URL: https://issues.apache.org/jira/browse/HIVE-29473
> Project: Hive
> Issue Type: Bug
> Components: CBO
> Reporter: Konstantin Bereznyakov
> Assignee: Konstantin Bereznyakov
> Priority: Major
> Labels: pull-request-available
> Attachments: FIXED lateral_view_nested_stats_bug.q.out,
> HIVE-29473.patch, lateral_view_nested_stats_bug.q,
> lateral_view_nested_stats_bug.q.out
>
>
> *Symptom:* When a query contains a {{LATERAL VIEW}} operation, the Cost-Based
> Optimizer (CBO) can generate inaccurate cardinality and data size estimates
> for downstream operators (such as {{Group By}}). This loss of statistical
> accuracy, which typically shows up as artificially inflated row counts and
> data sizes, can lead to suboptimal execution plans, poor join strategy
> selections, and inefficient resource allocation during query execution.
> *Root Cause:* The bug lies in {{LateralViewJoinStatsRule.process()}}.
> Specifically, the rule passes the same {{columnExprMap}} and full
> {{RowSchema}} to both branches.
> Because the UDTF branch is compiled in isolation, its internal column
> generator restarts at 0. The bug manifests during the UDTF branch evaluation
> because the utility method ({{StatsUtils.getColStatisticsFromExprMap}})
> incorrectly matches the UDTF's statistics against the {{SELECT}} branch's
> identical column names (e.g., {{_col0}}, {{_col1}}). This
> direct namespace collision *causes the CBO to combine the statistics of
> completely unrelated columns* (e.g., combining a base table's string key with
> a UDTF's exploded array column via *joinedStats.addToColumnStats()*).
> Because the underlying merge algorithm applies maximum-value semantics to
> overlapping keys, a generated UDTF column with a larger NDV or {{avgColLen}}
> will silently overwrite the base table's true metrics, artificially inflating
> the downstream cardinality and data size estimates.
> *Proposed Fix:* Enforce strict parent operator boundaries before mapping
> statistics. Slicing both the {{RowSchema}} and the {{columnExprMap}} into
> separate collections at the {{SELECT_TAG}} and {{UDTF_TAG}} boundaries
> isolates each branch's namespace. {{StatsUtils}} then evaluates only the
> expressions that belong to that specific branch, which prevents the
> cross-branch namespace collision entirely.
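
The collision and the slicing fix described in the issue can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the code from HIVE-29473.patch: ColStats, resolve, mergeMax, joinBuggy and joinFixed are hypothetical stand-ins for Hive's statistics classes and StatsUtils, expressions are reduced to plain parent column names, and the model assumes the SELECT branch columns come first in the output schema, followed by the UDTF columns.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LateralViewStatsSketch {
    // Hypothetical stand-in for per-column statistics (not Hive's ColStatistics).
    static final class ColStats {
        final long ndv;
        final double avgColLen;
        ColStats(long ndv, double avgColLen) { this.ndv = ndv; this.avgColLen = avgColLen; }
        @Override public String toString() { return "ndv=" + ndv + ", avgColLen=" + avgColLen; }
    }

    // Looks up each output column's source name in ONE parent's stats.
    // The lookup is by name only, so it cannot tell which branch a name belongs to.
    static Map<String, ColStats> resolve(Map<String, ColStats> parentStats,
                                         Map<String, String> exprMap, List<String> schema) {
        Map<String, ColStats> out = new LinkedHashMap<>();
        for (String outCol : schema) {
            String src = exprMap.get(outCol);
            ColStats cs = (src == null) ? null : parentStats.get(src);
            if (cs != null) out.put(outCol, cs);
        }
        return out;
    }

    // Maximum-value merge on overlapping output names, as described in the issue.
    static void mergeMax(Map<String, ColStats> target, Map<String, ColStats> incoming) {
        incoming.forEach((name, cs) -> target.merge(name, cs, (a, b) ->
            new ColStats(Math.max(a.ndv, b.ndv), Math.max(a.avgColLen, b.avgColLen))));
    }

    // Buggy shape: both branches see the full expression map and full schema.
    static Map<String, ColStats> joinBuggy(Map<String, ColStats> selectStats,
            Map<String, ColStats> udtfStats, Map<String, String> exprMap, List<String> schema) {
        Map<String, ColStats> joined = new LinkedHashMap<>();
        mergeMax(joined, resolve(selectStats, exprMap, schema));
        mergeMax(joined, resolve(udtfStats, exprMap, schema));
        return joined;
    }

    // Keeps only the expression-map entries for the given slice of the schema.
    static Map<String, String> slice(Map<String, String> exprMap, List<String> cols) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String c : cols) if (exprMap.containsKey(c)) out.put(c, exprMap.get(c));
        return out;
    }

    // Fixed shape: each branch sees only its own slice of the schema and map.
    static Map<String, ColStats> joinFixed(Map<String, ColStats> selectStats,
            Map<String, ColStats> udtfStats, Map<String, String> exprMap,
            List<String> schema, int numSelectCols) {
        List<String> selectCols = schema.subList(0, numSelectCols);
        List<String> udtfCols = schema.subList(numSelectCols, schema.size());
        Map<String, ColStats> joined = new LinkedHashMap<>();
        mergeMax(joined, resolve(selectStats, slice(exprMap, selectCols), selectCols));
        mergeMax(joined, resolve(udtfStats, slice(exprMap, udtfCols), udtfCols));
        return joined;
    }

    // Made-up fixture: two base-table columns plus one exploded UDTF column.
    static final Map<String, ColStats> SELECT_STATS = new LinkedHashMap<>();
    static final Map<String, ColStats> UDTF_STATS = new LinkedHashMap<>();
    static final Map<String, String> EXPR_MAP = new LinkedHashMap<>();
    static final List<String> SCHEMA = List.of("_col0", "_col1", "_col2");
    static {
        SELECT_STATS.put("_col0", new ColStats(10, 4.0));   // base table string key
        SELECT_STATS.put("_col1", new ColStats(50, 8.0));
        UDTF_STATS.put("_col0", new ColStats(1000, 20.0));  // UDTF numbering restarts at 0
        EXPR_MAP.put("_col0", "_col0");  // output _col0 <- SELECT branch _col0
        EXPR_MAP.put("_col1", "_col1");  // output _col1 <- SELECT branch _col1
        EXPR_MAP.put("_col2", "_col0");  // output _col2 <- UDTF branch _col0
    }

    static Map<String, ColStats> demoBuggy() {
        return joinBuggy(SELECT_STATS, UDTF_STATS, EXPR_MAP, SCHEMA);
    }

    static Map<String, ColStats> demoFixed() {
        return joinFixed(SELECT_STATS, UDTF_STATS, EXPR_MAP, SCHEMA, 2);
    }

    public static void main(String[] args) {
        System.out.println("buggy _col0: " + demoBuggy().get("_col0")); // inflated to ndv=1000
        System.out.println("fixed _col0: " + demoFixed().get("_col0")); // stays at ndv=10
    }
}
```

In this model the base table key (output _col0) picks up the UDTF column's larger NDV and average length in the unsliced version, while the sliced version keeps the original values and assigns the UDTF statistics only to output _col2.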