[
https://issues.apache.org/jira/browse/HIVE-29473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Bereznyakov updated HIVE-29473:
------------------------------------------
Summary: CBO: LateralViewJoinStatsRule unintentionally
combines base table and UDTF column stats (was: CBO:
LateralViewJoinStatsRule unintentionally merges base table and UDTF column
stats)
> CBO: LateralViewJoinStatsRule unintentionally combines base
> table and UDTF column stats
> -------------------------------------------------------------------------------------------------------
>
> Key: HIVE-29473
> URL: https://issues.apache.org/jira/browse/HIVE-29473
> Project: Hive
> Issue Type: Bug
> Components: CBO
> Reporter: Konstantin Bereznyakov
> Assignee: Konstantin Bereznyakov
> Priority: Major
> Labels: pull-request-available
> Attachments: FIXED lateral_view_nested_stats_bug.q.out,
> lateral_view_nested_stats_bug.q, lateral_view_nested_stats_bug.q.out
>
>
> *Symptom:* When a query contains multiple {{LATERAL VIEW}}s (such as
> nested {{posexplode}}s), the CBO can severely underestimate cardinality
> for downstream operators (like {{Group By}}). This loss of
> statistical accuracy leads to suboptimal execution plans, poor join choices,
> and potential resource starvation during execution.
> *Root Cause:* The bug lies in {{LateralViewJoinStatsRule.process()}}.
> When merging statistics, the rule passes the global {{columnExprMap}} to
> {{StatsUtils.getColStatisticsFromExprMap}} to evaluate the UDTF branch.
> Because the UDTF branch is built in isolation, its internal column generator
> restarts at 0, producing names like {{_col0}} and {{_col1}}. This creates
> a namespace collision with the base table's internal columns (which also use
> {{_col0}}, etc.). The utility method blindly matches these keys, which
> *causes the CBO to combine the statistics of completely unrelated columns*
> (e.g., merging the base table's {{id}} column with the UDTF's exploded array
> column). As a result, the UDTF's empty or zeroed statistics silently
> overwrite the base table's healthy statistics.
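The collision described above can be illustrated with a toy model. This is only a sketch: plain Python dicts stand in for Hive's column statistics and for {{columnExprMap}}, the function name and the statistics values are made up for illustration, and the real rule may differ in detail.

```python
# Toy model only: plain dicts stand in for Hive's column statistics and
# columnExprMap; none of these names are Hive APIs.

# Stats known for each parent branch, keyed by that branch's internal names.
select_branch_stats = {"_col0": {"source": "id", "ndv": 1000}}
udtf_branch_stats = {"_col0": {"source": "exploded", "ndv": 0}}

# Global expression map of the lateral view join:
# output column -> (owning branch, internal column name in that branch).
column_expr_map = {
    "_col0": ("SELECT", "_col0"),  # base table id
    "_col1": ("UDTF", "_col0"),    # exploded array element
}


def lookup_ignoring_branch(expr_map, parent_stats):
    """Resolve every entry against one parent's stats, matching on name alone."""
    resolved = {}
    for out_col, (_branch, parent_col) in expr_map.items():
        if parent_col in parent_stats:
            resolved[out_col] = parent_stats[parent_col]
    return resolved


# Evaluating the UDTF branch with the global map also "resolves" the base
# table's output column, because both branches call their first column _col0.
udtf_view = lookup_ignoring_branch(column_expr_map, udtf_branch_stats)

merged = {"_col0": select_branch_stats["_col0"]}
merged.update(udtf_view)  # the zeroed UDTF stats overwrite the healthy ones
print(merged["_col0"])
```

In this model the base table's {{id}} column ends up carrying the UDTF column's zeroed statistics, which is the kind of silent overwrite the description reports.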
> *Proposed Fix:* Enforce strict parent operator boundaries before mapping
> statistics. By slicing both the {{RowSchema}} and the {{columnExprMap}} into
> isolated collections based on the {{SELECT_TAG}} and {{UDTF_TAG}} boundaries,
> we create a firewall. {{StatsUtils}} will now only evaluate expressions that
> belong to the branch being processed, preventing the namespace
> collision entirely.
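A minimal sketch of the slicing idea, under stated assumptions: a list and a dict stand in for {{RowSchema}} and {{columnExprMap}}, the output schema is assumed to list the SELECT branch's columns before the UDTF branch's, and the helper names {{slice_for_branch}} and {{stats_for_branch}} are hypothetical. It is not the actual patch.

```python
# Toy model only: lists and dicts stand in for RowSchema and columnExprMap;
# the helper names below are hypothetical, not Hive APIs.

SELECT_TAG, UDTF_TAG = 0, 1  # labels for the two parent branches in this sketch

select_branch_stats = {"_col0": {"source": "id", "ndv": 1000}}
udtf_branch_stats = {"_col0": {"source": "exploded", "ndv": 0}}

# Output schema of the join, assumed here to list SELECT columns first.
row_schema = ["_col0", "_col1"]
num_select_cols = 1

# Output column -> internal column name in whichever parent produced it.
column_expr_map = {"_col0": "_col0", "_col1": "_col0"}


def slice_for_branch(schema, expr_map, n_select, tag):
    """Return the schema slice and expr-map slice owned by one parent branch."""
    cols = schema[:n_select] if tag == SELECT_TAG else schema[n_select:]
    return cols, {c: expr_map[c] for c in cols}


def stats_for_branch(expr_slice, parent_stats):
    """Resolve only the expressions that belong to this branch."""
    return {
        out: parent_stats[src]
        for out, src in expr_slice.items()
        if src in parent_stats
    }


merged = {}
for tag, parent_stats in (
    (SELECT_TAG, select_branch_stats),
    (UDTF_TAG, udtf_branch_stats),
):
    _cols, expr_slice = slice_for_branch(
        row_schema, column_expr_map, num_select_cols, tag
    )
    merged.update(stats_for_branch(expr_slice, parent_stats))

print(merged)
```

Because each branch only ever sees its own slice of the map, the shared internal name {{_col0}} can no longer route the UDTF's statistics onto the base table's output column.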
--
This message was sent by Atlassian Jira
(v8.20.10#820010)