[
https://issues.apache.org/jira/browse/HIVE-29473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Bereznyakov updated HIVE-29473:
------------------------------------------
Description:
*Symptom:* When a query contains a {{LATERAL VIEW}} operation, the Cost-Based
Optimizer (CBO) can generate inaccurate cardinality and data size estimates for
downstream operators (such as {{Group By}}). The inaccuracy typically shows up
as artificially inflated row counts and data sizes, and can lead to suboptimal
execution plans, poor join strategy selections, and inefficient resource
allocation during query execution.
*Root Cause:* The bug lies in {{LateralViewJoinStatsRule.process()}}.
Specifically, the rule passes the same {{columnExprMap}} and the full
{{RowSchema}} to both parent branches (the {{SELECT}} branch and the UDTF
branch).
Because the UDTF branch is compiled in isolation, its internal column-name
generator restarts at 0, so it produces the same internal names as the
{{SELECT}} branch (e.g., {{_col0}}, {{_col1}}). During UDTF branch
evaluation, the utility method {{StatsUtils.getColStatisticsFromExprMap}}
therefore matches the UDTF's statistics against the {{SELECT}} branch's
identically named columns. This namespace collision *causes the CBO to combine
the statistics of completely unrelated columns* (e.g., combining a base table's
string key with a UDTF's exploded array column). Because the underlying merge
algorithm applies maximum-value semantics to overlapping keys, a generated UDTF
column with a larger NDV or {{avgColLen}} silently overwrites the base table's
true metrics, artificially inflating the downstream cardinality and data size
estimates.
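
The collision described above can be illustrated with a minimal, self-contained Java sketch. It does not use Hive's classes: {{ColStat}}, {{mergeMax}}, and the sample numbers are hypothetical stand-ins, and the merge only mimics the maximum-value semantics mentioned in the root cause.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LateralViewCollisionSketch {

    /** Hypothetical stand-in for per-column statistics (not Hive's own class). */
    public static final class ColStat {
        public final long ndv;
        public final double avgColLen;

        public ColStat(long ndv, double avgColLen) {
            this.ndv = ndv;
            this.avgColLen = avgColLen;
        }
    }

    /**
     * Merges two stats maps keyed by internal column name. When both maps
     * contain the same key, the larger value of each field wins, which is
     * the assumed "maximum-value semantics".
     */
    public static Map<String, ColStat> mergeMax(Map<String, ColStat> left,
                                                Map<String, ColStat> right) {
        Map<String, ColStat> merged = new LinkedHashMap<>(left);
        for (Map.Entry<String, ColStat> e : right.entrySet()) {
            merged.merge(e.getKey(), e.getValue(), (a, b) ->
                new ColStat(Math.max(a.ndv, b.ndv),
                            Math.max(a.avgColLen, b.avgColLen)));
        }
        return merged;
    }

    public static void main(String[] args) {
        // SELECT branch: _col0 is a low-cardinality string key (made-up numbers).
        Map<String, ColStat> selectBranch = new LinkedHashMap<>();
        selectBranch.put("_col0", new ColStat(10, 4.0));

        // UDTF branch: its name generator also starts at _col0, but the column
        // is an unrelated exploded array element (made-up numbers).
        Map<String, ColStat> udtfBranch = new LinkedHashMap<>();
        udtfBranch.put("_col0", new ColStat(5000, 32.0));

        // Both branches share the key "_col0", so the merge combines them.
        ColStat result = mergeMax(selectBranch, udtfBranch).get("_col0");
        System.out.println("ndv=" + result.ndv + " avgColLen=" + result.avgColLen);
    }
}
```

In this sketch the base table's {{_col0}} ends up carrying the UDTF column's larger NDV and {{avgColLen}}, which is the kind of inflation the symptom describes.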
*Proposed Fix:* Enforce strict parent operator boundaries before mapping
statistics. Slicing both the {{RowSchema}} and the {{columnExprMap}} into
per-branch collections at the {{SELECT_TAG}} and {{UDTF_TAG}} boundaries
isolates the two namespaces. {{StatsUtils}} then evaluates only the
expressions that belong to the branch being processed, which prevents the
cross-branch namespace collision entirely.
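
The slicing idea can be sketched as follows. This is only an illustration under stated assumptions, not the actual patch: {{BranchSlice}} and {{slice}} are hypothetical helpers, expressions are reduced to plain parent column names, and the sketch assumes the join output lists the {{SELECT}} branch's columns first, followed by the UDTF branch's columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BranchSlicingSketch {

    /** Hypothetical holder for the part of the join output owned by one branch. */
    public static final class BranchSlice {
        public final List<String> schema;
        public final Map<String, String> exprMap;

        public BranchSlice(List<String> schema, Map<String, String> exprMap) {
            this.schema = schema;
            this.exprMap = exprMap;
        }
    }

    /**
     * Returns the output columns in [from, to) together with only those
     * expression-map entries that belong to them.
     */
    public static BranchSlice slice(List<String> outputSchema,
                                    Map<String, String> columnExprMap,
                                    int from, int to) {
        List<String> schema = new ArrayList<>(outputSchema.subList(from, to));
        Map<String, String> exprMap = new LinkedHashMap<>();
        for (String outputCol : schema) {
            String parentCol = columnExprMap.get(outputCol);
            if (parentCol != null) {
                exprMap.put(outputCol, parentCol);
            }
        }
        return new BranchSlice(schema, exprMap);
    }

    public static void main(String[] args) {
        // Join output schema: two SELECT columns, then two UDTF columns (assumed order).
        List<String> outputSchema = Arrays.asList("_col0", "_col1", "_col2", "_col3");

        // Output column -> parent column name. Both parents use _col0/_col1,
        // which is exactly why the unsliced map is ambiguous.
        Map<String, String> columnExprMap = new LinkedHashMap<>();
        columnExprMap.put("_col0", "_col0"); // from the SELECT branch
        columnExprMap.put("_col1", "_col1"); // from the SELECT branch
        columnExprMap.put("_col2", "_col0"); // from the UDTF branch
        columnExprMap.put("_col3", "_col1"); // from the UDTF branch

        int numSelectCols = 2; // assumed boundary between the two branches
        BranchSlice selectSlice = slice(outputSchema, columnExprMap, 0, numSelectCols);
        BranchSlice udtfSlice =
            slice(outputSchema, columnExprMap, numSelectCols, outputSchema.size());

        System.out.println("SELECT slice: " + selectSlice.exprMap);
        System.out.println("UDTF slice:   " + udtfSlice.exprMap);
    }
}
```

Each slice would then be resolved only against its own parent's statistics, so a parent name such as {{_col0}} can no longer be looked up in the wrong branch.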
was:
*Symptom:* When a query contains multiple {{LATERAL VIEW}}s (such as nested
{{posexplode}}s), the CBO cardinality estimation could be severely
underestimated for downstream operators (like {{Group By}}). This loss of
statistical accuracy leads to suboptimal execution plans, poor join choices,
and potential resource starvation during execution.
*Root Cause:* The bug lies in {{LateralViewJoinStatsRule.process()}}. When
merging statistics, the rule passes the global {{columnExprMap}} to
{{StatsUtils.getColStatisticsFromExprMap}} to evaluate the UDTF branch.
Because the UDTF branch is built in isolation, its internal column generator
restarts at 0, producing names like {{_col0}} and {{_col1}}. This creates a
namespace collision with the base table's internal columns (which also use
{{_col0}}, etc.). The utility method blindly matches these keys, which
*causes the CBO to combine the statistics of completely unrelated columns*
(e.g., merging the base table's {{id}} column with the UDTF's exploded array
column). As a result, the UDTF's empty or zeroed statistics silently overwrite
the base table's healthy statistics.
*Proposed Fix:* Enforce strict parent operator boundaries before mapping
statistics. By slicing both the {{RowSchema}} and the {{columnExprMap}} into
isolated collections based on the {{SELECT_TAG}} and {{UDTF_TAG}} boundaries,
we create a firewall. {{StatsUtils}} will now only evaluate expressions that
mathematically belong to that specific branch, preventing the namespace
collision entirely.
> CBO: LateralViewJoinStatsRule unintentionally combines base
> table and UDTF column stats
> -------------------------------------------------------------------------------------------------------
>
> Key: HIVE-29473
> URL: https://issues.apache.org/jira/browse/HIVE-29473
> Project: Hive
> Issue Type: Bug
> Components: CBO
> Reporter: Konstantin Bereznyakov
> Assignee: Konstantin Bereznyakov
> Priority: Major
> Labels: pull-request-available
> Attachments: FIXED lateral_view_nested_stats_bug.q.out,
> lateral_view_nested_stats_bug.q, lateral_view_nested_stats_bug.q.out
>