[ 
https://issues.apache.org/jira/browse/HIVE-29473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Bereznyakov updated HIVE-29473:
------------------------------------------
    Description: 
*Symptom:* When a query contains a {{LATERAL VIEW}} operation, the Cost-Based 
Optimizer (CBO) can generate inaccurate cardinality and data size estimates for 
downstream operators (such as {{Group By}}). The error typically shows up as 
artificially inflated row counts and data sizes, which can lead to suboptimal 
execution plans, poor join strategy selections, and inefficient resource 
allocation during query execution.

*Root Cause:* The bug lies in {{LateralViewJoinStatsRule.process()}}. 
Specifically, the rule passes the same {{columnExprMap}} and the full 
{{RowSchema}} when evaluating column statistics for both parent branches (the 
{{SELECT}} branch and the UDTF branch).

Because the UDTF branch is compiled in isolation, its internal column generator 
restarts at 0, so it produces the same internal names as the {{SELECT}} branch 
({{_col0}}, {{_col1}}, etc.). When the UDTF branch is evaluated, the utility 
method {{StatsUtils.getColStatisticsFromExprMap}} incorrectly matches the 
UDTF's statistics against the {{SELECT}} branch's identically named columns. 
This direct namespace collision *causes the CBO to combine the statistics of 
completely unrelated columns* (e.g., combining a base table's string key with a 
UDTF's exploded array column). Because the underlying merge algorithm applies 
maximum-value semantics to overlapping keys, a generated UDTF column with a 
larger NDV or {{avgColLen}} silently overwrites the base table's true metrics, 
artificially inflating the downstream cardinality and data size estimates.
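
The collision can be illustrated with a minimal Java sketch. It is not Hive 
code: the {{ColStat}} class, the {{mergeByName}} helper, and the sample numbers 
are hypothetical stand-ins, and the max-per-metric merge is only a simplified 
reading of the behaviour described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model for illustration; these are not Hive's classes.
final class ColStatsCollisionSketch {

    /** Simplified stand-in for per-column statistics. */
    static final class ColStat {
        final long ndv;          // number of distinct values
        final double avgColLen;  // average column length

        ColStat(long ndv, double avgColLen) {
            this.ndv = ndv;
            this.avgColLen = avgColLen;
        }
    }

    /** Merges two name-keyed stat maps, keeping the larger value per metric when names clash. */
    static Map<String, ColStat> mergeByName(Map<String, ColStat> select,
                                            Map<String, ColStat> udtf) {
        Map<String, ColStat> merged = new HashMap<>(select);
        udtf.forEach((name, stat) -> merged.merge(name, stat, (a, b) ->
                new ColStat(Math.max(a.ndv, b.ndv), Math.max(a.avgColLen, b.avgColLen))));
        return merged;
    }

    public static void main(String[] args) {
        // Both branches number their internal columns from 0, so both own a "_col0".
        Map<String, ColStat> selectBranch = new HashMap<>();
        selectBranch.put("_col0", new ColStat(10, 4.0));    // base table key (made-up numbers)

        Map<String, ColStat> udtfBranch = new HashMap<>();
        udtfBranch.put("_col0", new ColStat(1000, 32.0));   // exploded array element (made-up numbers)

        ColStat clash = mergeByName(selectBranch, udtfBranch).get("_col0");
        System.out.println(clash.ndv + " " + clash.avgColLen); // prints "1000 32.0"
    }
}
```

In this sketch the base table key's NDV of 10 is replaced by the unrelated UDTF 
column's 1000 purely because the two columns share the name {{_col0}}.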

*Proposed Fix:* Enforce strict parent operator boundaries before mapping 
statistics. By slicing both the {{RowSchema}} and the {{columnExprMap}} into 
isolated collections based on the {{SELECT_TAG}} and {{UDTF_TAG}} boundaries, 
we establish strict namespace isolation. For each branch, {{StatsUtils}} will 
then evaluate only the expressions that belong to that branch, preventing the 
cross-branch namespace collision entirely.
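
A rough Java sketch of the slicing idea follows. It is not the actual patch: 
the {{sliceExprMap}} helper, the tag values, the string-valued expression map, 
and the assumption that the join's output schema lists the {{SELECT}} branch's 
columns first and the UDTF's columns after them are all hypothetical 
simplifications.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-branch slicing; names and layout are assumptions, not Hive's API.
final class BranchSliceSketch {
    static final int SELECT_TAG = 0; // assumed tag value for the SELECT parent
    static final int UDTF_TAG = 1;   // assumed tag value for the UDTF parent

    /**
     * Returns only the output-column -> parent-expression entries that belong to one branch,
     * assuming the first numSelectCols output columns come from the SELECT parent.
     */
    static Map<String, String> sliceExprMap(List<String> outputCols,
                                            Map<String, String> exprMap,
                                            int numSelectCols,
                                            int tag) {
        int from = (tag == SELECT_TAG) ? 0 : numSelectCols;
        int to = (tag == SELECT_TAG) ? numSelectCols : outputCols.size();
        Map<String, String> slice = new LinkedHashMap<>();
        for (String out : outputCols.subList(from, to)) {
            String expr = exprMap.get(out);
            if (expr != null) {
                slice.put(out, expr);
            }
        }
        return slice;
    }

    public static void main(String[] args) {
        // Join output columns: two from the SELECT branch, then one from the UDTF branch.
        List<String> outputCols = List.of("_col0", "_col1", "_col2");
        // Output column -> parent column name; both parents expose a "_col0".
        Map<String, String> exprMap =
                Map.of("_col0", "_col0", "_col1", "_col1", "_col2", "_col0");

        System.out.println(sliceExprMap(outputCols, exprMap, 2, SELECT_TAG)); // prints "{_col0=_col0, _col1=_col1}"
        System.out.println(sliceExprMap(outputCols, exprMap, 2, UDTF_TAG));   // prints "{_col2=_col0}"
    }
}
```

With the map sliced this way, a lookup against the UDTF parent's statistics 
would only ever see {{_col2}}, so the {{SELECT}} branch's {{_col0}} entry 
cannot be matched against the UDTF's {{_col0}}.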

  was:
*Symptom:* When a query contains multiple {{LATERAL VIEW}} clauses (such as 
nested {{posexplode}} calls), the CBO could severely underestimate cardinality 
for downstream operators (like {{Group By}}). This loss of 
statistical accuracy leads to suboptimal execution plans, poor join choices, 
and potential resource starvation during execution.

*Root Cause:* The bug lies in {{LateralViewJoinStatsRule.process()}}. When 
merging statistics, the rule passes the global {{columnExprMap}} to 
{{StatsUtils.getColStatisticsFromExprMap}} to evaluate the UDTF branch.

Because the UDTF branch is built in isolation, its internal column generator 
restarts at 0, producing names like {{_col0}} and {{_col1}}. This creates a 
namespace collision with the base table's internal columns (which also use 
{{_col0}}, etc.). The utility method blindly matches these keys, which 
*causes the CBO to combine the statistics of completely unrelated columns* 
(e.g., merging the base table's {{id}} column with the UDTF's exploded array 
column). As a result, the UDTF's empty or zeroed statistics silently overwrite 
the base table's healthy statistics.

*Proposed Fix:* Enforce strict parent operator boundaries before mapping 
statistics. By slicing both the {{RowSchema}} and the {{columnExprMap}} into 
isolated collections based on the {{SELECT_TAG}} and {{UDTF_TAG}} boundaries, 
we create a firewall. {{StatsUtils}} will now only evaluate expressions that 
mathematically belong to that specific branch, preventing the namespace 
collision entirely.


> CBO: LateralViewJoinStatsRule unintentionally combines base 
> table and UDTF column stats
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29473
>                 URL: https://issues.apache.org/jira/browse/HIVE-29473
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>            Reporter: Konstantin Bereznyakov
>            Assignee: Konstantin Bereznyakov
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: FIXED lateral_view_nested_stats_bug.q.out, 
> lateral_view_nested_stats_bug.q, lateral_view_nested_stats_bug.q.out
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
