Konstantin Bereznyakov created HIVE-29473:
---------------------------------------------
Summary: LateralViewJoinStatsRule combines stats of unrelated
columns on 2+ LV queries, corrupting CBO estimates
Key: HIVE-29473
URL: https://issues.apache.org/jira/browse/HIVE-29473
Project: Hive
Issue Type: Bug
Components: CBO
Reporter: Konstantin Bereznyakov
Assignee: Konstantin Bereznyakov
*Symptom:* When a query contains multiple {{LATERAL VIEW}}s (such as nested
{{posexplode}}s), the CBO can severely underestimate cardinality for
downstream operators (such as {{Group By}}). The inaccurate statistics lead
to suboptimal execution plans, poor join choices, and potential resource
starvation during execution.
*Root Cause:* The bug lies in {{LateralViewJoinStatsRule.process()}}. When
merging statistics, the rule passes the global {{columnExprMap}} to
{{StatsUtils.getColStatisticsFromExprMap}} to evaluate the UDTF branch.
Because the UDTF branch is built in isolation, its internal column generator
restarts at 0, producing names like {{_col0}} and {{_col1}}. These names
collide with the base table's internal columns (which also use {{_col0}},
etc.). The utility method matches these keys by name alone, which
*causes the CBO to combine the statistics of completely unrelated columns*
(e.g., merging the base table's {{id}} column with the UDTF's exploded array
column). As a result, the UDTF's empty or zeroed statistics silently overwrite
the base table's healthy statistics.
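The collision described above can be reproduced in a simplified model. The sketch below does not use Hive's classes: {{resolve}}, the readable output names ({{id}}, {{exploded}}), and the use of a single distinct-value count as a stand-in for column statistics are all assumptions made for illustration. It only shows how a name-only lookup against the UDTF branch's statistics also captures a column that belongs to the SELECT branch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LateralViewCollisionSketch {

    // Hypothetical stand-in for the name-based lookup: for every output
    // column, fetch the stats of the internal name it references from
    // whichever parent's stats were supplied. No branch check is made.
    static Map<String, Long> resolve(Map<String, String> exprMap,
                                     Map<String, Long> parentStats) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : exprMap.entrySet()) {
            Long ndv = parentStats.get(e.getValue());
            if (ndv != null) {
                out.put(e.getKey(), ndv);
            }
        }
        return out;
    }

    // Builds the colliding scenario and returns what the lookup produces.
    static Map<String, Long> demo() {
        // Global map: output column -> internal name in its own parent.
        // Both branches number their columns from _col0.
        Map<String, String> globalExprMap = new LinkedHashMap<>();
        globalExprMap.put("id", "_col0");        // from the SELECT branch
        globalExprMap.put("exploded", "_col0");  // from the UDTF branch

        // Stats of the UDTF branch only (zeroed, as in the report).
        Map<String, Long> udtfStats = new LinkedHashMap<>();
        udtfStats.put("_col0", 0L);

        // Passing the global map lets "id" pick up the UDTF's _col0 stats.
        return resolve(globalExprMap, udtfStats);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {id=0, exploded=0}
    }
}
```

In this model {{id}} ends up with a distinct-value count of 0 even though it never came from the UDTF branch, which mirrors the overwrite described in the root cause.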
*Proposed Fix:* Enforce strict parent operator boundaries before mapping
statistics. Slicing both the {{RowSchema}} and the {{columnExprMap}} into
isolated collections based on the {{SELECT_TAG}} and {{UDTF_TAG}} boundaries
keeps the two branches separate. {{StatsUtils}} will then only evaluate
expressions that actually belong to the branch being processed, preventing
the namespace collision entirely.
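A minimal sketch of the slicing idea, again in a simplified model rather than Hive's actual code. It assumes that the first {{numSelectCols}} entries of the expression map belong to the SELECT branch and the rest to the UDTF branch; the helper names ({{slice}}, {{resolve}}, {{isolatedStats}}) and the use of a distinct-value count as the only statistic are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BranchIsolationSketch {

    // Returns the entries at positions [from, to) of an ordered map.
    static Map<String, String> slice(Map<String, String> exprMap, int from, int to) {
        Map<String, String> out = new LinkedHashMap<>();
        int i = 0;
        for (Map.Entry<String, String> e : exprMap.entrySet()) {
            if (i >= from && i < to) {
                out.put(e.getKey(), e.getValue());
            }
            i++;
        }
        return out;
    }

    // Name-based lookup, now only ever called with one branch's entries.
    static Map<String, Long> resolve(Map<String, String> exprMap,
                                     Map<String, Long> parentStats) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : exprMap.entrySet()) {
            Long ndv = parentStats.get(e.getValue());
            if (ndv != null) {
                out.put(e.getKey(), ndv);
            }
        }
        return out;
    }

    // Resolves each slice against its own parent's stats, then merges.
    static Map<String, Long> isolatedStats(Map<String, String> exprMap,
                                           int numSelectCols,
                                           Map<String, Long> selectStats,
                                           Map<String, Long> udtfStats) {
        Map<String, Long> out = new LinkedHashMap<>(
            resolve(slice(exprMap, 0, numSelectCols), selectStats));
        out.putAll(resolve(slice(exprMap, numSelectCols, exprMap.size()), udtfStats));
        return out;
    }

    static Map<String, Long> demo() {
        Map<String, String> exprMap = new LinkedHashMap<>();
        exprMap.put("id", "_col0");        // SELECT branch
        exprMap.put("name", "_col1");      // SELECT branch
        exprMap.put("exploded", "_col0");  // UDTF branch, same internal name

        Map<String, Long> selectStats = new LinkedHashMap<>();
        selectStats.put("_col0", 1000L);
        selectStats.put("_col1", 950L);

        Map<String, Long> udtfStats = new LinkedHashMap<>();
        udtfStats.put("_col0", 0L);

        return isolatedStats(exprMap, 2, selectStats, udtfStats);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {id=1000, name=950, exploded=0}
    }
}
```

Because each slice is only matched against its own parent's statistics, the shared internal name {{_col0}} no longer causes {{id}} to inherit the UDTF branch's zeroed value.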