[
https://issues.apache.org/jira/browse/SPARK-57805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk updated SPARK-57805:
-----------------------------
Affects Version/s: 4.3.0
(was: 5.0.0)
> CBO filter/join estimation throws MatchError for TimestampNTZ, ANSI interval,
> and TIME columns
> ----------------------------------------------------------------------------------------------
>
> Key: SPARK-57805
> URL: https://issues.apache.org/jira/browse/SPARK-57805
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Priority: Major
>
> h3. Problem
> When cost-based optimization is enabled (spark.sql.cbo.enabled=true) and column
> statistics have been collected, a filter/equality/IN predicate or an equi-join
> on a column of type TIMESTAMP_NTZ, an ANSI interval (YEAR MONTH / DAY TIME), or
> TIME throws {{scala.MatchError}} and crashes query optimization.
> The optimizer's statistics-estimation layer only enumerates
> NUMERIC / DATE / TIMESTAMP / BOOLEAN. These three type families are collectable
> (their catalog->plan stats conversion is implemented) but are not handled on the
> consumption side, so once their stats exist they reach an unhandled branch:
> * {{EstimationUtils.toDouble}} / {{fromDouble}} - match only
> {{_: NumericType | DateType | TimestampType}} (+ {{BooleanType}}).
> * {{FilterEstimation.evaluateBinary}} (non-exhaustive), {{evaluateEquality}}
> (via {{ValueInterval}} -> {{toDouble}}), and {{evaluateInSet}}.
> * {{JoinEstimation}} - {{ValueInterval(leftKeyStat.min, ...)}} on join keys,
> no type guard.
> The error is not swallowed: {{BasicStatsPlanVisitor.visitFilter}} uses
> {{FilterEstimation(p).estimate.getOrElse(fallback(p))}}, which rescues a {{None}}
> result, not a thrown exception - so the {{MatchError}} propagates out.
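
The propagation described above can be reproduced without Spark. This is a minimal sketch, not Spark code: `estimate`, `fallback`, and the string type tags are hypothetical stand-ins for `FilterEstimation(p).estimate` and `fallback(p)`. It only illustrates that `getOrElse` rescues a `None` but never runs when the match itself throws.

```scala
import scala.util.Try

object GetOrElseDemo {
  // Hypothetical stand-in for an estimator whose match is not exhaustive.
  def estimate(typeName: String): Option[Double] = typeName match {
    case "numeric" => Some(0.5) // supported: produces an estimate
    case "binary"  => None      // known-unsupported: caller falls back
    // There is no case for "time", so that input throws scala.MatchError.
  }

  // Hypothetical fallback value used when no estimate is available.
  val fallback: Double = 1.0

  def main(args: Array[String]): Unit = {
    // A None result is rescued by getOrElse.
    assert(estimate("binary").getOrElse(fallback) == 1.0)

    // A thrown MatchError is not rescued: getOrElse is never reached.
    val result = Try(estimate("time").getOrElse(fallback))
    assert(result.isFailure)
    assert(result.failed.get.isInstanceOf[MatchError])
  }
}
```

In this sketch the only ways to avoid the crash are to add the missing case or to guard the call site; relying on the `Option` fallback is not enough.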
> h3. Affected types
> * TIMESTAMP_NTZ - collection shipped in SPARK-42777, which touched only
> {{CatalogColumnStat}} + a collection test and never the estimation side.
> Latent since ~3.4.
> * ANSI intervals (YearMonth / DayTime) - same gap.
> * TIME - newly reachable via SPARK-54582 (PR apache/spark#53312), which adds
> TIME statistics collection.
> Contrast: {{UnionEstimation.isTypeSupported}} *was* extended per-type for
> TIMESTAMP_NTZ and ANSI intervals in SPARK-37468 (it uses
> {{PhysicalDataType.ordering}}, so it degrades gracefully rather than crashing) -
> but TIME is still missing there too and should be added.
> h3. Reproduction (unit-test form, FilterEstimationSuite)
> {code:scala}
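> import java.time.LocalTime
> import org.apache.spark.sql.catalyst.util.DateTimeUtils
>
> val tMin = DateTimeUtils.localTimeToNanos(LocalTime.parse("08:00:00"))
> val t12 = DateTimeUtils.localTimeToNanos(LocalTime.parse("12:00:00"))
> // tMax is referenced below but not defined in the original report;
> // 18:00:00 is an assumed upper bound, chosen only so that min < t12 < max.
> val tMax = DateTimeUtils.localTimeToNanos(LocalTime.parse("18:00:00"))
> val attrTime = AttributeReference("ctime", TimeType(6))()
> val colStatTime = ColumnStat(distinctCount = Some(10), min = Some(tMin), max = Some(tMax),
>   nullCount = Some(0), avgLen = Some(8), maxLen = Some(8))
> // Under CBO this reaches FilterEstimation -> EstimationUtils.toDouble -> scala.MatchError.
> Filter(LessThan(attrTime, Literal(t12, TimeType(6))),
>   childStatsTestPlan(Seq(attrTime), 10L, AttributeMap(Seq(attrTime -> colStatTime)))).stats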
> {code}
> h3. Proposed fix
> Extend the numeric-value contract to cover all three families at once:
> * {{EstimationUtils.toDouble}}: add
> {{TimestampNTZType | _: TimeType | _: AnsiIntervalType}} to the
> {{value.toString.toDouble}} branch (their internal values are already numeric
> Long/Int).
> * {{EstimationUtils.fromDouble}}: add
> {{TimestampNTZType | _: TimeType | _: DayTimeIntervalType => double.toLong}}
> and {{_: YearMonthIntervalType => double.toInt}}.
> * {{FilterEstimation.evaluateBinary}} and {{evaluateInSet}}: add the same types
> to the numeric branch.
> * {{UnionEstimation.isTypeSupported}}: add {{_: TimeType}} (TIMESTAMP_NTZ /
> intervals already present).
> * Add {{FilterEstimationSuite}} / {{JoinEstimationSuite}} cases for TIME,
> TIMESTAMP_NTZ, and an interval type.
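
The shape of the `toDouble` / `fromDouble` change listed above can be sketched in isolation. The `DataType` hierarchy below is a simplified stand-in defined only for this example (an assumption, not Spark's classes), and `LongType` stands in for the whole `NumericType` family. The real change would edit the existing matches in `EstimationUtils` and may differ in detail.

```scala
object ToDoubleSketch {
  // Simplified stand-ins for Spark's data types (illustration only).
  sealed trait DataType
  case object LongType extends DataType // stands in for NumericType
  case object DateType extends DataType
  case object TimestampType extends DataType
  case object TimestampNTZType extends DataType
  final case class TimeType(precision: Int) extends DataType
  case object DayTimeIntervalType extends DataType
  case object YearMonthIntervalType extends DataType

  // Every type here stores an Int or Long internally, so one branch suffices.
  def toDouble(value: Any, dt: DataType): Double = dt match {
    case LongType | DateType | TimestampType => value.toString.toDouble
    // Types the proposal adds to the numeric branch:
    case TimestampNTZType | _: TimeType | DayTimeIntervalType | YearMonthIntervalType =>
      value.toString.toDouble
  }

  // Convert back to the internal representation: Int-backed or Long-backed.
  def fromDouble(d: Double, dt: DataType): Any = dt match {
    case DateType | YearMonthIntervalType => d.toInt
    case LongType | TimestampType | TimestampNTZType | _: TimeType | DayTimeIntervalType =>
      d.toLong
  }

  def main(args: Array[String]): Unit = {
    val eightAm = 8L * 3600L * 1000000000L // 08:00:00 as nanoseconds of the day
    // Round trip for a Long-backed type.
    assert(fromDouble(toDouble(eightAm, TimeType(6)), TimeType(6)) == eightAm)
    // Round trip for an Int-backed type (14 months).
    assert(fromDouble(toDouble(14, YearMonthIntervalType), YearMonthIntervalType) == 14)
  }
}
```

Because the stand-in trait is sealed and both matches cover every case, the compiler can check exhaustiveness here; Spark's own `DataType` is open, which is why the gap in the report only surfaces at runtime.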
> h3. Notes
> CBO is off by default, so this is latent, but it is a hard crash once enabled
> with stats collected on one of these columns.