[
https://issues.apache.org/jira/browse/SPARK-57805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk updated SPARK-57805:
-----------------------------
Description:
h3. Problem
When cost-based optimization is enabled (spark.sql.cbo.enabled=true) and column
statistics have been collected, a filter predicate (comparison, equality, or IN)
or an equi-join on a column of type TIMESTAMP_NTZ, an ANSI interval
(YEAR MONTH / DAY TIME), or TIME throws {{scala.MatchError}} and crashes
query optimization.
The optimizer's statistics-estimation layer only enumerates
NUMERIC / DATE / TIMESTAMP / BOOLEAN. The TIMESTAMP_NTZ, ANSI interval, and TIME
type families are collectable (their catalog->plan stats conversion is
implemented) but are not handled on the consumption side, so once their stats
exist they reach an unhandled branch:
* {{EstimationUtils.toDouble}} / {{fromDouble}} - match only {{_: NumericType | DateType | TimestampType}} (+ {{BooleanType}}).
* {{FilterEstimation.evaluateBinary}} (non-exhaustive), {{evaluateEquality}} (via {{ValueInterval}} -> {{toDouble}}), and {{evaluateInSet}}.
* {{JoinEstimation}} - {{ValueInterval(leftKeyStat.min, ...)}} on join keys, no type guard.
The error is not swallowed: {{BasicStatsPlanVisitor.visitFilter}} uses
{{FilterEstimation(p).estimate.getOrElse(fallback(p))}}, which rescues a
{{None}} result, not a thrown exception, so the {{MatchError}} propagates out.
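The difference between a {{None}} result and a thrown exception can be illustrated outside Spark. The following is a minimal stand-alone sketch, not Spark code: {{ColType}}, {{estimate}}, and {{selectivity}} are hypothetical names that only mimic the shape of the call above.

```scala
import scala.util.Try

object FallbackSketch {
  // Hypothetical stand-ins for column types; these are not Spark classes.
  trait ColType
  case object IntCol extends ColType
  case object StringCol extends ColType
  case object TimeCol extends ColType

  // Mimics an estimator whose match does not cover every type.
  def estimate(t: ColType): Option[Double] = t match {
    case IntCol    => Some(0.25) // supported: an estimate is produced
    case StringCol => None       // unsupported but handled: the caller falls back
    // TimeCol is not listed, so matching it throws scala.MatchError
  }

  // Same shape as estimate.getOrElse(fallback): only None is rescued.
  def selectivity(t: ColType): Double = estimate(t).getOrElse(1.0)

  def main(args: Array[String]): Unit = {
    println(selectivity(IntCol))       // estimate is used
    println(selectivity(StringCol))    // fallback is used
    println(Try(selectivity(TimeCol))) // a Failure wrapping scala.MatchError
  }
}
```

In this model the fallback value is only reached for {{StringCol}}; for {{TimeCol}} the exception is raised inside {{estimate}} before {{getOrElse}} is ever evaluated.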
h3. Affected types
* TIMESTAMP_NTZ - collection shipped in SPARK-42777, which touched only {{CatalogColumnStat}} and a collection test, never the estimation side. Latent since ~3.4.
* ANSI intervals (YearMonth / DayTime) - same gap.
* TIME - newly reachable via SPARK-54582 (PR apache/spark#53312), which adds TIME statistics collection.
Contrast: {{UnionEstimation.isTypeSupported}} *was* extended per-type for
TIMESTAMP_NTZ and ANSI intervals in SPARK-37468 (it uses
{{PhysicalDataType.ordering}}, so it degrades gracefully rather than crashing),
but TIME is still missing there too and should be added.
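As a rough illustration of what "degrades gracefully" means here, the sketch below models an allow-list guard in front of a min/max merge. It is a simplified stand-alone model under stated assumptions: {{SketchType}}, {{mergeRange}}, and the Long-valued ranges are hypothetical, and only the name {{isTypeSupported}} is borrowed from the report.

```scala
object UnionGuardSketch {
  // Hypothetical stand-ins for data types; these are not Spark classes.
  trait SketchType
  case object TimestampNTZT extends SketchType
  case object DayTimeIntervalT extends SketchType
  case object TimeT extends SketchType

  // Allow-list guard: any type not listed is reported as unsupported.
  def isTypeSupported(dt: SketchType): Boolean = dt match {
    case TimestampNTZT | DayTimeIntervalT => true
    case _                                => false // TimeT lands here today
  }

  // Merges two (min, max) ranges, or yields None when the type is not allowed.
  def mergeRange(
      dt: SketchType,
      a: (Long, Long),
      b: (Long, Long)): Option[(Long, Long)] =
    if (isTypeSupported(dt)) Some((math.min(a._1, b._1), math.max(a._2, b._2)))
    else None
}
```

With a guard of this shape, a missing type costs only the min/max statistics for that column instead of failing the query, which is why adding TIME there is an improvement rather than a crash fix.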
h3. Reproduction (unit-test form, FilterEstimationSuite)
{code:scala}
import java.time.LocalTime
import org.apache.spark.sql.catalyst.util.DateTimeUtils

val tMin = DateTimeUtils.localTimeToNanos(LocalTime.parse("08:00:00"))
val t12 = DateTimeUtils.localTimeToNanos(LocalTime.parse("12:00:00"))
// Assumed upper bound: the original snippet references tMax without defining it.
val tMax = DateTimeUtils.localTimeToNanos(LocalTime.parse("18:00:00"))
val attrTime = AttributeReference("ctime", TimeType(6))()
val colStatTime = ColumnStat(distinctCount = Some(10), min = Some(tMin), max = Some(tMax),
  nullCount = Some(0), avgLen = Some(8), maxLen = Some(8))
// Under CBO this reaches FilterEstimation -> EstimationUtils.toDouble -> scala.MatchError.
Filter(LessThan(attrTime, Literal(t12, TimeType(6))),
  childStatsTestPlan(Seq(attrTime), 10L, AttributeMap(Seq(attrTime -> colStatTime)))).stats
{code}
{code}
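For orientation, once TIME values are convertible to doubles, a range predicate such as the one above could be estimated by linear interpolation between the column's min and max. The sketch below is a stand-alone model of that idea under a uniform-distribution assumption; it is not Spark's estimator, {{lessThanSelectivity}} is a hypothetical helper, and the 18:00:00 upper bound is an assumed value.

```scala
import java.time.LocalTime

object RangeSelectivitySketch {
  // Nanoseconds since midnight as a Double, like a TIME value passed through toDouble.
  private def nanos(s: String): Double = LocalTime.parse(s).toNanoOfDay.toDouble

  // Fraction of [min, max] lying below the literal, assuming uniformly spread values.
  def lessThanSelectivity(min: Double, max: Double, literal: Double): Double =
    if (literal <= min) 0.0
    else if (literal > max) 1.0
    else (literal - min) / (max - min)

  // min = 08:00, assumed max = 18:00, literal = 12:00: 4 of 10 hours lie below the literal.
  val example: Double =
    lessThanSelectivity(nanos("08:00:00"), nanos("18:00:00"), nanos("12:00:00"))

  def main(args: Array[String]): Unit = println(example)
}
```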
h3. Proposed fix
Extend the numeric-value contract to cover all three families at once:
* {{EstimationUtils.toDouble}}: add {{TimestampNTZType | _: TimeType | _: AnsiIntervalType}} to the {{value.toString.toDouble}} branch (their internal values are already numeric Long/Int).
* {{EstimationUtils.fromDouble}}: add {{TimestampNTZType | _: TimeType | _: DayTimeIntervalType => double.toLong}} and {{_: YearMonthIntervalType => double.toInt}}.
* {{FilterEstimation.evaluateBinary}} and {{evaluateInSet}}: add the same types to the numeric branch.
* {{UnionEstimation.isTypeSupported}}: add {{_: TimeType}} (TIMESTAMP_NTZ / intervals already present).
* Add {{FilterEstimationSuite}} / {{JoinEstimationSuite}} cases for TIME, TIMESTAMP_NTZ, and an interval type.
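The shape of the first two items can be sketched with a simplified type model. This is an illustration under stated assumptions, not the actual patch: the {{SketchType}} hierarchy is a hypothetical stand-in for Spark's {{DataType}} classes, and only the branch structure follows the proposal above.

```scala
object NumericContractSketch {
  // Hypothetical stand-ins for Spark data types; these are not the real classes.
  trait SketchType
  case object IntegerT extends SketchType
  case object DateT extends SketchType
  case object TimestampT extends SketchType
  case object BooleanT extends SketchType
  case object TimestampNTZT extends SketchType
  case class TimeT(precision: Int) extends SketchType
  case object DayTimeIntervalT extends SketchType
  case object YearMonthIntervalT extends SketchType

  def toDouble(value: Any, dt: SketchType): Double = dt match {
    case IntegerT | DateT | TimestampT => value.toString.toDouble
    // Proposed addition: internal values are already Long or Int.
    case TimestampNTZT | _: TimeT | DayTimeIntervalT | YearMonthIntervalT =>
      value.toString.toDouble
    case BooleanT => if (value.asInstanceOf[Boolean]) 1.0 else 0.0
  }

  def fromDouble(d: Double, dt: SketchType): Any = dt match {
    case IntegerT | DateT => d.toInt
    case TimestampT       => d.toLong
    case BooleanT         => d.toInt == 1
    // Proposed additions: Long-backed and Int-backed families respectively.
    case TimestampNTZT | _: TimeT | DayTimeIntervalT => d.toLong
    case YearMonthIntervalT                          => d.toInt
  }
}
```

Round-tripping a value through both functions, for example a TIME value in nanoseconds since midnight, should return the original Long as long as it is exactly representable as a Double.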
h3. Notes
CBO is off by default, so this is latent, but it is a hard crash once enabled
with stats collected on one of these columns.
> CBO filter/join estimation throws MatchError for TimestampNTZ, ANSI interval,
> and TIME columns
> ----------------------------------------------------------------------------------------------
>
> Key: SPARK-57805
> URL: https://issues.apache.org/jira/browse/SPARK-57805
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 5.0.0
> Reporter: Max Gekk
> Priority: Major