[ 
https://issues.apache.org/jira/browse/SPARK-57805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-57805:
-----------------------------
    Description: 
h3. Problem

When cost-based optimization is enabled (spark.sql.cbo.enabled=true) and column
statistics have been collected, a filter/equality/IN predicate or an equi-join 
on a
column of type TIMESTAMP_NTZ, an ANSI interval (YEAR MONTH / DAY TIME), or TIME 
throws
{{scala.MatchError}} and crashes query optimization.

The optimizer's statistics-estimation layer only enumerates
NUMERIC / DATE / TIMESTAMP / BOOLEAN. These three type families are collectable 
(their
catalog->plan stats conversion is implemented) but are not handled on the 
consumption
side, so once their stats exist they reach an unhandled branch:

* {{EstimationUtils.toDouble}} / {{fromDouble}} - matches only
  {{_: NumericType | DateType | TimestampType}} (+ {{BooleanType}}).
* {{FilterEstimation.evaluateBinary}} (non-exhaustive), {{evaluateEquality}} 
(via
  {{ValueInterval}} -> {{toDouble}}), and {{evaluateInSet}}.
* {{JoinEstimation}} - {{ValueInterval(leftKeyStat.min, ...)}} on join keys, no 
type guard.

The error is not swallowed: {{BasicStatsPlanVisitor.visitFilter}} uses
{{FilterEstimation(p).estimate.getOrElse(fallback(p))}}, which rescues a 
{{None}} result,
not a thrown exception - so the {{MatchError}} propagates out.

h3. Affected types

* TIMESTAMP_NTZ - collection shipped in SPARK-42777, which touched only
  {{CatalogColumnStat}} + a collection test and never the estimation side. 
Latent since ~3.4.
* ANSI intervals (YearMonth / DayTime) - same gap.
* TIME - newly reachable via SPARK-54582 (PR apache/spark#53312), which adds 
TIME statistics
  collection.

Contrast: {{UnionEstimation.isTypeSupported}} *was* extended per-type for 
TIMESTAMP_NTZ and
ANSI intervals in SPARK-37468 (it uses {{PhysicalDataType.ordering}}, so it 
degrades gracefully
rather than crashing) - but TIME is still missing there too and should be added.

h3. Reproduction (unit-test form, FilterEstimationSuite)

{code:scala}
val tMin = DateTimeUtils.localTimeToNanos(LocalTime.parse("08:00:00"))
val t12  = DateTimeUtils.localTimeToNanos(LocalTime.parse("12:00:00"))
val attrTime = AttributeReference("ctime", TimeType(6))()
val colStatTime = ColumnStat(distinctCount = Some(10), min = Some(tMin), max = 
Some(tMax),
  nullCount = Some(0), avgLen = Some(8), maxLen = Some(8))
// Under CBO this reaches FilterEstimation -> EstimationUtils.toDouble -> 
scala.MatchError.
Filter(LessThan(attrTime, Literal(t12, TimeType(6))),
  childStatsTestPlan(Seq(attrTime), 10L, AttributeMap(Seq(attrTime -> 
colStatTime)))).stats
{code}

h3. Proposed fix

Extend the numeric-value contract to cover all three families at once:

* {{EstimationUtils.toDouble}}: add {{TimestampNTZType | _: TimeType | _: 
AnsiIntervalType}} to the
  {{value.toString.toDouble}} branch (their internal values are already numeric 
Long/Int).
* {{EstimationUtils.fromDouble}}: add {{TimestampNTZType | _: TimeType | _: 
DayTimeIntervalType => double.toLong}}
  and {{_: YearMonthIntervalType => double.toInt}}.
* {{FilterEstimation.evaluateBinary}} and {{evaluateInSet}}: add the same types 
to the numeric branch.
* {{UnionEstimation.isTypeSupported}}: add {{_: TimeType}} (TIMESTAMP_NTZ / 
intervals already present).
* Add {{FilterEstimationSuite}} / {{JoinEstimationSuite}} cases for TIME, 
TIMESTAMP_NTZ, and an interval type.

h3. Notes

CBO is off by default, so this is latent, but it is a hard crash once enabled 
with stats collected on
one of these columns.


> CBO filter/join estimation throws MatchError for TimestampNTZ, ANSI interval, 
> and TIME columns
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-57805
>                 URL: https://issues.apache.org/jira/browse/SPARK-57805
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 5.0.0
>            Reporter: Max Gekk
>            Priority: Major
>
> h3. Problem
> When cost-based optimization is enabled (spark.sql.cbo.enabled=true) and 
> column
> statistics have been collected, a filter/equality/IN predicate or an 
> equi-join on a
> column of type TIMESTAMP_NTZ, an ANSI interval (YEAR MONTH / DAY TIME), or 
> TIME throws
> {{scala.MatchError}} and crashes query optimization.
> The optimizer's statistics-estimation layer only enumerates
> NUMERIC / DATE / TIMESTAMP / BOOLEAN. These three type families are 
> collectable (their
> catalog->plan stats conversion is implemented) but are not handled on the 
> consumption
> side, so once their stats exist they reach an unhandled branch:
> * {{EstimationUtils.toDouble}} / {{fromDouble}} - matches only
>   {{_: NumericType | DateType | TimestampType}} (+ {{BooleanType}}).
> * {{FilterEstimation.evaluateBinary}} (non-exhaustive), {{evaluateEquality}} 
> (via
>   {{ValueInterval}} -> {{toDouble}}), and {{evaluateInSet}}.
> * {{JoinEstimation}} - {{ValueInterval(leftKeyStat.min, ...)}} on join keys, 
> no type guard.
> The error is not swallowed: {{BasicStatsPlanVisitor.visitFilter}} uses
> {{FilterEstimation(p).estimate.getOrElse(fallback(p))}}, which rescues a 
> {{None}} result,
> not a thrown exception - so the {{MatchError}} propagates out.
> h3. Affected types
> * TIMESTAMP_NTZ - collection shipped in SPARK-42777, which touched only
>   {{CatalogColumnStat}} + a collection test and never the estimation side. 
> Latent since ~3.4.
> * ANSI intervals (YearMonth / DayTime) - same gap.
> * TIME - newly reachable via SPARK-54582 (PR apache/spark#53312), which adds 
> TIME statistics
>   collection.
> Contrast: {{UnionEstimation.isTypeSupported}} *was* extended per-type for 
> TIMESTAMP_NTZ and
> ANSI intervals in SPARK-37468 (it uses {{PhysicalDataType.ordering}}, so it 
> degrades gracefully
> rather than crashing) - but TIME is still missing there too and should be 
> added.
> h3. Reproduction (unit-test form, FilterEstimationSuite)
> {code:scala}
> val tMin = DateTimeUtils.localTimeToNanos(LocalTime.parse("08:00:00"))
> val t12  = DateTimeUtils.localTimeToNanos(LocalTime.parse("12:00:00"))
> val attrTime = AttributeReference("ctime", TimeType(6))()
> val colStatTime = ColumnStat(distinctCount = Some(10), min = Some(tMin), max 
> = Some(tMax),
>   nullCount = Some(0), avgLen = Some(8), maxLen = Some(8))
> // Under CBO this reaches FilterEstimation -> EstimationUtils.toDouble -> 
> scala.MatchError.
> Filter(LessThan(attrTime, Literal(t12, TimeType(6))),
>   childStatsTestPlan(Seq(attrTime), 10L, AttributeMap(Seq(attrTime -> 
> colStatTime)))).stats
> {code}
> h3. Proposed fix
> Extend the numeric-value contract to cover all three families at once:
> * {{EstimationUtils.toDouble}}: add {{TimestampNTZType | _: TimeType | _: 
> AnsiIntervalType}} to the
>   {{value.toString.toDouble}} branch (their internal values are already 
> numeric Long/Int).
> * {{EstimationUtils.fromDouble}}: add {{TimestampNTZType | _: TimeType | _: 
> DayTimeIntervalType => double.toLong}}
>   and {{_: YearMonthIntervalType => double.toInt}}.
> * {{FilterEstimation.evaluateBinary}} and {{evaluateInSet}}: add the same 
> types to the numeric branch.
> * {{UnionEstimation.isTypeSupported}}: add {{_: TimeType}} (TIMESTAMP_NTZ / 
> intervals already present).
> * Add {{FilterEstimationSuite}} / {{JoinEstimationSuite}} cases for TIME, 
> TIMESTAMP_NTZ, and an interval type.
> h3. Notes
> CBO is off by default, so this is latent, but it is a hard crash once enabled 
> with stats collected on
> one of these columns.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to