Max Gekk created SPARK-57849:
--------------------------------
Summary: Support the TIME data type in DataFrame approxQuantile,
summary and describe
Key: SPARK-57849
URL: https://issues.apache.org/jira/browse/SPARK-57849
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.3.0
Reporter: Max Gekk
This sub-task is part of the umbrella SPARK-57550 (extend support for the TIME
data type).
h2. Problem
The DataFrame stat APIs do not handle TIME:
{{StatFunctions.multipleApproxQuantiles}} requires {{NumericType}} and casts to
{{DoubleType}} (execution/stat/StatFunctions.scala ~L74-77), and {{summary}}
includes only numeric/string columns (~L194). SQL {{approx_percentile}} already
supports TIME (SPARK-57557), but {{df.stat.approxQuantile}}, {{df.summary()}}
and {{df.describe()}} do not delegate TIME columns to it.
h2. Goal
Allow TIME columns in {{stat.approxQuantile}} (computing on nanos-of-day and
returning typed TIME quantiles) and include TIME in {{summary()}} /
{{describe()}} percentile rows.
h2. Scope
Extend {{multipleApproxQuantiles}} to accept {{TimeType}} (delegate to the
nanos-of-day domain, map results back to TIME); include TIME columns in
{{summary}}/{{describe}} percentile computation.
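A minimal sketch of the nanos-of-day round trip described in this section, in plain Python rather than Spark code. The helper names ({{to_nanos_of_day}}, {{from_nanos_of_day}}, {{time_quantiles}}) are hypothetical, and an exact nearest-rank quantile stands in for the approximate algorithm behind {{multipleApproxQuantiles}}. Python's {{datetime.time}} carries only microsecond precision, so sub-microsecond nanos are truncated in this illustration.

```python
import math
from datetime import time

NANOS_PER_MICRO = 1_000
NANOS_PER_SECOND = 1_000_000_000


def to_nanos_of_day(t: time) -> int:
    """Map a wall-clock time to nanoseconds since midnight."""
    seconds = (t.hour * 60 + t.minute) * 60 + t.second
    return seconds * NANOS_PER_SECOND + t.microsecond * NANOS_PER_MICRO


def from_nanos_of_day(nanos: int) -> time:
    """Map nanoseconds since midnight back to a time value.

    Assumption: truncates to microseconds, the finest unit datetime.time holds.
    """
    micros_total = nanos // NANOS_PER_MICRO
    seconds_total, micro = divmod(micros_total, 1_000_000)
    minutes_total, second = divmod(seconds_total, 60)
    hour, minute = divmod(minutes_total, 60)
    return time(hour, minute, second, micro)


def time_quantiles(values, probabilities):
    """Compute quantiles in the nanos-of-day domain and return typed times.

    Assumption: exact nearest-rank quantiles replace the approximate
    computation; None values are skipped.
    """
    nanos = sorted(to_nanos_of_day(v) for v in values if v is not None)
    if not nanos:
        return []
    result = []
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        # Nearest-rank index: ceil(p * n) - 1, clamped so p = 0 picks the minimum.
        rank = max(math.ceil(p * len(nanos)) - 1, 0)
        result.append(from_nanos_of_day(nanos[rank]))
    return result


if __name__ == "__main__":
    sample = [time(9, 0), time(12, 30), None, time(18, 45, 30), time(6, 15)]
    print(time_quantiles(sample, [0.0, 0.5, 1.0]))
```

Working on an integer domain keeps the quantile logic type-agnostic; only the two conversion helpers know about TIME, which mirrors the "delegate, then map back" shape of the Scope above.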
h2. Acceptance criteria
* {{df.stat.approxQuantile("t", Array(0.5), 0.01)}} works for a TIME column.
* {{df.summary()}} shows TIME percentiles.
h2. Testing
{{DataFrameStatSuite}} / {{StatFunctionsSuite}}.
h2. Dependencies
None. Conceptually related to SPARK-57557 (SQL-level TIME quantiles).