[
https://issues.apache.org/jira/browse/SPARK-57849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk updated SPARK-57849:
-----------------------------
Shepherd: Max Gekk
> Support the TIME data type in DataFrame approxQuantile, summary and describe
> ----------------------------------------------------------------------------
>
> Key: SPARK-57849
> URL: https://issues.apache.org/jira/browse/SPARK-57849
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Max Gekk
> Priority: Major
>
> This sub-task is part of the umbrella SPARK-57550 (extend support for the
> TIME data type).
> h2. Problem
> The DataFrame stat APIs do not handle TIME:
> {{StatFunctions.multipleApproxQuantiles}} requires {{NumericType}} and casts
> to {{DoubleType}} (execution/stat/StatFunctions.scala ~L74-77), and
> {{summary}} includes only numeric/string columns (~L194). SQL
> {{approx_percentile}} already supports TIME (SPARK-57557), but
> {{df.stat.approxQuantile}} / {{df.summary()}} / {{df.describe()}} do not
> delegate TIME columns to it.
> h2. Goal
> Allow TIME columns in {{stat.approxQuantile}} (computing on nanos-of-day and
> returning typed TIME quantiles) and include TIME in {{summary()}} /
> {{describe()}} percentile rows.
> h2. Scope
> Extend {{multipleApproxQuantiles}} to accept {{TimeType}} (delegate to the
> nanos-of-day domain, map results back to TIME); include TIME columns in
> {{summary}}/{{describe}} percentile computation.
> h2. Acceptance criteria
> * {{df.stat.approxQuantile("t", Array(0.5), 0.01)}} works for a TIME column.
> * {{df.summary()}} shows TIME percentiles.
> h2. Testing
> {{DataFrameStatSuite}} / {{StatFunctionsSuite}}.
> h2. Dependencies
> None. Conceptually related to SPARK-57557 (SQL-level TIME quantiles).
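The nanos-of-day round trip described under Scope can be illustrated with a minimal, Spark-free sketch in plain Python. It is not the proposed change to {{StatFunctions}}: the helper names ({{time_to_nanos}}, {{nanos_to_time}}, {{time_quantiles}}) are hypothetical, an exact nearest-rank quantile stands in for Spark's approximate quantile summaries, and Python's {{datetime.time}} has only microsecond resolution, so sub-microsecond digits are dropped when mapping back.

```python
import math
from datetime import time

NANOS_PER_SECOND = 1_000_000_000


def time_to_nanos(t: time) -> int:
    """Map a time of day to nanoseconds since midnight."""
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    return seconds * NANOS_PER_SECOND + t.microsecond * 1_000


def nanos_to_time(nanos: int) -> time:
    """Map nanoseconds since midnight back to a time of day.

    datetime.time stores microseconds, so sub-microsecond digits are dropped.
    """
    micros = nanos // 1_000
    seconds, micro = divmod(micros, 1_000_000)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, second, micro)


def time_quantiles(values, probabilities):
    """Compute quantiles in the nanos-of-day domain and return typed times.

    Assumption: exact nearest-rank quantiles replace the approximate
    algorithm, and None (null) values are skipped.
    """
    nanos = sorted(time_to_nanos(v) for v in values if v is not None)
    if not nanos:
        return []
    result = []
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        rank = max(math.ceil(p * len(nanos)), 1)  # 1-based nearest rank
        result.append(nanos_to_time(nanos[rank - 1]))
    return result


if __name__ == "__main__":
    sample = [time(18, 45, 30, 500000), None, time(9, 0), time(12, 30)]
    print(time_quantiles(sample, [0.0, 0.5, 1.0]))
```

The point of the sketch is only that the quantile is computed on an integer domain and the result is converted back, so callers receive TIME values rather than doubles.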