Max Gekk created SPARK-57849:
--------------------------------

             Summary: Support the TIME data type in DataFrame approxQuantile, summary and describe
                 Key: SPARK-57849
                 URL: https://issues.apache.org/jira/browse/SPARK-57849
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.3.0
            Reporter: Max Gekk


This sub-task is part of the umbrella SPARK-57550 (extend support for the TIME 
data type).

h2. Problem
The DataFrame stat APIs do not handle TIME columns. 
{{StatFunctions.multipleApproxQuantiles}} requires {{NumericType}} and casts to 
{{DoubleType}} (execution/stat/StatFunctions.scala ~L74-77), and {{summary}} 
includes only numeric and string columns (~L194). SQL {{approx_percentile}} 
already supports TIME (SPARK-57557), but {{df.stat.approxQuantile}}, 
{{df.summary()}} and {{df.describe()}} do not route TIME columns to it.

h2. Goal
Allow TIME columns in {{stat.approxQuantile}} (computing on nanos-of-day and 
returning typed TIME quantiles) and include TIME in {{summary()}} / 
{{describe()}} percentile rows.

h2. Scope
Extend {{multipleApproxQuantiles}} to accept {{TimeType}} (delegate to the 
nanos-of-day domain, map results back to TIME); include TIME columns in 
{{summary}}/{{describe}} percentile computation.
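
As an illustration of the nanos-of-day round trip described above, here is a 
minimal standalone Java sketch. It is not Spark code: {{TimeQuantileSketch}} 
and {{quantile}} are hypothetical names, an exact sort-based quantile stands in 
for the approximate algorithm, and it assumes a TIME value can be represented 
as a {{java.time.LocalTime}}.

```java
import java.time.LocalTime;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Spark.
final class TimeQuantileSketch {

    // Computes a quantile of TIME values by working in the nanos-of-day domain.
    // Assumption: an exact, sort-based quantile replaces the approximate one.
    static LocalTime quantile(LocalTime[] values, double p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        if (p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be in [0, 1]");
        }
        // Step 1: map each TIME value to a long (nanoseconds since midnight).
        long[] nanos = Arrays.stream(values)
                .mapToLong(LocalTime::toNanoOfDay)
                .sorted()
                .toArray();
        // Step 2: pick the element at 1-based rank ceil(p * n), at least 1.
        int rank = Math.max((int) Math.ceil(p * nanos.length), 1);
        // Step 3: map the numeric result back to a typed TIME value.
        return LocalTime.ofNanoOfDay(nanos[rank - 1]);
    }

    public static void main(String[] args) {
        LocalTime[] times = {
            LocalTime.of(18, 45), LocalTime.of(9, 0), LocalTime.of(12, 30)
        };
        System.out.println(quantile(times, 0.5));
    }
}
```

The point of the sketch is that the quantile itself is computed on plain 
integers, so the only TIME-specific work is the conversion at the two ends.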

h2. Acceptance criteria
* {{df.stat.approxQuantile("t", Array(0.5), 0.01)}} works for a TIME column.
* {{df.summary()}} shows TIME percentiles.

h2. Testing
Add coverage in {{DataFrameStatSuite}} / {{StatFunctionsSuite}}.
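
One property such tests could check is the documented {{approxQuantile}} error 
bound: for N rows, probability p and relative error err, the rank of the 
returned value lies between floor((p - err) * N) and ceil((p + err) * N). The 
standalone Java sketch below shows that check for TIME values. The names 
{{RankBoundCheck}} and {{withinRankBound}} are hypothetical, and the use of 
1-based ranks is an assumption.

```java
import java.time.LocalTime;
import java.util.Arrays;

// Hypothetical test helper for illustration only; not part of Spark.
final class RankBoundCheck {

    // Returns true if `candidate` occurs in `data` and one of its 1-based ranks
    // falls inside [floor((p - err) * n), ceil((p + err) * n)].
    static boolean withinRankBound(LocalTime[] data, LocalTime candidate,
                                   double p, double err) {
        int n = data.length;
        long c = candidate.toNanoOfDay();
        // Values strictly smaller than the candidate.
        long below = Arrays.stream(data)
                .filter(t -> t.toNanoOfDay() < c)
                .count();
        // Values smaller than or equal to the candidate.
        long atOrBelow = Arrays.stream(data)
                .filter(t -> t.toNanoOfDay() <= c)
                .count();
        if (atOrBelow == below) {
            return false; // the candidate is not a sample from the data
        }
        // With ties, the candidate occupies ranks below + 1 through atOrBelow.
        long lo = (long) Math.floor((p - err) * n);
        long hi = (long) Math.ceil((p + err) * n);
        return below + 1 <= hi && atOrBelow >= lo;
    }
}
```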

h2. Dependencies
None. Conceptually related to SPARK-57557 (SQL-level TIME quantiles).




