[ 
https://issues.apache.org/jira/browse/SPARK-51162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-51162:
----------------------------------
    Affects Version/s: 4.1.0
                           (was: 4.0.0)

> SPIP: Add the TIME data type
> ----------------------------
>
>                 Key: SPARK-51162
>                 URL: https://issues.apache.org/jira/browse/SPARK-51162
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 4.1.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>              Labels: SPIP
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely 
> no jargon.*
> Add a new data type *TIME* to Spark SQL that represents a time value with 
> the fields hour, minute, and second, up to microsecond precision. All 
> operations over the type are performed without taking any time zone into 
> account. The new data type should conform to the type *TIME\(n\) WITHOUT 
> TIME ZONE* defined by the SQL standard, where 0 <= n <= 6.
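> For illustration, such a value is a plain wall-clock time of day, as modeled 
> by java.time.LocalTime, the external type proposed in Appendix B:
> {code:scala}
> import java.time.LocalTime
> 
> // A TIME value is a time of day with up to microsecond precision,
> // independent of any time zone.
> val t = LocalTime.of(23, 59, 59, 999999000) // nanoOfSecond = 999999000
> println(t) // 23:59:59.999999
> {code}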
> *Q2. What problem is this proposal NOT designed to solve?*
> Support for the TIME type with time zone defined by the SQL standard, 
> {*}TIME\(n\) WITH TIME ZONE{*}, is out of scope.
> TIME with local time zone is also out of scope.
> *Q3. How is it done today, and what are the limits of current practice?*
> The TIME type can be emulated via the TIMESTAMP_NTZ data type by setting the 
> date part to some constant value such as 1970-01-01, 0001-01-01, or 
> 0000-00-00 (though the last one is outside the supported range of dates).
> Although the type can be emulated via TIMESTAMP_NTZ, Spark SQL cannot 
> recognize it in data sources and, for instance, cannot load TIME values 
> from Parquet files.
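> A minimal sketch of this workaround (assuming a local SparkSession; the 
> column name and output path are illustrative):
> {code:scala}
> import java.time.{LocalDate, LocalDateTime, LocalTime}
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder().master("local[*]").getOrCreate()
> import spark.implicits._
> 
> // The time of day we actually care about.
> val timeOfDay = LocalTime.of(12, 34, 56)
> 
> // Emulate TIME by attaching the constant date 1970-01-01 and storing the
> // result as TIMESTAMP_NTZ (java.time.LocalDateTime maps to timestamp_ntz).
> val emulated = LocalDateTime.of(LocalDate.of(1970, 1, 1), timeOfDay)
> val df = Seq(emulated).toDF("t")
> df.printSchema() // t: timestamp_ntz
> 
> // The emulated values can be written to Parquet, but real TIME columns
> // written by other systems cannot be read back this way today.
> df.write.mode("overwrite").parquet("/tmp/time_as_timestamp_ntz")
> {code}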
> *Q4. What is new in your approach and why do you think it will be successful?*
> The approach is not new, and we have a clear picture of how to split the work 
> into sub-tasks based on our experience of adding the ANSI interval and 
> TIMESTAMP_NTZ types.
> *Q5. Who cares? If you are successful, what difference will it make?*
> The new type simplifies migrations to Spark SQL from other DBMSs such as 
> PostgreSQL, Snowflake, Google SQL, Amazon Redshift, Teradata, and DB2: such 
> users don't have to rewrite their SQL code to emulate the TIME type. The new 
> functionality also benefits existing Spark SQL users who need to load data 
> with TIME values that were stored by other systems.
> *Q6. What are the risks?*
> Additional handling of the new type in operators, expressions, and data 
> sources can cause performance regressions. This risk can be mitigated by 
> developing time benchmarks in parallel with adding support for the new type 
> in different places in Spark SQL.
>  
> *Q7. How long will it take?*
> In total it might take around {*}9 months{*}. The estimate is based on 
> similar tasks: ANSI intervals (SPARK-27790) and TIMESTAMP_NTZ (SPARK-35662). 
> We can split the work into functional blocks:
>  # Base functionality - *3 weeks*
> Add new type TimeType, forming/parsing time literals, type constructor, and 
> external types.
>  # Persistence - *3.5 months*
> Ability to create tables of the type TIME, read/write from/to Parquet and 
> other built-in data sources, partitioning, stats, predicate push down.
>  # Time operators - *2 months*
> Arithmetic ops, field extract, sorting, and aggregations.
>  # Client support - *1 month*
> JDBC, Hive, Thrift server, Spark Connect
>  # PySpark integration - *1 month*
> DataFrame support, pandas API, Python UDFs, Arrow column vectors
>  # Docs + testing/benchmarking - *1 month*
> *Q8. What are the mid-term and final “exams” to check for success?*
> The mid-term is in 4 months: basic functionality, reading/writing the new 
> type from/to built-in data sources, and basic time operations such as 
> arithmetic ops and casting.
> The final "exam" is to support the same functionality as the other datetime 
> types: TIMESTAMP_NTZ, DATE, and TIMESTAMP.
> *Appendix A. Proposed API Changes.*
> Add a new case class *TimeType* to {_}org.apache.spark.sql.types{_}:
> {code:scala}
> /**
>  * The time type represents a time value with fields hour, minute, second, 
>  * up to microseconds.
>  * The range of times supported is 00:00:00.000000 to 23:59:59.999999.
>  *
>  * Please use the singleton `DataTypes.TimeType` to refer to the type.
>  */
> case class TimeType(precisionField: Byte) extends DatetimeType {
>   /**
>    * The default size of a value of the TimeType is 8 bytes.
>    */
>   override def defaultSize: Int = 8
>   private[spark] override def asNullable: TimeType = this
> }
> {code}
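> A hypothetical usage sketch, assuming the proposed class above (the 
> constructor, precision value, and field names are illustrative, not a 
> released API):
> {code:scala}
> import org.apache.spark.sql.types._
> 
> // Assumes the proposed TimeType above; precision 6 means microseconds.
> val schema = StructType(Seq(
>   StructField("id", LongType, nullable = false),
>   StructField("wake_up_at", TimeType(6), nullable = true)
> ))
> {code}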
> *Appendix B:* As the external types for the new TIME type, we propose:
>  - Java/Scala: 
> [java.time.LocalTime|https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/time/LocalTime.html]
>  - PySpark: 
> [time|https://docs.python.org/3/library/datetime.html#time-objects]



