[ https://issues.apache.org/jira/browse/SPARK-51162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-51162: ---------------------------------- Affects Version/s: 4.1.0 (was: 4.0.0) > SPIP: Add the TIME data type > ---------------------------- > > Key: SPARK-51162 > URL: https://issues.apache.org/jira/browse/SPARK-51162 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 4.1.0 > Reporter: Max Gekk > Assignee: Max Gekk > Priority: Major > Labels: SPIP > > *Q1. What are you trying to do? Articulate your objectives using absolutely > no jargon.* > Add new data type *TIME* to Spark SQL which represents a time value with > fields hour, minute, second, up to microseconds. All operations over the type > are performed without taking any time zone into account. New data type should > conform to the type *TIME\(n\) WITHOUT TIME ZONE* defined by the SQL standard > where 0 <= n <= 6. > *Q2. What problem is this proposal NOT designed to solve?* > Don't support the TIME type with time zone defined by the SQL standard: > {*}TIME\(n\) WITH TIME ZONE{*}. > Also don't support TIME with local timezone. > *Q3. How is it done today, and what are the limits of current practice?* > The TIME type can be emulated via the TIMESTAMP_NTZ data type by setting the > date part to the some constant value like 1970-01-01, 0001-01-01 or > 0000-00-00 (though this is out of supported range of dates). > Although the type can be emulation via TIMESTAMP_NTZ, Spark SQL cannot > recognize it in data sources, and for instance cannot load the TIME values > from parquet files. > *Q4. What is new in your approach and why do you think it will be successful?* > The approach is not new, and we have clear picture how to split the work by > sub-tasks based on our experience of adding new types ANSI intervals and > TIMESTAMP_NTZ. > *Q5. Who cares? If you are successful, what difference will it make?* > The new type simplifies migrations to Spark SQL from other DBMS like > PostgreSQL, Snowflake, Google SQL, Amazon Redshift, Teradata, DB2. Such users > don't have to rewrite their SQL code to emulate the TIME type. Also new > functionality impacts on existing Spark SQL users who need to load data w/ > the TIME values that were stored by other systems. > *Q6. What are the risks?* > Additional handling new type in operators, expression and data sources can > cause performance regressions. Such risk can be compensated by developing > time benchmarks in parallel with supporting new type in different places in > Spark SQL. > > *Q7. How long will it take?* > In total it might take around {*}9 months{*}. The estimation is based on > similar tasks: ANSI intervals (SPARK-27790) and TIMESTAMP_NTZ (SPARK-35662). > We can split the work by function blocks: > # Base functionality - *3 weeks* > Add new type TimeType, forming/parsing time literals, type constructor, and > external types. > # Persistence - *3.5 months* > Ability to create tables of the type TIME, read/write from/to Parquet and > other built-in data types, partitioning, stats, predicate push down. > # Time operators - *2 months* > Arithmetic ops, field extract, sorting, and aggregations. > # Clients support - *1 month* > JDBC, Hive, Thrift server, connect > # PySpark integration - *1 month* > DataFrame support, pandas API, python UDFs, Arrow column vectors > # Docs + testing/benchmarking - *1 month* > *Q8. What are the mid-term and final “exams” to check for success?* > The mid-term is in 4 month: basic functionality, read/write new type to > built-in datasources, basic time operations such as arithmetic ops, casting. > The final "exams" is to support the same functionality as other time types: > TIMESTAMP_NTZ, DATE, TIMESTAMP. > *Appendix A. Proposed API Changes.* > Add new case class *TimeType* to {_}org.apache.spark.sql.types{_}: > {code:scala} > /** > * The time type represents a time value with fields hour, minute, second, up > to microseconds. > * The range of times supported is 00:00:00.000000 to 23:59:59.999999. > * > * Please use the singleton `DataTypes.TimeType` to refer the type. > */ > class TimeType(precisionField: Byte) private () extends DatetimeType { > /** > * The default size of a value of the TimeType is 8 bytes. > */ > override def defaultSize: Int = 8 > private[spark] override def asNullable: DateType = this > } > {code} > *Appendix B:* As the external types for the new TIME type, we propose: > - Java/Scala: > [java.time.LocalTime|https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/time/LocalTime.html] > - PySpark: > [time|https://docs.python.org/3/library/datetime.html#time-objects] -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org