That's just my point 4, isn't it?
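
For concreteness, here is a minimal sketch of the two semantics being contrasted in the thread below, using only java.time rather than any Spark API (so nothing here depends on a particular Spark version): java.time.Instant behaves like a timezone-aware TIMESTAMP (a fixed point on the timeline whose rendering depends on the zone), while java.time.LocalDateTime behaves like the ANSI TIMESTAMP WITHOUT TIME ZONE (a plain calendar/clock reading that no zone can change).

    import java.time.{Instant, LocalDateTime, ZoneId}

    // An Instant is an absolute point on the timeline; only its rendering
    // depends on the time zone (roughly the current Spark TIMESTAMP behaviour).
    val instant = Instant.parse("2017-05-26T10:00:00Z")
    println(instant.atZone(ZoneId.of("UTC")).toLocalDateTime)
    // 2017-05-26T10:00
    println(instant.atZone(ZoneId.of("America/Los_Angeles")).toLocalDateTime)
    // 2017-05-26T03:00

    // A LocalDateTime carries no zone at all, so it reads the same everywhere
    // (the ANSI TIMESTAMP WITHOUT TIME ZONE semantics Zoltan describes).
    val wallClock = LocalDateTime.parse("2017-05-26T10:00:00")
    println(wallClock)
    // 2017-05-26T10:00
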
On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:

> Reynold,
> my point is that Spark should aim to follow the SQL standard instead of
> rolling its own type system.
> If I understand correctly, the existing implementation is similar to the
> TIMESTAMP WITH LOCAL TIME ZONE data type in Oracle.
> In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIME ZONE
> data types, which are missing from Spark.
> So, it would be better (for me) if, instead of extending the existing
> types, Spark just implemented the additional well-defined types properly.
> Simply copy-pasting CREATE TABLE between SQL engines should not be an
> exercise in flags and incompatibilities.
>
> Regarding the current behaviour: if I remember correctly, I had to force
> our Spark O/S user into UTC so Spark won't change my timestamps.
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Zoltan,
>>
>> Thanks for raising this again, although I'm a bit confused, since I've
>> communicated with you a few times on JIRA and in private emails to explain
>> that you have some misunderstanding of the timestamp type in Spark and that
>> some of your statements are wrong (e.g. the "except text file" part). Not
>> sure why you didn't get any of those.
>>
>> Here's another try:
>>
>> 1. I think you guys misunderstood the semantics of timestamp in Spark
>> before the session local timezone change. IIUC, Spark has always assumed
>> timestamps to be with timezone, since it parses timestamps with timezone
>> and does all the datetime conversions with timezone in mind (it doesn't
>> ignore the timezone if a timestamp string has one specified). The session
>> local timezone change pushes Spark further in that direction, but the
>> semantics were "with timezone" even before that change. Just run Spark on
>> machines with different timezones and you will see what I'm talking about.
>>
>> 2. CSV/text is no different. The data type has always been "with
>> timezone". If you put a timezone in the timestamp string, it parses the
>> timezone.
>>
>> 3. We can't change the semantics now, because it'd break all existing
>> Spark apps.
>>
>> 4. We can, however, introduce a new timestamp-without-timezone type, and
>> have a config flag to specify which one (with tz or without tz) is the
>> default behavior.
>>
>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>>
>>> Hi,
>>>
>>> Sorry if you receive this mail twice; it seems that my first attempt did
>>> not make it to the list for some reason.
>>>
>>> I would like to start a discussion about SPARK-18350
>>> <https://issues.apache.org/jira/browse/SPARK-18350> before it gets
>>> released, because it seems to be going in a different direction than what
>>> other SQL engines of the Hadoop stack do.
>>>
>>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT
>>> TIME ZONE) to have timezone-agnostic semantics - basically a type that
>>> expresses readings from calendars and clocks and is unaffected by time
>>> zone. In the Hadoop stack, Impala has always worked like this, and
>>> recently Presto also took steps
>>> <https://github.com/prestodb/presto/issues/7122> to become
>>> standards-compliant. (Presto's design doc
>>> <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit>
>>> also contains a great summary of the different semantics.) Hive has a
>>> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
>>> source of incompatibility that is already being addressed
>>> <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in
>>> SparkSQL, however, has UTC-normalized local time semantics (except for
>>> textfile), which is generally the semantics of the TIMESTAMP WITH TIME
>>> ZONE type.
>>>
>>> Given that timezone-agnostic TIMESTAMP semantics provide standards
>>> compliance and consistency with most SQL engines, I was wondering whether
>>> SparkSQL should also consider them in order to become ANSI SQL compliant
>>> and interoperable with other SQL engines of the Hadoop stack. Should
>>> SparkSQL adopt these semantics in the future, SPARK-18350
>>> <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be
>>> a source of problems. Please correct me if I'm wrong, but this change
>>> seems to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the
>>> TIMESTAMP type. I think SPARK-18350 would be a great feature for a
>>> separate TIMESTAMP WITH TIME ZONE type, but the plain unqualified
>>> TIMESTAMP type would be better off becoming timezone-agnostic instead of
>>> gaining further timezone-aware capabilities. (Of course, becoming
>>> timezone-agnostic would be a behavior change, so it must be optional and
>>> configurable by the user, as in Presto.)
>>>
>>> I would like to hear your opinions about this concern and about
>>> TIMESTAMP semantics in general. Does the community agree that a
>>> standards-compliant and interoperable TIMESTAMP type is desired? Do you
>>> perceive SPARK-18350 as a potential problem in achieving this, or do I
>>> misunderstand the effects of this change?
>>>
>>> Thanks,
>>>
>>> Zoltan
>>>
>>> ---
>>>
>>> List of links in case in-line links do not work:
>>>
>>>    - SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>>>    - Presto's change: https://github.com/prestodb/presto/issues/7122
>>>    - Presto's design doc:
>>>      https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit
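
For reference, a rough spark-shell sketch of the behaviour Reynold describes in points 1-2, as I understand it. This assumes a Spark 2.2 snapshot with the session time zone config introduced by SPARK-18350 (spark.sql.session.timeZone); the exact timestamp string formats accepted by CAST and the exact output formatting may differ. The point is that a string carrying an explicit offset is parsed as an absolute instant and then rendered in the session time zone, i.e. "with timezone" semantics rather than wall-clock semantics.

    // Sketch only: behaviour as described in the thread, not verified output.
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    spark.sql("SELECT CAST('2017-05-26 10:00:00+00:00' AS TIMESTAMP) AS ts").show(false)
    // |2017-05-26 10:00:00|   <- the offset in the string is honored at parse time

    spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
    spark.sql("SELECT CAST('2017-05-26 10:00:00+00:00' AS TIMESTAMP) AS ts").show(false)
    // |2017-05-26 03:00:00|   <- same stored instant, rendered in the session zone

Under a timezone-agnostic TIMESTAMP (the ANSI TIMESTAMP WITHOUT TIME ZONE that Zoltan advocates), the wall-clock reading 10:00:00 would be kept as-is regardless of the session time zone.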