I would suggest that making timestamp type behavior configurable and persisted per-table could introduce some real confusion, e.g. in queries involving tables with different timestamp type semantics.
I suggest starting with the assumption that timestamp type behavior is a per-session flag that can be set in a global `spark-defaults.conf`, and considering more granular levels of configuration as people identify solid use cases.

Cheers,
Michael

> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>
> Hi,
>
> If I remember correctly, the TIMESTAMP type had UTC-normalized local time semantics even before Spark 2, so I can understand that Spark considers it to be the "established" behavior that must not be broken. Unfortunately, this behavior does not provide interoperability with other SQL engines of the Hadoop stack.
>
> Let me summarize the findings of this e-mail thread so far:
> - Timezone-agnostic TIMESTAMP semantics would be beneficial for interoperability and SQL compliance.
> - Spark cannot make a breaking change. For backward compatibility with existing data, timestamp semantics should be user-configurable on a per-table level.
>
> Before going into the specifics of a possible solution, do we all agree on these points?
>
> Thanks,
>
> Zoltan
>
> On Sat, May 27, 2017 at 8:57 PM Imran Rashid <iras...@cloudera.com> wrote:
> I had asked Zoltan to bring this discussion to the dev list because I think it's a question that extends beyond a single JIRA (we can't figure out the semantics of timestamp in Parquet if we don't know the overall goal of the timestamp type), and since it's a design question the entire community should be involved.
>
> I think that a lot of the confusion comes from the fact that we're talking about different ways time zones affect behavior: (1) parsing, and (2) behavior when changing time zones for processing data.
>
> It seems we agree that Spark should eventually provide a timestamp type which does conform to the standard. The question is, how do we get there? Has Spark already broken compliance so much that it's impossible to go back without breaking user behavior? Or perhaps Spark already has inconsistent behavior / broken compatibility within the 2.x line, so it's not unthinkable to have another breaking change?
>
> (Another part of the confusion is on me -- I believed the behavior change was in 2.2, but actually it looks like it's in 2.0.1. That changes how we think about this in the context of what goes into a 2.2 release. SPARK-18350 isn't the origin of the difference in behavior.)
>
> First: consider processing data that is already stored in tables, and then accessing it from machines in different time zones. The standard is clear that "timestamp" should be just like "timestamp without time zone": it does not represent one instant in time; rather, it is always displayed the same, regardless of time zone. This was the behavior in Spark 2.0.0 (and 1.6) for Hive tables stored as text files, and for Spark's JSON format.
>
> Spark 2.0.1 changed the behavior of the JSON format (I believe with SPARK-16216), so that it behaves more like timestamp *with* time zone. It also makes CSV behave the same (timestamp in CSV was basically broken in 2.0.0). However, it did *not* change the behavior of a Hive textfile; it still behaves like "timestamp with*out* time zone". Here are some experiments I tried -- there are a bunch of files there for completeness, but mostly focus on the difference between query_output_2_0_0.txt vs. query_output_2_0_1.txt:
>
> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70
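The kind of round trip compared in that gist can be sketched roughly as follows. This is not taken from the gist; the path, table name and literal are made up for illustration, and a Spark 2.x build with Hive support is assumed. Running it in spark-shells launched under different JVM time zones (e.g. TZ=UTC vs. TZ=America/Los_Angeles) and comparing what is displayed for the same stored data shows which formats behave like "with time zone" and which do not; per the message above, JSON behaves differently starting in 2.0.1 while the Hive textfile does not change.

    // Sketch only: write one timestamp value through two formats, then read it
    // back. The path, table name and literal are made up for illustration; a
    // Spark 2.x build with Hive support is assumed for the textfile table.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

    val spark = SparkSession.builder()
      .appName("timestamp-semantics-check")
      .enableHiveSupport()                     // needed for "stored as textfile"
      .getOrCreate()
    import spark.implicits._

    val df = Seq("2016-01-01 00:00:00").toDF("s")
      .selectExpr("cast(s as timestamp) as ts")
    df.write.mode("overwrite").json("/tmp/ts_json")
    spark.sql("create table if not exists ts_text (ts timestamp) stored as textfile")
    df.write.mode("overwrite").insertInto("ts_text")

    // Read both back. If a format behaves like "with time zone", the displayed
    // value shifts with the reader's time zone; if it behaves like "without
    // time zone", it looks the same on every machine.
    val schema = StructType(Seq(StructField("ts", TimestampType)))
    spark.read.schema(schema).json("/tmp/ts_json").show(false)
    spark.sql("select ts from ts_text").show(false)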
> Given that Spark has changed this behavior post 2.0.0, is it still out of the question to change it again to bring it back in line with the SQL standard for timestamp (without time zone) in the 2.x line? Or, as Reynold proposes, is the only option at this point to add an off-by-default feature flag to get "timestamp without time zone" semantics?
>
> Second, there is the question of parsing strings into the timestamp type. I'm far less knowledgeable about this, so I mostly just have questions:
>
> * Does the standard dictate what the parsing behavior should be for timestamp (without time zone) when a time zone is present?
>
> * If it does, and Spark violates this standard, is it worth trying to retain the *other* semantics of timestamp without time zone, even if we violate the parsing part?
>
> I did look at what Postgres does for comparison:
>
> https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c
>
> Spark's timestamp certainly does not match Postgres's timestamp for parsing; it seems closer to Postgres's "timestamp with time zone" -- though I dunno if that is standard behavior at all.
>
> thanks,
> Imran
>
> On Fri, May 26, 2017 at 1:27 AM, Reynold Xin <r...@databricks.com> wrote:
> That's just my point 4, isn't it?
>
> On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
> Reynold,
> My point is that Spark should aim to follow the SQL standard instead of rolling its own type system.
> If I understand correctly, the existing implementation is similar to the TIMESTAMP WITH LOCAL TIME ZONE data type in Oracle. In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIME ZONE data types, which are missing from Spark.
> So, it is better (for me) if, instead of extending the existing types, Spark would just implement the additional well-defined types properly. Just trying to copy-paste CREATE TABLE between SQL engines should not be an exercise in flags and incompatibilities.
>
> Regarding the current behaviour: if I remember correctly, I had to force our Spark O/S user into UTC so Spark won't change my timestamps.
>
> Ofir Manor
> Co-Founder & CTO | Equalum
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
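Forcing the whole O/S user into UTC is one way to pin the behavior down; the session-level setting discussed in this thread aims at the same problem without touching machine configuration. A minimal sketch is below, assuming the session-local time zone key added by SPARK-18350 (spark.sql.session.timeZone); as Michael suggests at the top of the thread, the same key could also be set globally in spark-defaults.conf.

    // Sketch only: pin the SQL session time zone instead of the O/S default.
    // Assumes the session-local time zone setting added by SPARK-18350
    // (spark.sql.session.timeZone); the same key can go into spark-defaults.conf.
    // `spark` is the SparkSession that spark-shell provides.
    spark.conf.set("spark.sql.session.timeZone", "UTC")

    // With the session time zone set, timestamp strings without an explicit
    // offset are interpreted in, and timestamps are rendered in, that zone
    // rather than the JVM default of whichever machine runs the query.
    spark.sql("select cast('2016-01-01 00:00:00' as timestamp) as ts").show(false)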
> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <r...@databricks.com> wrote:
> Zoltan,
>
> Thanks for raising this again, although I'm a bit confused, since I've communicated with you a few times on JIRA and in private emails to explain that you have some misunderstanding of the timestamp type in Spark and some of your statements are wrong (e.g. the "except text file" part). Not sure why you didn't get any of those.
>
> Here's another try:
>
> 1. I think you guys misunderstood the semantics of timestamp in Spark before the session local timezone change. IIUC, Spark has always assumed timestamps to be with timezone, since it parses timestamps with timezone and does all the datetime conversions with timezone in mind (it doesn't ignore the timezone if a timestamp string has one specified). The session local timezone change pushes Spark further in that direction, but the semantics were with timezone before that change. Just run Spark on machines with different time zones and you will know what I'm talking about.
>
> 2. CSV/Text is not different. The data type has always been "with timezone". If you put a timezone in the timestamp string, it parses the timezone.
>
> 3. We can't change semantics now, because it'd break all existing Spark apps.
>
> 4. We can, however, introduce a new timestamp without timezone type, and have a config flag to specify which one (with tz or without tz) is the default behavior.
>
> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:
> Hi,
>
> Sorry if you receive this mail twice; it seems that my first attempt did not make it to the list for some reason.
>
> I would like to start a discussion about SPARK-18350 <https://issues.apache.org/jira/browse/SPARK-18350> before it gets released, because it seems to be going in a different direction than what other SQL engines of the Hadoop stack do.
>
> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME ZONE) to have timezone-agnostic semantics: basically a type that expresses readings from calendars and clocks and is unaffected by time zone. In the Hadoop stack, Impala has always worked like this, and recently Presto also took steps <https://github.com/prestodb/presto/issues/7122> to become standards compliant. (Presto's design doc <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit> also contains a great summary of the different semantics.) Hive has a timezone-agnostic TIMESTAMP type as well (except for Parquet, a major source of incompatibility that is already being addressed <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in SparkSQL, however, has UTC-normalized local time semantics (except for textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE type.
>
> Given that timezone-agnostic TIMESTAMP semantics provide standards compliance and consistency with most SQL engines, I was wondering whether SparkSQL should also consider adopting them in order to become ANSI SQL compliant and interoperable with other SQL engines of the Hadoop stack. Should SparkSQL adopt these semantics in the future, SPARK-18350 may turn out to be a source of problems. Please correct me if I'm wrong, but this change seems to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be better off becoming timezone-agnostic instead of gaining further timezone-aware capabilities. (Of course, becoming timezone-agnostic would be a behavior change, so it must be optional and configurable by the user, as in Presto.)
>
> I would like to hear your opinions about this concern and about TIMESTAMP semantics in general. Does the community agree that a standards-compliant and interoperable TIMESTAMP type is desired? Do you perceive SPARK-18350 as a potential problem in achieving this, or do I misunderstand the effects of this change?
>
> Thanks,
>
> Zoltan
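To make the parsing side of this concrete (points 1 and 2 in Reynold's list above, and the second of Imran's two questions), here is a small sketch. The literals are arbitrary, and the exact values displayed depend on the Spark version and on the session or JVM time zone, which is precisely the behavior under discussion.

    // Sketch only: two literals, one without and one with an explicit UTC offset.
    // Under the "with time zone" interpretation described above, the offset in
    // the second literal is honored while parsing, and the resulting instant is
    // then rendered in the session/JVM time zone; the first literal is read as
    // local time in that same zone. `spark` is the spark-shell SparkSession.
    spark.sql("""
      select
        cast('2016-01-01 00:00:00'       as timestamp) as no_offset,
        cast('2016-01-01T00:00:00+00:00' as timestamp) as utc_offset
    """).show(false)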
>
> ---
>
> List of links in case in-line links do not work:
> SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
> Presto's change: https://github.com/prestodb/presto/issues/7122
> Presto's design doc: https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit