I would suggest that making timestamp type behavior configurable and persisted per-table could introduce some real confusion, e.g. in queries involving tables with different timestamp type semantics.
I suggest starting with the assumption that timestamp type behavior is a per-session flag that can be set in a global `spark-defaults.conf`, and considering more granular levels of configuration as people identify solid use cases.

Cheers,
Michael

> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>
> Hi,
>
> If I remember correctly, the TIMESTAMP type had UTC-normalized local time semantics even before Spark 2, so I can understand that Spark considers it to be the "established" behavior that must not be broken. Unfortunately, this behavior does not provide interoperability with other SQL engines of the Hadoop stack.
>
> Let me summarize the findings of this e-mail thread so far:
> - Timezone-agnostic TIMESTAMP semantics would be beneficial for interoperability and SQL compliance.
> - Spark cannot make a breaking change. For backward compatibility with existing data, timestamp semantics should be user-configurable on a per-table level.
>
> Before going into the specifics of a possible solution, do we all agree on these points?
>
> Thanks,
>
> Zoltan
>
> On Sat, May 27, 2017 at 8:57 PM Imran Rashid <iras...@cloudera.com> wrote:
> I had asked Zoltan to bring this discussion to the dev list because I think it's a question that extends beyond a single JIRA (we can't figure out the semantics of timestamp in Parquet if we don't know the overall goal of the timestamp type), and since it's a design question the entire community should be involved.
>
> I think that a lot of the confusion comes from the fact that we're talking about different ways time zones affect behavior: (1) parsing, and (2) behavior when changing time zones for processing data.
>
> It seems we agree that Spark should eventually provide a timestamp type which does conform to the standard. The question is, how do we get there? Has Spark already broken compliance so much that it's impossible to go back without breaking user behavior? Or perhaps Spark already has inconsistent behavior / broken compatibility within the 2.x line, so it's not unthinkable to have another breaking change?
>
> (Another part of the confusion is on me -- I believed the behavior change was in 2.2, but actually it looks like it's in 2.0.1. That changes how we think about this in the context of what goes into a 2.2 release. SPARK-18350 isn't the origin of the difference in behavior.)
>
> First: consider processing data that is already stored in tables, and then accessing it from machines in different time zones. The standard is clear that "timestamp" should be just like "timestamp without time zone": it does not represent one instant in time; rather, it is always displayed the same, regardless of time zone. This was the behavior in Spark 2.0.0 (and 1.6) for Hive tables stored as text files, and for Spark's JSON format.
>
> Spark 2.0.1 changed the behavior of the JSON format (I believe with SPARK-16216), so that it behaves more like timestamp *with* time zone. It also makes CSV behave the same (timestamp in CSV was basically broken in 2.0.0). However, it did *not* change the behavior of a Hive textfile; it still behaves like "timestamp with*out* time zone". Here are some experiments I tried -- there are a bunch of files there for completeness, but mostly focus on the difference between query_output_2_0_0.txt vs. query_output_2_0_1.txt:
>
> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70
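The kind of round trip compared in that gist can be sketched roughly as follows. This is not taken from the gist; the path, table name and literal are made up for illustration, and a Spark 2.x build with Hive support is assumed. Running it in spark-shells launched under different JVM time zones (e.g. TZ=UTC vs. TZ=America/Los_Angeles) and comparing what is displayed for the same stored data shows which formats behave like "with time zone" and which do not; per the message above, JSON behaves differently starting in 2.0.1 while the Hive textfile does not change.

    // Sketch only: write one timestamp value through two formats, then read it
    // back. The path, table name and literal are made up for illustration; a
    // Spark 2.x build with Hive support is assumed for the textfile table.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

    val spark = SparkSession.builder()
      .appName("timestamp-semantics-check")
      .enableHiveSupport()                     // needed for "stored as textfile"
      .getOrCreate()
    import spark.implicits._

    val df = Seq("2016-01-01 00:00:00").toDF("s")
      .selectExpr("cast(s as timestamp) as ts")
    df.write.mode("overwrite").json("/tmp/ts_json")
    spark.sql("create table if not exists ts_text (ts timestamp) stored as textfile")
    df.write.mode("overwrite").insertInto("ts_text")

    // Read both back. If a format behaves like "with time zone", the displayed
    // value shifts with the reader's time zone; if it behaves like "without
    // time zone", it looks the same on every machine.
    val schema = StructType(Seq(StructField("ts", TimestampType)))
    spark.read.schema(schema).json("/tmp/ts_json").show(false)
    spark.sql("select ts from ts_text").show(false)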
> Given that Spark has changed this behavior post 2.0.0, is it still out of the question to change it again to bring it back in line with the SQL standard for timestamp (without time zone) in the 2.x line? Or, as Reynold proposes, is the only option at this point to add an off-by-default feature flag to get "timestamp without time zone" semantics?
>
> Second, there is the question of parsing strings into the timestamp type. I'm far less knowledgeable about this, so I mostly just have questions:
>
> * Does the standard dictate what the parsing behavior should be for timestamp (without time zone) when a time zone is present?
>
> * If it does, and Spark violates this standard, is it worth trying to retain the *other* semantics of timestamp without time zone, even if we violate the parsing part?
>
> I did look at what Postgres does for comparison:
>
> https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c
>
> Spark's timestamp certainly does not match Postgres's timestamp for parsing; it seems closer to Postgres's "timestamp with time zone" -- though I dunno if that is standard behavior at all.
>
> thanks,
> Imran
>
> On Fri, May 26, 2017 at 1:27 AM, Reynold Xin <r...@databricks.com> wrote:
> That's just my point 4, isn't it?
>
> On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
> Reynold,
> My point is that Spark should aim to follow the SQL standard instead of rolling its own type system.
> If I understand correctly, the existing implementation is similar to the TIMESTAMP WITH LOCAL TIME ZONE data type in Oracle. In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIME ZONE data types, which are missing from Spark.
> So, it is better (for me) if, instead of extending the existing types, Spark would just implement the additional well-defined types properly. Just trying to copy-paste CREATE TABLE between SQL engines should not be an exercise in flags and incompatibilities.
>
> Regarding the current behaviour: if I remember correctly, I had to force our Spark O/S user into UTC so Spark won't change my timestamps.
>
> Ofir Manor
> Co-Founder & CTO | Equalum
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
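Forcing the whole O/S user into UTC is one way to pin the behavior down; the session-level setting discussed in this thread aims at the same problem without touching machine configuration. A minimal sketch is below, assuming the session-local time zone key added by SPARK-18350 (spark.sql.session.timeZone); as Michael suggests at the top of the thread, the same key could also be set globally in spark-defaults.conf.

    // Sketch only: pin the SQL session time zone instead of the O/S default.
    // Assumes the session-local time zone setting added by SPARK-18350
    // (spark.sql.session.timeZone); the same key can go into spark-defaults.conf.
    // `spark` is the SparkSession that spark-shell provides.
    spark.conf.set("spark.sql.session.timeZone", "UTC")

    // With the session time zone set, timestamp strings without an explicit
    // offset are interpreted in, and timestamps are rendered in, that zone
    // rather than the JVM default of whichever machine runs the query.
    spark.sql("select cast('2016-01-01 00:00:00' as timestamp) as ts").show(false)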
> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <r...@databricks.com> wrote:
> Zoltan,
>
> Thanks for raising this again, although I'm a bit confused, since I've communicated with you a few times on JIRA and in private emails to explain that you have some misunderstanding of the timestamp type in Spark and some of your statements are wrong (e.g. the "except text file" part). Not sure why you didn't get any of those.
>
> Here's another try:
>
> 1. I think you guys misunderstood the semantics of timestamp in Spark before the session local timezone change. IIUC, Spark has always assumed timestamps to be with timezone, since it parses timestamps with timezone and does all the datetime conversions with timezone in mind (it doesn't ignore the timezone if a timestamp string has one specified). The session local timezone change pushes Spark further in that direction, but the semantics were with timezone before that change. Just run Spark on machines with different time zones and you will know what I'm talking about.
>
> 2. CSV/Text is not different. The data type has always been "with timezone". If you put a timezone in the timestamp string, it parses the timezone.
>
> 3. We can't change semantics now, because it'd break all existing Spark apps.
>
> 4. We can, however, introduce a new timestamp without timezone type, and have a config flag to specify which one (with tz or without tz) is the default behavior.
>
> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:
> Hi,
>
> Sorry if you receive this mail twice; it seems that my first attempt did not make it to the list for some reason.
>
> I would like to start a discussion about SPARK-18350 <https://issues.apache.org/jira/browse/SPARK-18350> before it gets released, because it seems to be going in a different direction than what other SQL engines of the Hadoop stack do.
>
> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME ZONE) to have timezone-agnostic semantics: basically a type that expresses readings from calendars and clocks and is unaffected by time zone. In the Hadoop stack, Impala has always worked like this, and recently Presto also took steps <https://github.com/prestodb/presto/issues/7122> to become standards compliant. (Presto's design doc <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit> also contains a great summary of the different semantics.) Hive has a timezone-agnostic TIMESTAMP type as well (except for Parquet, a major source of incompatibility that is already being addressed <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in SparkSQL, however, has UTC-normalized local time semantics (except for textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE type.
>
> Given that timezone-agnostic TIMESTAMP semantics provide standards compliance and consistency with most SQL engines, I was wondering whether SparkSQL should also consider adopting them in order to become ANSI SQL compliant and interoperable with other SQL engines of the Hadoop stack. Should SparkSQL adopt these semantics in the future, SPARK-18350 may turn out to be a source of problems. Please correct me if I'm wrong, but this change seems to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be better off becoming timezone-agnostic instead of gaining further timezone-aware capabilities. (Of course, becoming timezone-agnostic would be a behavior change, so it must be optional and configurable by the user, as in Presto.)
>
> I would like to hear your opinions about this concern and about TIMESTAMP semantics in general. Does the community agree that a standards-compliant and interoperable TIMESTAMP type is desired? Do you perceive SPARK-18350 as a potential problem in achieving this, or do I misunderstand the effects of this change?
>
> Thanks,
>
> Zoltan
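To make the parsing side of this concrete (points 1 and 2 in Reynold's list above, and the second of Imran's two questions), here is a small sketch. The literals are arbitrary, and the exact values displayed depend on the Spark version and on the session or JVM time zone, which is precisely the behavior under discussion.

    // Sketch only: two literals, one without and one with an explicit UTC offset.
    // Under the "with time zone" interpretation described above, the offset in
    // the second literal is honored while parsing, and the resulting instant is
    // then rendered in the session/JVM time zone; the first literal is read as
    // local time in that same zone. `spark` is the spark-shell SparkSession.
    spark.sql("""
      select
        cast('2016-01-01 00:00:00'       as timestamp) as no_offset,
        cast('2016-01-01T00:00:00+00:00' as timestamp) as utc_offset
    """).show(false)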
>
> ---
>
> List of links in case in-line links do not work:
> SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
> Presto's change: https://github.com/prestodb/presto/issues/7122
> Presto's design doc: https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit