Hi,

Ofir, thanks for your support. My understanding is that many users have the
same problem as you do.

Reynold, thanks for your reply and sorry for the confusion. My personal
e-mail was specifically about your concerns regarding SPARK-12297, and I
started this separate thread because it is about the general vision for
the TIMESTAMP type, which may be of interest to the whole community. My
initial e-mail did not address your concerns because I wrote it before
you replied on the other thread.

Regarding your specific concerns:

1. I realize that the TIMESTAMP type in Spark already has UTC-normalized
local time semantics, but I believe that this is problematic for
consistency and interoperability with other SQL engines. In my opinion
standard-compliant behavior would be best, and since SPARK-18350 takes
SparkSQL even further away from it, I am worried that it will make fixing
this incompatibility even harder.

2. If a timezone is present in a textfile, SparkSQL can indeed parse it.
However, if no explicit timezone is given, SparkSQL parses the TIMESTAMP
as a local time, and when the result is displayed to the user (without
the timezone), it looks identical regardless of the current timezone.
This actually matches the way Hive approximates timezone-agnostic
TIMESTAMP behavior. Since Hive's in-memory timestamp representation is
UTC-normalized local time (similar to Spark), reading timestamps in
different timezones results in different UTC values in the in-memory
representation. However, when they are rendered, they look the same, so
the apparent behavior matches the desired timezone-agnostic semantics; a
small sketch follows below. (The reason why this is only an approximation
is that timestamps skipped due to DST changes cannot be represented this
way.)

But even if we consider textfile not to be an exception, it is still not
SQL-compliant for the TIMESTAMP type to have TIMESTAMP WITH TIME ZONE
semantics.
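
To make the approximation described in point 2 concrete, here is a
minimal, self-contained Scala sketch (plain java.time, not actual Hive or
Spark code, and the timezone names are just examples): the same
timezone-less string parsed in two different timezones yields two
different UTC instants, yet each one renders back to the original string
in the timezone it was parsed in:

    import java.time.{LocalDateTime, ZoneId}
    import java.time.format.DateTimeFormatter

    object UtcNormalizedLocalTime {
      def main(args: Array[String]): Unit = {
        val fmt  = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
        val text = "2017-05-25 12:00:00" // no timezone in the string

        for (tz <- Seq("America/Los_Angeles", "Europe/Budapest")) {
          val zone = ZoneId.of(tz)
          // Parsing interprets the local time in the given ("session")
          // timezone, so the in-memory UTC value differs per timezone...
          val instant = LocalDateTime.parse(text, fmt).atZone(zone).toInstant
          println(s"$tz -> epoch millis: ${instant.toEpochMilli}")
          // ...but rendering in the same timezone reproduces the original
          // string, so the timestamps *look* timezone-agnostic.
          println(s"$tz -> rendered:     ${fmt.format(instant.atZone(zone))}")
        }
      }
    }

A local time that falls into a DST gap of the chosen timezone cannot be
round-tripped like this, which is why I call it an approximation.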

3. I agree that Spark must not break compatibility in the interpretation
of already existing data, but I don't think this means that we can't
change semantics now. It just means that we have to make the behavior
configurable, as I suggested in the initial mail of this thread.

Actually, the requirement of never breaking compatibility is the exact
reason why I'm worried about SPARK-18350: once people start using that
feature, it will be even harder to change semantics while preserving
compatibility. (On the other hand, SPARK-18350 would be an essential
feature for a separate TIMESTAMP WITH TIME ZONE type.)

4. The ability to choose the desired behavior of a TIMESTAMP, as you
suggest, actually solves the problem of breaking compatibility. However,
I don't think that a central configuration flag is enough. Since users
who already have timestamp data may also want standard-compliant behavior
for new tables, I think there needs to be a table-specific override for
the global configuration flag, as sketched below. In fact, that is what
we wanted to achieve in SPARK-12297, although our effort was limited to
the Parquet format.
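
Just to illustrate the resolution order I have in mind (everything below
is made up for the sake of the example and paste-able into a Scala REPL;
none of these names exist in Spark): a table-level setting, when present,
takes precedence over the global default, so existing tables keep their
current interpretation while new tables can opt in to the
standard-compliant one.

    // Hypothetical sketch only: neither these values nor the helper exist
    // in Spark; they just illustrate a per-table override on top of a
    // global flag.
    sealed trait TimestampSemantics
    case object WithTimeZone    extends TimestampSemantics // current Spark behavior
    case object WithoutTimeZone extends TimestampSemantics // ANSI SQL TIMESTAMP

    def effectiveSemantics(
        globalDefault: TimestampSemantics,
        tableOverride: Option[TimestampSemantics]): TimestampSemantics =
      tableOverride.getOrElse(globalDefault)

    // Existing tables keep the old interpretation, new tables can opt in:
    val oldTable = effectiveSemantics(WithTimeZone, tableOverride = None)
    val newTable = effectiveSemantics(WithTimeZone, tableOverride = Some(WithoutTimeZone))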

Zoltan

On Thu, May 25, 2017 at 12:33 PM Reynold Xin <r...@databricks.com> wrote:

> Zoltan,
>
> Thanks for raising this again, although I'm a bit confused since I've
> communicated with you a few times on JIRA and on private emails to explain
> that you have some misunderstanding of the timestamp type in Spark and some
> of your statements are wrong (e.g. the except text file part). Not sure why
> you didn't get any of those.
>
>
> Here's another try:
>
>
> 1. I think you guys misunderstood the semantics of timestamp in Spark
> before session local timezone change. IIUC, Spark has always assumed
> timestamps to be with timezone, since it parses timestamps with timezone
> and does all the datetime conversions with timezone in mind (it doesn't
> ignore timezone if a timestamp string has timezone specified). The session
> local timezone change further pushes Spark to that direction, but the
> semantics has been with timezone before that change. Just run Spark on
> machines with different timezone and you will know what I'm talking about.
>
> 2. CSV/Text is not different. The data type has always been "with
> timezone". If you put a timezone in the timestamp string, it parses the
> timezone.
>
> 3. We can't change semantics now, because it'd break all existing Spark
> apps.
>
> 4. We can however introduce a new timestamp without timezone type, and
> have a config flag to specify which one (with tz or without tz) is the
> default behavior.
>
>
>
> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>
>> Hi,
>>
>> Sorry if you receive this mail twice, it seems that my first attempt did
>> not make it to the list for some reason.
>>
>> I would like to start a discussion about SPARK-18350
>> <https://issues.apache.org/jira/browse/SPARK-18350> before it gets
>> released because it seems to be going in a different direction than what
>> other SQL engines of the Hadoop stack do.
>>
>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME
>> ZONE) to have timezone-agnostic semantics - basically a type that expresses
>> readings from calendars and clocks and is unaffected by time zone. In the
>> Hadoop stack, Impala has always worked like this and recently Presto also
>> took steps <https://github.com/prestodb/presto/issues/7122> to become
>> standards compliant. (Presto's design doc
>> <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit>
>> also contains a great summary of the different semantics.) Hive has a
>> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
>> source of incompatibility that is already being addressed
>> <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in
>> SparkSQL, however, has UTC-normalized local time semantics (except for
>> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
>> type.
>>
>> Given that timezone-agnostic TIMESTAMP semantics provide standards
>> compliance and consistency with most SQL engines, I was wondering whether
>> SparkSQL should also consider it in order to become ANSI SQL compliant and
>> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
>> adopt this semantics in the future, SPARK-18350
>> <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be a
>> source of problems. Please correct me if I'm wrong, but this change seems
>> to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP
>> type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP
>> WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be
>> better becoming timezone-agnostic instead of gaining further timezone-aware
>> capabilities. (Of course becoming timezone-agnostic would be a behavior
>> change, so it must be optional and configurable by the user, as in Presto.)
>>
>> I would like to hear your opinions about this concern and about TIMESTAMP
>> semantics in general. Does the community agree that a standards-compliant
>> and interoperable TIMESTAMP type is desired? Do you perceive SPARK-18350 as
>> a potential problem in achieving this or do I misunderstand the effects of
>> this change?
>>
>> Thanks,
>>
>> Zoltan
>>
>> ---
>>
>> List of links in case in-line links do not work:
>>
>>    - SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>>    - Presto's change: https://github.com/prestodb/presto/issues/7122
>>    - Presto's design doc:
>>      https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit
>>
>
