Makes sense.
While the ISO SQL standard automatically becomes an American national (ANSI) 
standard, changes are only made to the International (ISO/IEC) Standard, which 
is the authoritative specification.
These rules are specified in SQL/Foundation (ISO/IEC SQL Part 2), section 9.2.
Could we rename the proposed default to “ISO/IEC (ANSI)”?
— Alastair

On Thu, Sep 5, 2019 at 17:17, Reynold Xin <r...@databricks.com> wrote:
Having three modes is a lot. Why not just use the ANSI mode as the default, and legacy 
for backward compatibility? Then over time there's only the ANSI mode, which is 
standard-compliant and easy to understand. We also wouldn't need to invent a 
standard just for Spark.

On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
+1
To be honest I don't like the legacy policy. It's too loose and makes it easy for 
users to make mistakes, especially since Spark returns null when a function hits 
errors like overflow.
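For example, a rough sketch of the legacy behavior (assuming a hypothetical Parquet 
table `t` with a single int column; exact output can vary by version):

    spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
    spark.sql("CREATE TABLE t (i INT) USING parquet")
    // The string is silently cast; the cast fails at runtime and NULL is stored.
    spark.sql("INSERT INTO t VALUES ('not a number')")
    spark.sql("SELECT * FROM t").show()   // i = null, no error raised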
The strict policy is not good either. It's too strict and blocks valid use cases 
like writing timestamp values to a date column. Users do expect truncation to 
happen in this case without adding a cast manually. It's also weird to use a 
Spark-specific policy that no other database uses.
The ANSI policy is better. It stops invalid use cases like writing string values 
to an int column, while keeping valid use cases like timestamp -> date.
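A rough sketch of the contrast, under the same hypothetical tables (exact error 
messages differ across versions):

    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
    // string -> int is rejected at analysis time instead of silently becoming NULL:
    spark.sql("INSERT INTO t VALUES ('not a number')")      // fails with an AnalysisException
    // timestamp -> date is still accepted; the value is truncated to a date:
    spark.sql("CREATE TABLE d (day DATE) USING parquet")
    spark.sql("INSERT INTO d SELECT current_timestamp()")   // succeeds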
I think there is no doubt that we should use the ANSI policy instead of the legacy 
policy for v1 tables. Backward compatibility aside, the ANSI policy is simply 
better than the legacy policy.
The v2 case is more arguable. Although the ANSI policy looks better to me than the 
strict policy, this is just the store assignment policy, which only partially 
controls table insertion behavior. With Spark's "return null on error" behavior, 
table insertion is more likely to insert invalid null values under the ANSI policy 
than under the strict policy.
I think we should use the ANSI policy by default for both v1 and v2 tables, because:
1. End users don't care how a table is implemented. Spark should provide consistent 
table insertion behavior between v1 and v2 tables.
2. Data Source V2 is unstable in Spark 2.x, so there is no backward compatibility 
issue. That said, the baseline for judging which policy is better should be the 
table insertion behavior in Spark 2.x, which is the legacy policy plus "return null 
on error". The ANSI policy is better than that baseline.
3. We expect more and more users to migrate their data sources to the V2 API. The 
strict policy could be a blocker, as it is too big a breaking change and may break 
many existing queries.
Thanks, Wenchen


On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang <gengliang.w...@databricks.com> wrote:
Hi everyone,

I'd like to call for a vote on SPARK-28885 
[https://issues.apache.org/jira/browse/SPARK-28885] "Follow ANSI store 
assignment rules in table insertion by default".  
When inserting a value into a column with a different data type, Spark performs 
type coercion. Currently, we support three policies for the type coercion rules: 
ANSI, legacy and strict, which can be set via the option 
"spark.sql.storeAssignmentPolicy" (a small example of setting it follows the list):
1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the 
behavior is mostly the same as PostgreSQL. It disallows certain unreasonable 
type conversions such as converting `string` to `int` and `double` to `boolean`.
2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, 
which is very loose. E.g., converting either `string` to `int` or `double` to 
`boolean` is allowed. It is the current behavior in Spark 2.x for compatibility 
with Hive.
3. Strict: Spark doesn't allow any possible precision loss or data truncation in 
type coercion, e.g., converting either `double` to `int` or `decimal` to `double` 
is not allowed. The rules were originally written for the Dataset encoder. As far 
as I know, no mainstream DBMS uses this policy by default.
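The policy is a session configuration, so switching it looks roughly like this 
(illustrative only):

    // Scala / spark-shell
    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")   // or "LEGACY" / "STRICT"
    // or equivalently in SQL:
    //   SET spark.sql.storeAssignmentPolicy=ANSI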

Currently, V1 data sources use the "Legacy" policy by default, while V2 uses 
"Strict". This proposal is to use the "ANSI" policy by default for both V1 and V2 
in Spark 3.0.

There was also a DISCUSS thread "Follow ANSI SQL on table insertion" in the dev 
mailing list.

This vote is open until next Thursday (Sept. 12th).

[ ] +1: Accept the proposal
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thank you!

Gengliang
