From my perspective, +1 on the proposal, considering ANSI and other DBMSes in general.

On Tue, Jul 30, 2019 at 3:21 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

> We can add a config for a certain behavior if it makes sense, but the most important thing we want to reach an agreement on here is: what should be the default behavior?
>
> Let's explore the solution space of table insertion behavior first.
> At compile time:
> 1. always add cast
> 2. add cast following the ANSI SQL store assignment rule (e.g. string to int is forbidden but long to int is allowed)
> 3. only add cast if it's 100% safe
> At runtime:
> 1. return null for invalid operations
> 2. throw exceptions at runtime for invalid operations
>
> The standards to evaluate a solution:
> 1. How robust the query execution is. For example, users usually don't want to see the query fail midway.
> 2. How tolerant it is to user queries. For example, a user may want to write long values to an int column because he knows all the long values won't exceed the int range.
> 3. How clean the result is. For example, users usually don't want to see silently corrupted data (null values).
>
> The current Spark behavior for Data Source V1 tables: always add cast and return null for invalid operations. This maximizes standards 1 and 2, but the result is the least clean and users are very likely to see silently corrupted data (null values).
>
> The current Spark behavior for Data Source V2 tables (new in Spark 3.0): only add cast if it's 100% safe. This maximizes standards 1 and 3, but many queries may fail to compile, even if those queries can run on other SQL systems. Note that people can still see silently corrupted data, because cast is not the only thing that can return corrupted data. Simple operations like ADD can also return corrupted data if overflow happens, e.g. INSERT INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2.
>
> The proposal here: add cast following the ANSI SQL store assignment rule, and return null for invalid operations. This maximizes standard 1, and also fits standard 2 well: if a query can't compile in Spark, it usually can't compile in other mainstream databases either. I think that's tolerant enough. For standard 3, this proposal doesn't maximize it, but it can already avoid many invalid operations.
>
> Technically we can't make the result 100% clean at compile time; we have to handle things like overflow at runtime. I think the new proposal makes more sense as the default behavior.
>
> On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> I understand Spark is making the decisions; I'm saying the actual final effect of the null decision would be different depending on the insertion target, if the target has different behaviors for null.
>>
>> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> > I'm a big -1 on null values for invalid casts.
>>>
>>> This is why we want to introduce the ANSI mode, so that an invalid cast fails at runtime. But we have to keep the null behavior for a while, to keep backward compatibility. Spark has returned null for invalid casts since the first day of Spark SQL; we can't just change it without a way to restore the old behavior.
>>>
>>> I'm OK with adding a strict mode for the upcast behavior in table insertion, but I don't agree with making it the default. The default behavior should be either the ANSI SQL behavior or the legacy Spark behavior.
>>>
>>> > other modes should be allowed only with a strict warning that the behavior will be determined by the underlying sink.
>>>
>>> It seems there is some misunderstanding. The table insertion behavior is fully controlled by Spark. Spark decides when to add a cast, and Spark decides whether an invalid cast should return null or fail. The sink is only responsible for writing data, not the type coercion/cast stuff.
>>>
>>> On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> I'm a big -1 on null values for invalid casts. This can lead to a lot of even more unexpected errors and runtime behavior, since null is
>>>>
>>>> 1. Not allowed in all schemas (leading to a runtime error anyway)
>>>> 2. The same as delete in some systems (leading to data loss)
>>>>
>>>> And this would be dependent on the sink being used. Spark won't just be interacting with ANSI-compliant sinks, so I think it makes much more sense to be strict. I think upcast mode is a sensible default, and other modes should be allowed only with a strict warning that the behavior will be determined by the underlying sink.
>>>>
>>>> On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>
>>>>> Hi, all
>>>>>
>>>>> +1 for implementing this new store cast mode.
>>>>> From a viewpoint of DBMS users, this cast is pretty common for INSERTs, and I think this functionality could promote migrations from existing DBMSs to Spark.
>>>>>
>>>>> The most important thing for DBMS users is that they can optionally choose this mode when inserting data. Therefore, I think it might be okay for the two modes (the current upcast mode and the proposed store cast mode) to co-exist for INSERTs. (There is room to discuss which mode is enabled by default, though...)
>>>>>
>>>>> IMHO we'll provide the three behaviours below for INSERTs:
>>>>> - upcast mode
>>>>> - ANSI store cast mode, with runtime exceptions thrown for invalid values
>>>>> - ANSI store cast mode, with null filled in for invalid values
>>>>>
>>>>> On Sat, Jul 27, 2019 at 8:03 PM Gengliang Wang <gengliang.w...@databricks.com> wrote:
>>>>>
>>>>>> Hi Ryan,
>>>>>>
>>>>>> Thanks for the suggestions on the proposal and doc.
>>>>>> Currently, there is no data type validation in table insertion for V1. We are on the same page that we should improve it, but using UpCast goes from one extreme to another. It is possible that many queries would be broken after upgrading to Spark 3.0. The rules of UpCast are too strict. E.g. it doesn't allow assigning Timestamp type to Date type, as there would be "precision loss". To me, the type coercion is reasonable and the "precision loss" is expected. This is very common in other SQL engines. As long as Spark follows the ANSI SQL store assignment rules, it is users' responsibility to take good care of the type coercion in data writing. I think it's the right decision.
>>>>>>
>>>>>> > But the new behavior is only applied in DataSourceV2, so it won’t affect existing jobs until sources move to v2 and break other behavior anyway.
>>>>>> Eventually, most sources are supposed to be migrated to Data Source V2. I think we can discuss and make a decision now.
>>>>>>
>>>>>> > Fixing the silent corruption by adding a runtime exception is not a good option, either.
>>>>>> The new optional mode proposed in https://issues.apache.org/jira/browse/SPARK-28512 is disabled by default. This should be fine.
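
A quick Spark SQL sketch of how the three compile-time policies compared above treat the same two inserts; the table and column names are invented, and the per-policy outcomes in the comments simply paraphrase the behaviors described in this thread:

    -- Hypothetical table, for illustration only.
    CREATE TABLE target (intCol INT) USING parquet;

    -- long value into an int column
    INSERT INTO target SELECT CAST(1 AS BIGINT);
    --   always add cast (legacy V1):           accepted
    --   ANSI store assignment (the proposal):  accepted (long to int is allowed)
    --   upcast only (current DSv2):            rejected at analysis time

    -- string value into an int column
    INSERT INTO target SELECT 'oops';
    --   always add cast (legacy V1):           accepted, invalid values become null
    --   ANSI store assignment (the proposal):  rejected at analysis time (string to int is forbidden)
    --   upcast only (current DSv2):            rejected at analysis time
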
>>>>>>
>>>>>> On Sat, Jul 27, 2019 at 10:23 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>>> I don't agree with handling literal values specially. Although Postgres does it, I can't find anything about it in the SQL standard. And it introduces inconsistent behaviors which may be strange to users:
>>>>>>> * What about something like "INSERT INTO t SELECT float_col + 1.1"?
>>>>>>> * The same insert with a decimal column as input will fail even when a decimal literal would succeed.
>>>>>>> * Similar insert queries with "literal" inputs can be constructed through layers of indirection via views, inline views, CTEs, unions, etc. Would those decimals be treated as columns and fail, or would we attempt to make them succeed as well? Would users find this behavior surprising?
>>>>>>>
>>>>>>> Silently corrupted data is bad, but this is the decision we made at the beginning when designing Spark's behavior. Whenever an error occurs, Spark attempts to return null instead of throwing a runtime exception. Recently we provided configs to make Spark fail at runtime for overflow, but that's another story. Silently corrupted data is bad, runtime exceptions are bad, and forbidding all the table insertions that may fail (even with very little possibility) is also bad. We have to make trade-offs. The trade-offs we made in this proposal are:
>>>>>>> * forbid table insertions that are very likely to fail, at compile time (things like writing string values to an int column)
>>>>>>> * allow table insertions that are not that likely to fail; if the data is wrong, don't fail, insert null
>>>>>>> * provide a config to fail the insertion at runtime if the data is wrong
>>>>>>>
>>>>>>> > But the new behavior is only applied in DataSourceV2, so it won’t affect existing jobs until sources move to v2 and break other behavior anyway.
>>>>>>> When users write SQL queries, they don't care whether a table is backed by Data Source V1 or V2. We should make sure the table insertion behavior is consistent and reasonable. Furthermore, users may not even care whether the SQL queries run in Spark or another RDBMS; it's better to follow the SQL standard than to introduce a Spark-specific behavior.
>>>>>>>
>>>>>>> We are not talking about a small use case like allowing decimal literals to be written to a float column; we are talking about the big goal of making Spark compliant with the SQL standard, w.r.t. https://issues.apache.org/jira/browse/SPARK-26217 . This proposal is a sub-task of it, to make the table insertion behavior follow the SQL standard.
>>>>>>>
>>>>>>> On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>>> I don’t think this is a good idea. Following the ANSI standard is usually fine, but here it would *silently corrupt data*.
>>>>>>>>
>>>>>>>> From your proposal doc, ANSI allows implicitly casting from long to int (any numeric type to any other numeric type) and inserts NULL when a value overflows. That would drop data values and is not safe.
>>>>>>>>
>>>>>>>> Fixing the silent corruption by adding a runtime exception is not a good option, either. That puts off the problem until much of the job has completed, instead of catching the error at analysis time. It is better to catch this earlier during analysis than to run most of a job and then fail.
>>>>>>>>
>>>>>>>> In addition, part of the justification for using the ANSI standard is to avoid breaking existing jobs. But the new behavior is only applied in DataSourceV2, so it won’t affect existing jobs until sources move to v2 and break other behavior anyway.
>>>>>>>>
>>>>>>>> I think that the correct solution is to go with the existing validation rules that require explicit casts to truncate values.
>>>>>>>>
>>>>>>>> That still leaves the use case that motivated this proposal, which is that floating point literals are parsed as decimals and fail simple insert statements. We already came up with two alternatives to fix that problem in the DSv2 sync, and I think it is a better idea to go with one of those instead of “fixing” Spark in a way that will corrupt data or cause runtime failures.
>>>>>>>>
>>>>>>>> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I have heard many complaints about the old table insertion behavior. Blindly casting everything leaks user mistakes to a late stage of the data pipeline and makes them very hard to debug. When a user writes string values to an int column, it's probably a mistake, e.g. the columns are misordered in the INSERT statement. We should fail the query earlier and ask users to fix the mistake.
>>>>>>>>>
>>>>>>>>> In the meanwhile, I agree that the new table insertion behavior we introduced for Data Source V2 is too strict. It may fail valid queries unexpectedly.
>>>>>>>>>
>>>>>>>>> In general, I support the direction of following the ANSI SQL standard, but I'd like to do it in 2 steps:
>>>>>>>>> 1. only add cast when the assignment rule is satisfied. This should be the default behavior, and we should provide a legacy config to restore the old behavior.
>>>>>>>>> 2. fail the cast operation at runtime if overflow happens. AFAIK Marco Gaido is working on it already. This will have a config as well, and by default we still return null.
>>>>>>>>>
>>>>>>>>> After doing this, the default behavior will be slightly different from the SQL standard (cast can return null), and users can turn on the ANSI mode to fully follow the SQL standard. This is much better than before and should prevent a lot of user mistakes. It's also a reasonable choice to me not to throw exceptions at runtime by default, as that's usually bad for long-running jobs.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Wenchen
>>>>>>>>>
>>>>>>>>> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <gengliang.w...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I would like to discuss the table insertion behavior of Spark. In the current Data Source V2, only UpCast is allowed for table insertion. I think following ANSI SQL is a better idea.
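
A minimal sketch of the use case that motivated this proposal, mentioned by Ryan above: a floating point literal is parsed as an exact decimal, so upcast-only validation rejects even a trivial insert into a double column. The table name is invented, and the outcomes in the comments paraphrase what the thread describes:

    CREATE TABLE points (x DOUBLE) USING parquet;

    -- 1.1 is parsed as an exact decimal literal, not a double
    INSERT INTO points VALUES (1.1);
    --   upcast only (current DSv2):            rejected, decimal to double is not a safe upcast
    --   ANSI store assignment (the proposal):  accepted (numeric to numeric is allowed)
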
>>>>>>>>>> For more information, please read Discuss: Follow ANSI SQL on table insertion
>>>>>>>>>> <https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing>.
>>>>>>>>>> Please let me know if you have any thoughts on this.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Gengliang
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro
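
A minimal sketch of the runtime axis discussed in the thread: today an invalid cast yields null and integer arithmetic silently overflows, while the proposed opt-in runtime-exception (ANSI) mode would make the same expressions fail instead. The exact config names were still under discussion at the time and are deliberately not shown:

    -- invalid cast: a string that is not a number
    SELECT CAST('abc' AS INT);
    --   default behavior: returns NULL (silently corrupted data)
    --   proposed runtime-exception mode: the query fails at runtime

    -- overflow in plain arithmetic, per Wenchen's ADD example
    SELECT 2147483647 + 1;
    --   default behavior: the int addition overflows and wraps around
    --   with the overflow check mentioned above enabled: the query fails at runtime
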