From my perspective, +1 on the proposal, considering ANSI and other DBMSes in general.

On Tue, Jul 30, 2019 at 3:21 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

> We can add a config for a certain behavior if it makes sense, but the most important thing we want to reach an agreement on here is: what should be the default behavior?
>
> Let's explore the solution space of table insertion behavior first.
> At compile time:
> 1. always add cast
> 2. add cast following the ANSI SQL store assignment rule (e.g. string to int is forbidden but long to int is allowed)
> 3. only add cast if it's 100% safe
> At runtime:
> 1. return null for invalid operations
> 2. throw exceptions at runtime for invalid operations
>
> The standards to evaluate a solution:
> 1. How robust the query execution is. For example, users usually don't want to see the query fail midway.
> 2. How tolerant it is to user queries. For example, a user may want to write long values to an int column because he knows all the long values won't exceed the int range.
> 3. How clean the result is. For example, users usually don't want to see silently corrupted data (null values).
>
> The current Spark behavior for Data Source V1 tables: always add cast and return null for invalid operations. This maximizes standards 1 and 2, but the result is the least clean and users are very likely to see silently corrupted data (null values).
>
> The current Spark behavior for Data Source V2 tables (new in Spark 3.0): only add cast if it's 100% safe. This maximizes standards 1 and 3, but many queries may fail to compile, even if those queries can run on other SQL systems. Note that people can still see silently corrupted data, because cast is not the only thing that can return corrupted data. Simple operations like ADD can also return corrupted data if overflow happens, e.g. INSERT INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2.
>
> The proposal here: add cast following the ANSI SQL store assignment rule, and return null for invalid operations. This maximizes standard 1, and also fits standard 2 well: if a query can't compile in Spark, it usually can't compile in other mainstream databases either. I think that's tolerant enough. For standard 3, this proposal doesn't maximize it, but it can already avoid many invalid operations.
>
> Technically we can't make the result 100% clean at compile time; we have to handle things like overflow at runtime. I think the new proposal makes more sense as the default behavior.
>
> On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> I understand Spark is making the decisions; I'm saying the actual final effect of the null decision would be different depending on the insertion target, if the target has different behaviors for null.
>>
>> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> > I'm a big -1 on null values for invalid casts.
>>>
>>> This is why we want to introduce the ANSI mode, so that an invalid cast fails at runtime. But we have to keep the null behavior for a while, to keep backward compatibility. Spark has returned null for invalid casts since the first day of Spark SQL; we can't just change it without a way to restore the old behavior.
>>>
>>> I'm OK with adding a strict mode for the upcast behavior in table insertion, but I don't agree with making it the default. The default behavior should be either the ANSI SQL behavior or the legacy Spark behavior.
>>>
>>> > other modes should be allowed only with a strict warning that the behavior will be determined by the underlying sink.
>>>
>>> It seems there is some misunderstanding. The table insertion behavior is fully controlled by Spark. Spark decides when to add a cast, and Spark decides whether an invalid cast should return null or fail. The sink is only responsible for writing data, not the type coercion/cast stuff.
>>>
>>> On Sun, Jul 28, 2019 at 12:24 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>>> I'm a big -1 on null values for invalid casts. This can lead to a lot of even more unexpected errors and runtime behavior, since null is
>>>>
>>>> 1. Not allowed in all schemas (leading to a runtime error anyway)
>>>> 2. The same as delete in some systems (leading to data loss)
>>>>
>>>> And this would be dependent on the sink being used. Spark won't just be interacting with ANSI-compliant sinks, so I think it makes much more sense to be strict. I think upcast mode is a sensible default, and other modes should be allowed only with a strict warning that the behavior will be determined by the underlying sink.
>>>>
>>>> On Sat, Jul 27, 2019 at 8:05 AM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>>
>>>>> Hi, all
>>>>>
>>>>> +1 for implementing this new store cast mode.
>>>>> From a viewpoint of DBMS users, this cast is pretty common for INSERTs, and I think this functionality could promote migrations from existing DBMSs to Spark.
>>>>>
>>>>> The most important thing for DBMS users is that they can optionally choose this mode when inserting data. Therefore, I think it might be okay for the two modes (the current upcast mode and the proposed store cast mode) to co-exist for INSERTs. (There is room to discuss which mode is enabled by default, though...)
>>>>>
>>>>> IMHO we'll provide the three behaviours below for INSERTs:
>>>>> - upcast mode
>>>>> - ANSI store cast mode, with runtime exceptions thrown for invalid values
>>>>> - ANSI store cast mode, with null filled in for invalid values
>>>>>
>>>>> On Sat, Jul 27, 2019 at 8:03 PM Gengliang Wang <gengliang.w...@databricks.com> wrote:
>>>>>
>>>>>> Hi Ryan,
>>>>>>
>>>>>> Thanks for the suggestions on the proposal and doc.
>>>>>> Currently, there is no data type validation in table insertion for V1. We are on the same page that we should improve it, but using UpCast goes from one extreme to another. It is possible that many queries would be broken after upgrading to Spark 3.0. The rules of UpCast are too strict. E.g. it doesn't allow assigning Timestamp type to Date type, as there would be "precision loss". To me, the type coercion is reasonable and the "precision loss" is expected. This is very common in other SQL engines. As long as Spark follows the ANSI SQL store assignment rules, it is users' responsibility to take good care of the type coercion in data writing. I think it's the right decision.
>>>>>>
>>>>>> > But the new behavior is only applied in DataSourceV2, so it won’t affect existing jobs until sources move to v2 and break other behavior anyway.
>>>>>> Eventually, most sources are supposed to be migrated to Data Source V2. I think we can discuss and make a decision now.
>>>>>>
>>>>>> > Fixing the silent corruption by adding a runtime exception is not a good option, either.
>>>>>> The new optional mode proposed in https://issues.apache.org/jira/browse/SPARK-28512 is disabled by default. This should be fine.
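
A quick Spark SQL sketch of how the three compile-time policies compared above treat the same two inserts; the table and column names are invented, and the per-policy outcomes in the comments simply paraphrase the behaviors described in this thread:

    -- Hypothetical table, for illustration only.
    CREATE TABLE target (intCol INT) USING parquet;

    -- long value into an int column
    INSERT INTO target SELECT CAST(1 AS BIGINT);
    --   always add cast (legacy V1):           accepted
    --   ANSI store assignment (the proposal):  accepted (long to int is allowed)
    --   upcast only (current DSv2):            rejected at analysis time

    -- string value into an int column
    INSERT INTO target SELECT 'oops';
    --   always add cast (legacy V1):           accepted, invalid values become null
    --   ANSI store assignment (the proposal):  rejected at analysis time (string to int is forbidden)
    --   upcast only (current DSv2):            rejected at analysis time
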
>>>>>>
>>>>>> On Sat, Jul 27, 2019 at 10:23 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>>> I don't agree with handling literal values specially. Although Postgres does it, I can't find anything about it in the SQL standard. And it introduces inconsistent behaviors which may be strange to users:
>>>>>>> * What about something like "INSERT INTO t SELECT float_col + 1.1"?
>>>>>>> * The same insert with a decimal column as input will fail even when a decimal literal would succeed.
>>>>>>> * Similar insert queries with "literal" inputs can be constructed through layers of indirection via views, inline views, CTEs, unions, etc. Would those decimals be treated as columns and fail, or would we attempt to make them succeed as well? Would users find this behavior surprising?
>>>>>>>
>>>>>>> Silently corrupted data is bad, but this is the decision we made at the beginning when designing Spark's behavior. Whenever an error occurs, Spark attempts to return null instead of throwing a runtime exception. Recently we provided configs to make Spark fail at runtime for overflow, but that's another story. Silently corrupted data is bad, runtime exceptions are bad, and forbidding all the table insertions that may fail (even with very little possibility) is also bad. We have to make trade-offs. The trade-offs we made in this proposal are:
>>>>>>> * forbid table insertions that are very likely to fail, at compile time (things like writing string values to an int column)
>>>>>>> * allow table insertions that are not that likely to fail; if the data is wrong, don't fail, insert null
>>>>>>> * provide a config to fail the insertion at runtime if the data is wrong
>>>>>>>
>>>>>>> > But the new behavior is only applied in DataSourceV2, so it won’t affect existing jobs until sources move to v2 and break other behavior anyway.
>>>>>>> When users write SQL queries, they don't care whether a table is backed by Data Source V1 or V2. We should make sure the table insertion behavior is consistent and reasonable. Furthermore, users may not even care whether the SQL queries run in Spark or another RDBMS; it's better to follow the SQL standard than to introduce a Spark-specific behavior.
>>>>>>>
>>>>>>> We are not talking about a small use case like allowing decimal literals to be written to a float column; we are talking about the big goal of making Spark compliant with the SQL standard, w.r.t. https://issues.apache.org/jira/browse/SPARK-26217 . This proposal is a sub-task of it, to make the table insertion behavior follow the SQL standard.
>>>>>>>
>>>>>>> On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>>> I don’t think this is a good idea. Following the ANSI standard is usually fine, but here it would *silently corrupt data*.
>>>>>>>>
>>>>>>>> From your proposal doc, ANSI allows implicitly casting from long to int (any numeric type to any other numeric type) and inserts NULL when a value overflows. That would drop data values and is not safe.
>>>>>>>>
>>>>>>>> Fixing the silent corruption by adding a runtime exception is not a good option, either. That puts off the problem until much of the job has completed, instead of catching the error at analysis time. It is better to catch this earlier during analysis than to run most of a job and then fail.
>>>>>>>>
>>>>>>>> In addition, part of the justification for using the ANSI standard is to avoid breaking existing jobs. But the new behavior is only applied in DataSourceV2, so it won’t affect existing jobs until sources move to v2 and break other behavior anyway.
>>>>>>>>
>>>>>>>> I think that the correct solution is to go with the existing validation rules that require explicit casts to truncate values.
>>>>>>>>
>>>>>>>> That still leaves the use case that motivated this proposal, which is that floating point literals are parsed as decimals and fail simple insert statements. We already came up with two alternatives to fix that problem in the DSv2 sync, and I think it is a better idea to go with one of those instead of “fixing” Spark in a way that will corrupt data or cause runtime failures.
>>>>>>>>
>>>>>>>> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I have heard many complaints about the old table insertion behavior. Blindly casting everything leaks user mistakes to a late stage of the data pipeline and makes them very hard to debug. When a user writes string values to an int column, it's probably a mistake, e.g. the columns are misordered in the INSERT statement. We should fail the query earlier and ask users to fix the mistake.
>>>>>>>>>
>>>>>>>>> In the meanwhile, I agree that the new table insertion behavior we introduced for Data Source V2 is too strict. It may fail valid queries unexpectedly.
>>>>>>>>>
>>>>>>>>> In general, I support the direction of following the ANSI SQL standard, but I'd like to do it in 2 steps:
>>>>>>>>> 1. only add cast when the assignment rule is satisfied. This should be the default behavior, and we should provide a legacy config to restore the old behavior.
>>>>>>>>> 2. fail the cast operation at runtime if overflow happens. AFAIK Marco Gaido is working on it already. This will have a config as well, and by default we still return null.
>>>>>>>>>
>>>>>>>>> After doing this, the default behavior will be slightly different from the SQL standard (cast can return null), and users can turn on the ANSI mode to fully follow the SQL standard. This is much better than before and should prevent a lot of user mistakes. It's also a reasonable choice to me not to throw exceptions at runtime by default, as that's usually bad for long-running jobs.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Wenchen
>>>>>>>>>
>>>>>>>>> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <gengliang.w...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I would like to discuss the table insertion behavior of Spark. In the current Data Source V2, only UpCast is allowed for table insertion. I think following ANSI SQL is a better idea.
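
A minimal sketch of the use case that motivated this proposal, mentioned by Ryan above: a floating point literal is parsed as an exact decimal, so upcast-only validation rejects even a trivial insert into a double column. The table name is invented, and the outcomes in the comments paraphrase what the thread describes:

    CREATE TABLE points (x DOUBLE) USING parquet;

    -- 1.1 is parsed as an exact decimal literal, not a double
    INSERT INTO points VALUES (1.1);
    --   upcast only (current DSv2):            rejected, decimal to double is not a safe upcast
    --   ANSI store assignment (the proposal):  accepted (numeric to numeric is allowed)
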
>>>>>>>>>> For more information, please read Discuss: Follow ANSI SQL on table insertion
>>>>>>>>>> <https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing>.
>>>>>>>>>> Please let me know if you have any thoughts on this.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Gengliang
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro
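
A minimal sketch of the runtime axis discussed in the thread: today an invalid cast yields null and integer arithmetic silently overflows, while the proposed opt-in runtime-exception (ANSI) mode would make the same expressions fail instead. The exact config names were still under discussion at the time and are deliberately not shown:

    -- invalid cast: a string that is not a number
    SELECT CAST('abc' AS INT);
    --   default behavior: returns NULL (silently corrupted data)
    --   proposed runtime-exception mode: the query fails at runtime

    -- overflow in plain arithmetic, per Wenchen's ADD example
    SELECT 2147483647 + 1;
    --   default behavior: the int addition overflows and wraps around
    --   with the overflow check mentioned above enabled: the query fails at runtime
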