Hi, all

+1 for implementing this new store cast mode. From the viewpoint of DBMS users, this kind of cast is pretty common for INSERTs, and I think this functionality could promote migrations from existing DBMSs to Spark.

The most important thing for DBMS users is that they can optionally choose this mode when inserting data. Therefore, I think it might be okay for the two modes (the current upcast mode and the proposed store cast mode) to co-exist for INSERTs. (There is room to discuss which mode is enabled by default, though...) IMHO we would provide the three behaviours below for INSERTs:
- upcast mode
- ANSI store cast mode, with runtime exceptions thrown for invalid values
- ANSI store cast mode, with null filled in for invalid values
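As a rough illustration of how these three behaviours would differ for the same statement (the table and column names below are made up purely for illustration), consider inserting an out-of-range value into an INT column:

    -- assume: CREATE TABLE t (i INT); src.b is BIGINT and holds 3000000000
    INSERT INTO t SELECT b FROM src;

    -- upcast mode:                          rejected at analysis time (BIGINT cannot be up-cast to INT)
    -- ANSI store cast + runtime exception:  the query runs, but fails while writing when the value overflows
    -- ANSI store cast + null:               the query runs, and the overflowing value is stored as NULL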
On Sat, Jul 27, 2019 at 8:03 PM Gengliang Wang <gengliang.w...@databricks.com> wrote:

> Hi Ryan,
>
> Thanks for the suggestions on the proposal and doc.
> Currently, there is no data type validation in V1 table insertion. We are on the same page that we should improve it. But using UpCast goes from one extreme to the other: it is possible that many queries would break after upgrading to Spark 3.0. The rules of UpCast are too strict. E.g. it doesn't allow assigning Timestamp type to Date type, as there would be "precision loss". To me, the type coercion is reasonable and the "precision loss" is expected. This is very common in other SQL engines.
> As long as Spark follows the ANSI SQL store assignment rules, it is the users' responsibility to take good care of the type coercion in data writing. I think it's the right decision.
>
> > But the new behavior is only applied in DataSourceV2, so it won’t affect existing jobs until sources move to v2 and break other behavior anyway.
> Eventually, most sources are supposed to be migrated to DataSourceV2. I think we can discuss and make a decision now.
>
> > Fixing the silent corruption by adding a runtime exception is not a good option, either.
> The new optional mode proposed in https://issues.apache.org/jira/browse/SPARK-28512 is disabled by default. This should be fine.
>
> On Sat, Jul 27, 2019 at 10:23 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> I don't agree with handling literal values specially. Although Postgres does it, I can't find anything about it in the SQL standard, and it introduces inconsistent behaviors which may be strange to users:
>> * What about something like "INSERT INTO t SELECT float_col + 1.1"?
>> * The same insert with a decimal column as input will fail even when a decimal literal would succeed.
>> * Similar insert queries with "literal" inputs can be constructed through layers of indirection via views, inline views, CTEs, unions, etc. Would those decimals be treated as columns and fail, or would we attempt to make them succeed as well? Would users find this behavior surprising?
>>
>> Silently corrupting data is bad, but this is the decision we made at the beginning when designing Spark's behaviors: whenever an error occurs, Spark attempts to return null instead of throwing a runtime exception. Recently we provided configs to make Spark fail at runtime on overflow, but that's another story. Silently corrupting data is bad, runtime exceptions are bad, and forbidding all the table insertions that may fail (even with very little probability) is also bad. We have to make trade-offs. The trade-offs we made in this proposal are:
>> * forbid, at compile time, table insertions that are very likely to fail (things like writing string values to an int column);
>> * allow table insertions that are not that likely to fail; if the data is wrong, don't fail, insert null;
>> * provide a config to fail the insertion at runtime if the data is wrong.
>>
>> > But the new behavior is only applied in DataSourceV2, so it won’t affect existing jobs until sources move to v2 and break other behavior anyway.
>> When users write SQL queries, they don't care whether a table is backed by Data Source V1 or V2. We should make sure the table insertion behavior is consistent and reasonable. Furthermore, users may not even care whether the SQL queries run in Spark or another RDBMS; it's better to follow the SQL standard than to introduce Spark-specific behavior.
>>
>> We are not talking about a small use case like allowing decimal literals to be written to a float column; we are talking about the bigger goal of making Spark compliant with the SQL standard, w.r.t. https://issues.apache.org/jira/browse/SPARK-26217 . This proposal is a sub-task of it, to make the table insertion behavior follow the SQL standard.
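To make the literal discussion above concrete, the statements at stake look roughly like the following (the schema is made up for illustration):

    -- assume: CREATE TABLE t (f FLOAT)

    INSERT INTO t VALUES (1.1);
    -- 1.1 is parsed as a DECIMAL literal, and DECIMAL -> FLOAT is not an up-cast,
    -- so even this simple statement is rejected under the current upcast rule.

    INSERT INTO t SELECT float_col + 1.1 FROM s;
    -- here the input is an expression rather than a bare literal, so special-casing
    -- literals alone would not make this case succeed.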
>> On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> I don’t think this is a good idea. Following the ANSI standard is usually fine, but here it would *silently corrupt data*.
>>>
>>> From your proposal doc, ANSI allows implicitly casting from long to int (any numeric type to any other numeric type) and inserts NULL when a value overflows. That would drop data values and is not safe.
>>>
>>> Fixing the silent corruption by adding a runtime exception is not a good option, either. That puts off the problem until much of the job has completed, instead of catching the error at analysis time. It is better to catch this earlier during analysis than to run most of a job and then fail.
>>>
>>> In addition, part of the justification for using the ANSI standard is to avoid breaking existing jobs. But the new behavior is only applied in DataSourceV2, so it won’t affect existing jobs until sources move to v2 and break other behavior anyway.
>>>
>>> I think that the correct solution is to go with the existing validation rules that require explicit casts to truncate values.
>>>
>>> That still leaves the use case that motivated this proposal, which is that floating point literals are parsed as decimals and fail simple insert statements. We already came up with two alternatives to fix that problem in the DSv2 sync, and I think it is a better idea to go with one of those instead of “fixing” Spark in a way that will corrupt data or cause runtime failures.
>>>
>>> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> I have heard many complaints about the old table insertion behavior. Blindly casting everything leaks user mistakes into a late stage of the data pipeline and makes them very hard to debug. When a user writes string values to an int column, it's probably a mistake, e.g. the columns are misordered in the INSERT statement. We should fail the query earlier and ask users to fix the mistake.
>>>>
>>>> In the meantime, I agree that the new table insertion behavior we introduced for Data Source V2 is too strict. It may fail valid queries unexpectedly.
>>>>
>>>> In general, I support the direction of following the ANSI SQL standard, but I'd like to do it in 2 steps:
>>>> 1. only add the cast when the assignment rule is satisfied. This should be the default behavior, and we should provide a legacy config to restore the old behavior.
>>>> 2. fail the cast operation at runtime if overflow happens. AFAIK Marco Gaido is working on it already. This will have a config as well, and by default we still return null.
>>>>
>>>> After doing this, the default behavior will be slightly different from the SQL standard (cast can return null), and users can turn on the ANSI mode to fully follow the SQL standard. This is much better than before and should prevent a lot of user mistakes. It's also a reasonable choice to me not to throw exceptions at runtime by default, as that's usually bad for long-running jobs.
>>>>
>>>> Thanks,
>>>> Wenchen
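A small sketch of the difference between the old V1 behaviour and step 1 above, using the string-to-int example from the thread (table name is illustrative):

    -- assume: CREATE TABLE t (i INT)

    -- old V1 behaviour: the value is cast blindly and the mistake surfaces much later
    INSERT INTO t VALUES ('abc');   -- accepted; the row is written with i = NULL

    -- step 1 (ANSI store assignment rules): string -> int is not an allowed assignment
    INSERT INTO t VALUES ('abc');   -- rejected at analysis time

    -- step 2 only concerns casts that are allowed but can overflow at runtime;
    -- by default they still return null, and a config can make them fail instead.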
>>>> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <gengliang.w...@databricks.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I would like to discuss the table insertion behavior of Spark. In the current Data Source V2, only UpCast is allowed for table insertion. I think following ANSI SQL is a better idea.
>>>>> For more information, please read Discuss: Follow ANSI SQL on table insertion <https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing>
>>>>> Please let me know if you have any thoughts on this.
>>>>>
>>>>> Regards,
>>>>> Gengliang
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>

--
---
Takeshi Yamamuro