> 3) Ease the management overhead via upgrade story and layered configuration. Maybe even with unified failure management.
I'm not sure I get how this is supposed to work. Can you go a bit more in the details? On Thu, Dec 9, 2021 at 3:39 PM Timo Walther <twal...@apache.org> wrote: > Hi everyone, > > thanks for the healthy discussion. Let us find a compromise to make all > parties happy. > > > NULL in SQL essentially means "UNKNOWN" > > This argument is true, but as a user I would like to know the `Why?` a > value is UNKNOWN. I could imagine users have spent hours on finding the > root cause of a NULL value in their pipeline. This might not always be > reported in JIRA afterwards. > > > I'm ok to change the behavior, but just not now > > As Marios mentioned, we have introduced the flag for both new and legacy > behavior. I would suggest we wait a couple of releases before enabling > it. But not necessarily until Flink 2.0. > > I'm sure we will have more of these cases in the future as many built-in > functions have not been properly tested or verified against other > vendors. Nobody has reserved the time yet to look at the built-in > functions as a whole. We have very specialized ones whereas common > functions are missing. > > > a group of people managing some streaming platform > > Esp. streaming platforms should be able to have a way of setting a > config option globally for all existing pipelines. What we should fix in > the near future is that a flink-conf.yaml option can be propagated into > the code generation. We currently don't have a complete layered > configuration story that would allow such use case. > > Also the upgrade story (FLIP-190) will help us as functions are > versioned in the compiled plan. We can introduce a new CAST function > version. However, personally I would have liked to solve this before, as > CAST is a fundamental operation. > > How about the following steps? > > 1) Introduce CAST with option and TRY_CAST > 2) Add a warning to the documentation to let users prepare > 3) Ease the management overhead via upgrade story and layered > configuration. Maybe even with unified failure management. > 4) Change the default value of the option to new behavior > 5) Drop option in 2.0 > > Regards, > Timo > > > > On 01.12.21 17:13, Marios Trivyzas wrote: > > FYI: Opened https://github.com/apache/flink/pull/17985 which will > introduce > > the config option, > > so we can continue working on the CAST fixes and improvements. It will be > > very easy to flip > > the default behaviour (currently on the PR: legacy = ENABLED) when this > > discussion concludes, > > and update the documentation accordingly. > > > > On Mon, Nov 29, 2021 at 10:37 AM Marios Trivyzas <mat...@gmail.com> > wrote: > > > >> I definitely understand the argument to continue supporting the existing > >> (return null) as the default behaviour. > >> I'd like to point out though that if we decide this (option no.2) it's > >> kind of unnatural, to directly drop the flag in *Flink 2.0* > >> for example, and force the users at that point to use either *CAST > *(throw > >> errors) or *TRY_CAST* (define a default return value). > >> > >> The decision for the default value of this flag, is a commitment, > because > >> In my opinion, changing this default value in the future > >> to throw errors instead, is not an option, as this will definitely > confuse > >> the users, so the next step (in future versions) would be to > >> completely drop the flag and have the users choosing between *CAST* and > >> *TRY_CAST.* > >> > >> Therefore, and speaking from a developing cycle perspective, my personal > >> preference is to go with option no.1 which is in line > >> with the usual approach (at least to my experience :)) in the open > source > >> software. > >> > >> Best regards, > >> Marios > >> > >> > >> On Tue, Nov 23, 2021 at 12:59 PM Martijn Visser <mart...@ververica.com> > >> wrote: > >> > >>> Hi all, > >>> > >>> My conclusion at this point is that there is consensus that the default > >>> behaviour of CAST should raise errors when it fails and that it should > be > >>> configurable via a configuration flag. > >>> > >>> The open discussion is on when we want to fix the current (incorrect) > >>> behaviour: > >>> > >>> 1. Doing it in the next Flink release (1.15) by setting the > configuration > >>> flag to fail by default > >>> 2. Keep the current (incorrect) behaviour in Flink 1.15 by setting the > >>> configuration flag to the current situation and only changing this > >>> if/whenever a Flink 2.0 version is released. > >>> > >>> From my perspective, I can understand that going for option 2 is a > >>> preferred option for those that are running large Flink setups with a > >>> great > >>> number of users. I am wondering if those platforms also have the > ability > >>> to > >>> set default values and/or override user configuration. That could be a > way > >>> to solve this issue for these platform teams. > >>> > >>> I would prefer to go for option 1, because I think correct execution of > >>> CAST is important, especially for new Flink users. These new users > should > >>> have a smooth user experience and shouldn't need to change > configuration > >>> flags to get correct behaviour. I do expect that users who have used > Flink > >>> before are more familiar with checking release notes and interpreting > how > >>> this potentially affects them. That's why we have release notes. I also > >>> doubt that we will have a Flink 2.0 release any time soon, meaning > that we > >>> are only going to make the pain even bigger for more users if we change > >>> this incorrect behaviour at a later time. > >>> > >>> Best regards, > >>> > >>> Martijn > >>> > >>> On Tue, 23 Nov 2021 at 02:10, Kurt Young <ykt...@gmail.com> wrote: > >>> > >>>>> This is what I don't really understand here: how adding a > >>> configuration > >>>> option causes issues here? > >>>> This is why: for most Flink production use cases I see, it's not like > a > >>>> couple of people manage ~5 Flink > >>>> jobs, so they can easily track all the big changes in every minor > Flink > >>>> version. Typically use case are like > >>>> a group of people managing some streaming platform, which will provide > >>>> Flink as an execution engine > >>>> to their users. Alibaba has more than 40K online streaming SQL jobs, > and > >>>> ByteDance also has a similar > >>>> number. Most of the time, whether upgrading Flink version will be > >>>> controlled by the user of the platform, > >>>> not the platform provider. The platform will most likely provide > >>> multiple > >>>> Flink version support. > >>>> > >>>> Even if you can count on the platform provider to read all the release > >>>> notes carefully, their users won't. So > >>>> we are kind of throw the responsibility to all the platform provider, > >>> make > >>>> them to take care of the semantic > >>>> changes. They have to find some good way to control the impactions > when > >>>> their users upgrade Flink's version. > >>>> And if they don't find a good solution around this, and their users > >>>> encounter some online issues, they will be > >>>> blamed. And you can guess who they would blame. > >>>> > >>>> Flink is a very popular engine now, every decision we make will affect > >>> the > >>>> users a lot. If you want them to make > >>>> some changes, I would argue we should make them think it's worth it. > >>>> > >>>> Best, > >>>> Kurt > >>>> > >>>> > >>>> On Mon, Nov 22, 2021 at 11:29 PM Francesco Guardiani < > >>>> france...@ververica.com> wrote: > >>>> > >>>>>> NULL in SQL essentially means "UNKNOWN", it's not as scary as a > >>> null in > >>>>> java which will easily cause a NPE or some random behavior with a c++ > >>>>> function call. > >>>>> > >>>>> This is true from the user point of view, except our runtime doesn't > >>>> treat > >>>>> null as some value where you can safely execute operations and get > >>> "noop" > >>>>> results. In our runtime null is Java's null, hence causing issues and > >>>>> generating NPEs here and there when nulls are not expected. > >>>>> > >>>>>> It will really create a big mess after users upgrade their SQL jobs > >>>>> > >>>>> This is what I don't really understand here: how adding a > >>> configuration > >>>>> option causes issues here? We make it very clear in our release notes > >>>> that > >>>>> you need to switch that flag if you're relying on this behavior and > >>>> that's > >>>>> it: if you reprocess jobs every time you upgrade, you just flip the > >>>> switch > >>>>> before reprocessing and you won't have any issues. If you don't > >>> because > >>>> you > >>>>> use the hybrid source, either you upgrade your query or you flip the > >>> flag > >>>>> and in both cases this shouldn't generate any issue. > >>>>> Since it's a big change, I also expect to keep this flag for some > >>>> releases, > >>>>> at least up to Flink 2. > >>>>> > >>>>> On Sat, Nov 20, 2021 at 7:25 AM Kurt Young <ykt...@gmail.com> wrote: > >>>>> > >>>>>> Hi Francesco, > >>>>>> > >>>>>> Thanks for sharing your opinion about this and examples with other > >>>>>> programming > >>>>>> languages. I just want to mention, that NULL in SQL world is a bit > >>>>>> different with the > >>>>>> meaning in programming languages like java. > >>>>>> > >>>>>> NULL in SQL essentially means "UNKNOWN", it's not as scary as a > >>> null in > >>>>>> java which > >>>>>> will easily cause a NPE or some random behavior with a c++ function > >>>> call. > >>>>>> UNKNOWN > >>>>>> means it could be any value. In java, the condition "null == null" > >>>> always > >>>>>> return true. But > >>>>>> in SQL, it returns NULL, which means UNKNOWN. > >>>>>> > >>>>>> Another example, if you run following statements: > >>>>>> select 'true' where 3 in (1, 2, 3, null) // this will print true > >>>>>> select 'true' where 3 not in (1, 2, null) // this won't print > >>> anything > >>>>>> > >>>>>> In summary, SQL's NULL is a bit different from others, it has its > >>> own > >>>>>> meaning. So I won't > >>>>>> compare the behavior of returning NULL with programming languages > >>> and > >>>>> then > >>>>>> judge it > >>>>>> as bad behavior. And it's not a very big deal if we return NULL when > >>>>> trying > >>>>>> to cast "abc" > >>>>>> to an integer, which means we don't know the correct value. > >>>>>> > >>>>>> But still, I'm ok to change the behavior, but just not now. It will > >>>>> really > >>>>>> create a big mess after > >>>>>> users upgrade their SQL jobs. I'm either fine to do it in some > >>> really > >>>> big > >>>>>> version change like > >>>>>> Flink 2.0, or we can do it after we have some universal error > >>> records > >>>>>> handling mechanism, so > >>>>>> in that way, users could have a chance to handle such a situation. > >>>>>> > >>>>>> Best, > >>>>>> Kurt > >>>>>> > >>>>>> > >>>>>> On Fri, Nov 19, 2021 at 7:29 PM Francesco Guardiani < > >>>>>> france...@ververica.com> > >>>>>> wrote: > >>>>>> > >>>>>>> Hi all, > >>>>>>> > >>>>>>> tl;dr: > >>>>>>> > >>>>>>> I think Timo pretty much said it all. As described in the issue, > >>> my > >>>>>>> proposal is: > >>>>>>> > >>>>>>> * Let's switch the default behavior of CAST to fail > >>>>>>> * Let's add TRY_CAST to have the old behavior > >>>>>>> * Let's add a rule (disabled by default) that wraps every CAST in > >>> a > >>>>> TRY, > >>>>>> in > >>>>>>> order to keep the old behavior. > >>>>>>> * Let's put a giant warning in the release notes explaining to > >>> enable > >>>>> the > >>>>>>> rule, in case you're depending on the old behavior > >>>>>>> > >>>>>>> This way, we break no SQL scripts, as you can apply this flag to > >>>> every > >>>>>>> previously running script. We can also think to another strategy, > >>>> more > >>>>>> than > >>>>>>> the planner rule, to keep the old behavior, always behind a flag > >>>>> disabled > >>>>>>> by default. > >>>>>>> > >>>>>>> Timing of this proposal is also crucial, since CAST is a basic > >>>>> primitive > >>>>>> of > >>>>>>> our language and, after we have the upgrade story in place, this > >>> is > >>>>> going > >>>>>>> to be a whole more harder to deal with. > >>>>>>> > >>>>>>> And I would say that in the next future, we should start thinking > >>> to > >>>>>>> support proper error handling strategies, that is: > >>>>>>> > >>>>>>> * How users are supposed to handle records that fails an > >>> expression > >>>>>>> computation, aggregation, etc? > >>>>>>> * Can we provide some default strategies, like log and discard, > >>> send > >>>>> to a > >>>>>>> dead letter queue? > >>>>>>> > >>>>>>> Now let me go a bit more deep in the reason we really need such > >>>> change: > >>>>>>> > >>>>>>> For me the issue is not really about being compliant with the SQL > >>>>>> standard > >>>>>>> or not, or the fact that other databases behaves differently from > >>> us, > >>>>> but > >>>>>>> the fact that the CAST function we have right now is effectively a > >>>>>> footgun > >>>>>>> <https://en.wiktionary.org/wiki/footgun> for our users. > >>>>>>> The concept of casting one value to another inherently involves > >>> some > >>>>>>> concept of failure, this is something as a programmer I expect, > >>>> exactly > >>>>>>> like when dividing a value by 0 or when sending a message to an > >>>>> external > >>>>>>> system. And this is why every programming language has some > >>> explicit > >>>>> way > >>>>>> to > >>>>>>> signal to you such failures exist and, some of them, even force > >>> you > >>>> to > >>>>>> deal > >>>>>>> with such failures, e.g. Java has the Exceptions and the try catch > >>>>> block, > >>>>>>> Rust has the ? operator, Golang returns you an error together with > >>>> the > >>>>>>> result. Not failing when a failure is inherently defined by the > >>>>> operation > >>>>>>> itself, or even not being explicit about the fact that such > >>> operation > >>>>> can > >>>>>>> fail, leads users to mistakenly think the operation they're doing > >>> is > >>>>>> always > >>>>>>> safe and cannot lead to failures. And this is IMHO really the > >>> problem > >>>>>> with > >>>>>>> the CAST primitive we have right now: it has a concept of failure > >>> but > >>>>> we > >>>>>>> shade it, and we're not even explicit about the fact that we're > >>>> shading > >>>>>> it > >>>>>>> [1], and we expect users to go in the documentation and read that > >>>> CAST > >>>>>> can > >>>>>>> return an eventual NULL and then deal with it. I even question why > >>>> for > >>>>>>> example, we return NULL more than a default sane value when an > >>>>> exception > >>>>>>> happens, e.g. it would be way better to return epoch 0 more than > >>> NULL > >>>>>> when > >>>>>>> failing a cast from STRING to TIMESTAMP. This is for example the > >>>>> approach > >>>>>>> taken by Golang: parsing string to timestamp function returns > >>> both a > >>>>> sane > >>>>>>> value and an error. > >>>>>>> > >>>>>>> And the cherry on top is that, for our users, the consequences of > >>> the > >>>>> bad > >>>>>>> usage of CAST are simply disastrous: best case, some operator > >>> fails > >>>>> with > >>>>>>> NPE, worst case you get bad results or even some data is lost > >>> down in > >>>>> the > >>>>>>> pipeline. We give no indication at all that the cast failed, and > >>> even > >>>>> if > >>>>>> we > >>>>>>> push a change to log "hey this cast failed on this record" it > >>> would > >>>>> still > >>>>>>> be extremely complicated to track down how badly a single cast > >>>> failure > >>>>>>> affected the results of a projection, a grouping, an aggregation, > >>>> etc. > >>>>>>> Hence my definition of our CAST function as a footgun. > >>>>>>> > >>>>>>> The bottom line for me is that our CAST primitive goes directly > >>>> against > >>>>>> the > >>>>>>> goal of Flink SQL to provide a simple to use API for developers > >>> and > >>>>>>> business people to develop computation pipelines, because it's not > >>>>>>> explicit, it silently fail with NULLs, and we require users to > >>> deal > >>>>> with > >>>>>>> it. > >>>>>>> > >>>>>>> The very same discussion applies with TO_TIMESTAMP, which among > >>> the > >>>>>> others > >>>>>>> might even be more crucial because we directly use it in our > >>>>>> documentation > >>>>>>> to tell users how to compute rowtime. > >>>>>>> > >>>>>>> FG > >>>>>>> > >>>>>>> [1] Note: here the naming is a fundamental part of the issue, the > >>>>>> function > >>>>>>> we have today is named CAST and not TRY_CAST or CAST_OR_NULL or > >>> any > >>>>> other > >>>>>>> name giving the indication that the operation might fail and > >>> provide > >>>> a > >>>>>>> result different from the cast result. > >>>>>>> > >>>>>>> > >>>>>>> On Fri, Nov 19, 2021 at 4:00 AM Kurt Young <ykt...@gmail.com> > >>> wrote: > >>>>>>> > >>>>>>>> Hi Timo, > >>>>>>>> > >>>>>>>> Regarding CAST, I think no one denies the standard behavior > >>> which > >>>>>> should > >>>>>>>> raise errors when > >>>>>>>> failed. The only question is how do we solve it, given lots of > >>>> users > >>>>>>>> already relying on current > >>>>>>>> more tolerant behavior. Some violation of standard but > >>> acceptable > >>>>>>> behavior > >>>>>>>> doesn't deserve > >>>>>>>> a breaking change in Flink minor version IMO, i'm more > >>> comfortable > >>>> to > >>>>>> fix > >>>>>>>> it in versions like > >>>>>>>> Flink 2.0. > >>>>>>>> > >>>>>>>> Best, > >>>>>>>> Kurt > >>>>>>>> > >>>>>>>> > >>>>>>>> On Thu, Nov 18, 2021 at 11:44 PM Timo Walther < > >>> twal...@apache.org> > >>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Hi everyone, > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> thanks for finally have this discussion on the mailing list. > >>> As > >>>>> both > >>>>>> a > >>>>>>>>> contributor and user, I have experienced a couple issues > >>> around > >>>>>>>>> nullability coming out of nowhere in a pipeline. This > >>> discussion > >>>>>> should > >>>>>>>>> not only cover CAST but failure handling in general. > >>>>>>>>> > >>>>>>>>> Let me summarize my opinion: > >>>>>>>>> > >>>>>>>>> 1) CAST vs. TRY_CAST > >>>>>>>>> > >>>>>>>>> CAST is a SQL standard core operation with well-defined > >>> semantics > >>>>>>> across > >>>>>>>>> all major SQL vendors. There should be no discussion whether > >>> it > >>>>>> returns > >>>>>>>>> NULL or an error. The semantics are already defined > >>> externally. I > >>>>>> don't > >>>>>>>>> agree with "Streaming computing is a resident program ... > >>> users > >>>> do > >>>>>> not > >>>>>>>>> want it to frequently fail", the same argument is also true > >>> for > >>>>>> nightly > >>>>>>>>> batch jobs. A batch job can also get stuck through a SQL > >>>> statement > >>>>>> that > >>>>>>>>> is not lenient enough defined by the user. > >>>>>>>>> > >>>>>>>>> An option that restores the old behavior and TRY_CAST for the > >>>>> future > >>>>>>>>> should solve this use case and make all parties happy. > >>>>>>>>> > >>>>>>>>> 2) TO_TIMESTAMP / TO_DATE > >>>>>>>>> > >>>>>>>>> We should distinguish between CASTING and CONVERSION / > >>> PARSING. > >>>> As > >>>>> a > >>>>>>>>> user, I would expect that parsing can fail and have to deal > >>> with > >>>>> this > >>>>>>>>> accordingly. Therefore, I'm fine with returning NULL in TO_ or > >>>>>> CONVERT_ > >>>>>>>>> functions. This is also consistent with other vendors. Take > >>> PARSE > >>>>> of > >>>>>>> SQL > >>>>>>>>> Server as an example [1]: "If a parameter with a null value is > >>>>> passed > >>>>>>> at > >>>>>>>>> run time, then a null is returned, to avoid canceling the > >>> whole > >>>>>>> batch.". > >>>>>>>>> Here we can be more flexible with the semantics because users > >>>> need > >>>>> to > >>>>>>>>> read the docs anyway. > >>>>>>>>> > >>>>>>>>> 3) Null at other locations > >>>>>>>>> > >>>>>>>>> In general, we should stick to our data type constraints. > >>>>> Everything > >>>>>>>>> else will mess up the architecture of functions/connectors and > >>>>> their > >>>>>>>>> return types. Take the rowtime (event-time timestamp) > >>> attribute > >>>> as > >>>>> an > >>>>>>>>> example: PRs like the one for FLINK-24885 are just the peak of > >>>> the > >>>>>>>>> iceberg. If we would allow rowtime columns to be NULL we would > >>>> need > >>>>>> to > >>>>>>>>> check all time-based operators and implement additional > >>> handling > >>>>>> logic > >>>>>>>>> for this. > >>>>>>>>> > >>>>>>>>> It would be better to define unified error-handling for > >>> operators > >>>>> and > >>>>>>>>> maybe drop rows if the per-element processing failed. We > >>> should > >>>>> have > >>>>>> a > >>>>>>>>> unified approach how to log/side output such records. > >>>>>>>>> > >>>>>>>>> Until this is in place, I would suggest we spend some time in > >>>> rules > >>>>>>> that > >>>>>>>>> can be enabled with an option for modifying the plan and wrap > >>>>>>> frequently > >>>>>>>>> failing expressions with a generic TRY() function. In this > >>> case, > >>>> we > >>>>>>>>> don't need to deal with NULL in all built-in functions, we can > >>>>> throw > >>>>>>>>> helpful errors during development, and can return NULL even > >>>> though > >>>>>> the > >>>>>>>>> return type is NOT NULL. It would also make the NULL returning > >>>>>> explicit > >>>>>>>>> in the plan. > >>>>>>>>> > >>>>>>>>> Regards, > >>>>>>>>> Timo > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> [1] > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > https://docs.microsoft.com/en-us/sql/t-sql/functions/parse-transact-sql?view=sql-server-ver15 > >>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-24885 > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On 18.11.21 11:34, Kurt Young wrote: > >>>>>>>>>> Sorry I forgot to add user ML. I also would like to gather > >>> some > >>>>>> users > >>>>>>>>>> feedback on this thing. > >>>>>>>>>> Since I didn't get any feedback on this topic before from > >>>> users. > >>>>>>>>>> > >>>>>>>>>> Best, > >>>>>>>>>> Kurt > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Thu, Nov 18, 2021 at 6:33 PM Kurt Young < > >>> ykt...@gmail.com> > >>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> (added user ML to this thread) > >>>>>>>>>>> > >>>>>>>>>>> HI all, > >>>>>>>>>>> > >>>>>>>>>>> I would like to raise a different opinion about this > >>> change. I > >>>>>> agree > >>>>>>>>> with > >>>>>>>>>>> Ingo that > >>>>>>>>>>> we should not just break some existing behavior, and even > >>> if > >>>> we > >>>>>>>>> introduce > >>>>>>>>>>> an > >>>>>>>>>>> option to control the behavior, i would propose to set the > >>>>> default > >>>>>>>> value > >>>>>>>>>>> to current > >>>>>>>>>>> behavior. > >>>>>>>>>>> > >>>>>>>>>>> I want to mention one angle to assess whether we should > >>> change > >>>>> it > >>>>>> or > >>>>>>>>> not, > >>>>>>>>>>> which > >>>>>>>>>>> is "what could users benefit from the changes". To me, it > >>>> looks > >>>>>>> like: > >>>>>>>>>>> > >>>>>>>>>>> * new users: happy about the behavior > >>>>>>>>>>> * existing users: suffer from the change, it either cause > >>> them > >>>>> to > >>>>>>>> modify > >>>>>>>>>>> the SQL or > >>>>>>>>>>> got a call in late night reporting his online job got > >>> crashed > >>>>> and > >>>>>>>>> couldn't > >>>>>>>>>>> be able to > >>>>>>>>>>> restart. > >>>>>>>>>>> > >>>>>>>>>>> I would like to quote another breaking change we did when > >>> we > >>>>>> adjust > >>>>>>>> the > >>>>>>>>>>> time-related > >>>>>>>>>>> function in FLIP-162 [1]. In that case, both new users and > >>>>>> existing > >>>>>>>>> users > >>>>>>>>>>> are suffered > >>>>>>>>>>> from *incorrectly* implemented time function behavior, and > >>> we > >>>>> saw > >>>>>> a > >>>>>>>> lots > >>>>>>>>>>> of feedbacks and > >>>>>>>>>>> complains from various channels. After we fixed that, we > >>> never > >>>>> saw > >>>>>>>>> related > >>>>>>>>>>> problems again. > >>>>>>>>>>> > >>>>>>>>>>> Back to this topic, do we ever seen a user complain about > >>>>> current > >>>>>>> CAST > >>>>>>>>>>> behavior? Form my > >>>>>>>>>>> side, no. > >>>>>>>>>>> > >>>>>>>>>>> To summarize: > >>>>>>>>>>> > >>>>>>>>>>> +1 to introduce TRY_CAST to better prepare for the future. > >>>>>>>>>>> -1 to modify the default behavior. > >>>>>>>>>>> +0 to introduce a config option, but with the default > >>> value to > >>>>>>>> existing > >>>>>>>>>>> behavior. it's +0 because it > >>>>>>>>>>> seems not necessary if i'm -1 to change the default > >>> behavior > >>>> and > >>>>>>> also > >>>>>>>>>>> don't see an urgent to modify. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> [1] > >>>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-162%3A+Consistent+Flink+SQL+time+function+behavior > >>>>>>>>>>> > >>>>>>>>>>> Best, > >>>>>>>>>>> Kurt > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Thu, Nov 18, 2021 at 4:26 PM Ingo Bürk < > >>> i...@ververica.com > >>>>> > >>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi, > >>>>>>>>>>>> > >>>>>>>>>>>> first of all, thanks for the summary of both sides, and > >>> for > >>>>>>> bringing > >>>>>>>> up > >>>>>>>>>>>> the > >>>>>>>>>>>> discussion on this. > >>>>>>>>>>>> I think it is obvious that this is not something we can > >>> just > >>>>>>> "break", > >>>>>>>>> so > >>>>>>>>>>>> the config option seems mandatory to me. > >>>>>>>>>>>> > >>>>>>>>>>>> Overall I agree with Martijn and Till that throwing > >>> errors is > >>>>> the > >>>>>>>> more > >>>>>>>>>>>> expected behavior. I mostly think this is valuable default > >>>>>> behavior > >>>>>>>>>>>> because > >>>>>>>>>>>> it allows developers to find mistakes early and diagnose > >>> them > >>>>>> much > >>>>>>>>> easier > >>>>>>>>>>>> compare to having to "work backwards" and figure out that > >>> it > >>>> is > >>>>>> the > >>>>>>>>> CAST > >>>>>>>>>>>> that failed. It also means that pipelines using TRY_CAST > >>> are > >>>>>>>>>>>> self-documenting because using that can signal "we might > >>>>> receive > >>>>>>>> broken > >>>>>>>>>>>> data here". > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Best > >>>>>>>>>>>> Ingo > >>>>>>>>>>>> > >>>>>>>>>>>> On Thu, Nov 18, 2021 at 9:11 AM Till Rohrmann < > >>>>>>> trohrm...@apache.org> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi everyone, > >>>>>>>>>>>>> > >>>>>>>>>>>>> personally I would also prefer the system telling me that > >>>>>>> something > >>>>>>>> is > >>>>>>>>>>>>> wrong instead of silently ignoring records. If there is a > >>>>>> TRY_CAST > >>>>>>>>>>>> function > >>>>>>>>>>>>> that has the old behaviour, people can still get the old > >>>>>>> behaviour. > >>>>>>>>> For > >>>>>>>>>>>>> backwards compatibility reasons it is a good idea to > >>>>> introduce a > >>>>>>>>> switch > >>>>>>>>>>>> to > >>>>>>>>>>>>> get back the old behaviour. By default we could set it to > >>>> the > >>>>>> new > >>>>>>>>>>>>> behaviour, though. Of course, we should explicitly > >>> document > >>>>> this > >>>>>>> new > >>>>>>>>>>>>> behaviour so that people are aware of it before running > >>>> their > >>>>>> jobs > >>>>>>>> for > >>>>>>>>>>>> days > >>>>>>>>>>>>> and then encountering an invalid input. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Cheers, > >>>>>>>>>>>>> Till > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Thu, Nov 18, 2021 at 9:02 AM Martijn Visser < > >>>>>>>> mart...@ververica.com > >>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi Caizhi, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks for bringing this up for discussion. I think the > >>>>>> important > >>>>>>>>>>>> part is > >>>>>>>>>>>>>> what do developers expect as the default behaviour of a > >>>> CAST > >>>>>>>> function > >>>>>>>>>>>>> when > >>>>>>>>>>>>>> casting fails. If I look at Postgres [1] or MSSQL [2], > >>> the > >>>>>>> default > >>>>>>>>>>>>>> behaviour of a CAST failing would be to return an error, > >>>>> which > >>>>>>>> would > >>>>>>>>>>>> be > >>>>>>>>>>>>> the > >>>>>>>>>>>>>> new behaviour. Returning a value when a CAST fails can > >>> lead > >>>>> to > >>>>>>>> users > >>>>>>>>>>>> not > >>>>>>>>>>>>>> understanding immediately where that value comes from. > >>> So, > >>>> I > >>>>>>> would > >>>>>>>> be > >>>>>>>>>>>> in > >>>>>>>>>>>>>> favor of the new behaviour by default, but including a > >>>>>>>> configuration > >>>>>>>>>>>> flag > >>>>>>>>>>>>>> to maintain the old behaviour to avoid that you need to > >>>>> rewrite > >>>>>>> all > >>>>>>>>>>>> these > >>>>>>>>>>>>>> jobs. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Best regards, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Martijn > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> [1] > >>>>>> https://www.postgresql.org/docs/current/sql-createcast.html > >>>>>>>>>>>>>> [2] > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > https://docs.microsoft.com/en-us/sql/t-sql/functions/try-cast-transact-sql?view=sql-server-ver15 > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Thu, 18 Nov 2021 at 03:17, Caizhi Weng < > >>>>>> tsreape...@gmail.com> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hi devs! > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> We're discussing the behavior of casting functions > >>>>> (including > >>>>>>>> cast, > >>>>>>>>>>>>>>> to_timestamp, to_date, etc.) for invalid input in > >>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-24924. As > >>>> this > >>>>>>> topic > >>>>>>>> is > >>>>>>>>>>>>>>> crucial > >>>>>>>>>>>>>>> to compatibility and usability we'd like to continue > >>>>>> discussing > >>>>>>>> this > >>>>>>>>>>>>>>> publicly in the mailing list. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The main topic is to discuss that shall casting > >>> functions > >>>>>> return > >>>>>>>>>>>> null > >>>>>>>>>>>>>> (keep > >>>>>>>>>>>>>>> its current behavior) or throw exceptions (introduce a > >>> new > >>>>>>>>>>>> behavior). > >>>>>>>>>>>>> I'm > >>>>>>>>>>>>>>> trying to conclude the ideas on both sides. Correct me > >>> if > >>>> I > >>>>>> miss > >>>>>>>>>>>>>> something. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> *From the devs who support throwing exceptions (new > >>>>>> behavior):* > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The main concern is that if we silently return null > >>> then > >>>>>>>> unexpected > >>>>>>>>>>>>>> results > >>>>>>>>>>>>>>> or exceptions (mainly NullPointerException) may be > >>>> produced. > >>>>>>>>>>>> However, > >>>>>>>>>>>>> it > >>>>>>>>>>>>>>> will be hard for users to reason for this because > >>> there is > >>>>> no > >>>>>>>>>>>> detailed > >>>>>>>>>>>>>>> message. If we throw exceptions in the first place, > >>> then > >>>>> it's > >>>>>>> much > >>>>>>>>>>>>> easier > >>>>>>>>>>>>>>> to catch the exception with nice detailed messages > >>>>> explaining > >>>>>>> what > >>>>>>>>>>>> is > >>>>>>>>>>>>>> going > >>>>>>>>>>>>>>> wrong. Especially for this case of DATE/TIME/TIMESTAMP > >>>> it's > >>>>>> very > >>>>>>>>>>>>> helpful > >>>>>>>>>>>>>> to > >>>>>>>>>>>>>>> have a detailed error and see where and why the parsing > >>>>> broke. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> For compatibility concerns, we can provide a TRY_CAST > >>>>> function > >>>>>>>>>>>> which is > >>>>>>>>>>>>>>> exactly the same as the current CAST function by > >>> returning > >>>>>> nulls > >>>>>>>> for > >>>>>>>>>>>>>>> invalid input. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> *From the devs who support return null (current > >>>> behavior):* > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The main concern is compatibility and usability. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On usability: The upstream system may occasionally > >>> produce > >>>>>>> invalid > >>>>>>>>>>>> data > >>>>>>>>>>>>>> and > >>>>>>>>>>>>>>> if we throw exceptions when seeing this it will fail > >>> the > >>>> job > >>>>>>> again > >>>>>>>>>>>> and > >>>>>>>>>>>>>>> again even after restart (because the invalid data is > >>>> always > >>>>>>>>>>>>>>> there). Streaming computing is a resident program and > >>>> users > >>>>> do > >>>>>>> not > >>>>>>>>>>>> want > >>>>>>>>>>>>>> it > >>>>>>>>>>>>>>> to frequently fail and cannot automatically recover. > >>> Most > >>>>>> users > >>>>>>>> are > >>>>>>>>>>>>>> willing > >>>>>>>>>>>>>>> to just skip that record and continue processing. > >>> Imagine > >>>> an > >>>>>>>> online > >>>>>>>>>>>> job > >>>>>>>>>>>>>>> running for a couple of weeks and suddenly fails due to > >>>> some > >>>>>>>>>>>> unexpected > >>>>>>>>>>>>>>> dirty data. What choices do users have to quickly > >>> resume > >>>> the > >>>>>>> job? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On compatibility: There are currently thousands of > >>> users > >>>> and > >>>>>>> tens > >>>>>>>> of > >>>>>>>>>>>>>>> thousands of jobs relying on the current behavior to > >>>> filter > >>>>>> out > >>>>>>>>>>>> invalid > >>>>>>>>>>>>>>> input. If we change the behavior it will be a disaster > >>> for > >>>>>> users > >>>>>>>> as > >>>>>>>>>>>>> they > >>>>>>>>>>>>>>> have to rewrite and check their SQL very carefully. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> What do you think? We're looking forward to your > >>> feedback. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > >> > >> -- > >> Marios > >> > > > > > >