Hi everyone,

thanks for finally having this discussion on the mailing list. As both a contributor and a user, I have experienced a couple of issues where nullability appears out of nowhere in a pipeline. This discussion should cover not only CAST but failure handling in general.

Let me summarize my opinion:

1) CAST vs. TRY_CAST

CAST is a SQL standard core operation with well-defined semantics across all major SQL vendors. There should be no discussion about whether it returns NULL or an error; the semantics are already defined externally. I don't agree with "Streaming computing is a resident program ... users do not want it to frequently fail"; the same argument also applies to nightly batch jobs. A batch job can also get stuck because of a SQL statement that the user has not defined leniently enough.

An option that restores the old behavior, plus TRY_CAST for the future, should solve this use case and make all parties happy.
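To make the two semantics concrete, here is a minimal Python sketch of the distinction being proposed; `strict_cast` and `try_cast` are hypothetical illustrative helpers, not Flink API:

```python
# Sketch of the two proposed semantics. Invalid input is modeled
# by a string that cannot be parsed as an integer.

def strict_cast(value: str) -> int:
    """SQL-standard CAST: invalid input raises an error."""
    return int(value)  # raises ValueError on e.g. "abc"

def try_cast(value: str):
    """TRY_CAST: invalid input yields NULL (None) instead of failing."""
    try:
        return int(value)
    except ValueError:
        return None

print(try_cast("42"))   # 42
print(try_cast("abc"))  # None
```

The point of the option would then be to pick which of these two behaviors the plain CAST keyword compiles to, while TRY_CAST always uses the lenient one.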

2) TO_TIMESTAMP / TO_DATE

We should distinguish between CASTING and CONVERSION / PARSING. As a user, I would expect that parsing can fail and that I have to deal with this accordingly. Therefore, I'm fine with returning NULL in TO_ or CONVERT_ functions. This is also consistent with other vendors. Take SQL Server's PARSE as an example [1]: "If a parameter with a null value is passed at run time, then a null is returned, to avoid canceling the whole batch.". Here we can be more flexible with the semantics because users need to read the docs anyway.
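A small Python sketch of this "parsing may fail, so return NULL" contract, loosely modeled on TO_TIMESTAMP (the helper name and default format are illustrative assumptions):

```python
# Hypothetical TO_TIMESTAMP-like parser: unparsable input yields
# None (SQL NULL) rather than failing the whole job.
from datetime import datetime

def to_timestamp(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Return a timestamp, or None (SQL NULL) if parsing fails."""
    if text is None:
        return None  # NULL in -> NULL out, as in SQL Server's PARSE
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None

print(to_timestamp("2021-11-18 11:34:00"))
print(to_timestamp("not a timestamp"))  # None
```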

3) Null at other locations

In general, we should stick to our data type constraints. Everything else will mess up the architecture of functions/connectors and their return types. Take the rowtime (event-time timestamp) attribute as an example: PRs like the one for FLINK-24885 are just the tip of the iceberg. If we allowed rowtime columns to be NULL, we would need to check all time-based operators and implement additional handling logic for this.

It would be better to define unified error handling for operators and maybe drop rows if the per-element processing failed. We should have a unified approach for how to log or side-output such records.

Until this is in place, I would suggest we spend some time on planner rules, enabled via an option, that modify the plan and wrap frequently failing expressions in a generic TRY() function. In this case, we don't need to deal with NULL in all built-in functions, we can throw helpful errors during development, and we can return NULL even though the return type is NOT NULL. It would also make the NULL-returning behavior explicit in the plan.
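The TRY() idea above can be sketched as follows; this is a minimal Python illustration of the runtime behavior, where `try_expr` and the `side_output` list are stand-ins I made up, not actual Flink API:

```python
# Sketch of a generic TRY() wrapper: evaluate an expression, and on
# failure return NULL while routing the record to a side output so
# the failure stays visible instead of silently disappearing.

side_output = []  # stand-in for a logging/side-output channel

def try_expr(fn, record):
    try:
        return fn(record)
    except Exception as err:
        side_output.append((record, str(err)))  # record the failure
        return None  # NULL, even if the declared type is NOT NULL

rows = ["1", "2", "oops", "4"]
results = [try_expr(int, r) for r in rows]
print(results)      # [1, 2, None, 4]
print(side_output)  # [('oops', ...)]
```

Because the planner would insert this wrapper explicitly, the plan itself documents exactly where NULLs can appear, which is the main advantage over making every built-in function NULL-tolerant.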

Regards,
Timo

[1] https://docs.microsoft.com/en-us/sql/t-sql/functions/parse-transact-sql?view=sql-server-ver15
[2] https://issues.apache.org/jira/browse/FLINK-24885





On 18.11.21 11:34, Kurt Young wrote:
Sorry, I forgot to add the user ML. I would also like to gather some user
feedback on this, since I haven't received any feedback from users on this
topic before.

Best,
Kurt


On Thu, Nov 18, 2021 at 6:33 PM Kurt Young <ykt...@gmail.com> wrote:

(added user ML to this thread)

HI all,

I would like to raise a different opinion about this change. I agree with
Ingo that we should not just break existing behavior, and even if we
introduce an option to control the behavior, I would propose setting the
default value to the current behavior.

I want to mention one angle for assessing whether we should change it or
not, which is "what could users benefit from the changes". To me, it looks
like:

* new users: happy about the behavior
* existing users: suffer from the change; it either causes them to modify
their SQL, or gets them a call late at night reporting that their online
job crashed and cannot be restarted.

I would like to quote another breaking change we made when we adjusted the
time-related functions in FLIP-162 [1]. In that case, both new users and
existing users suffered from the *incorrectly* implemented time function
behavior, and we saw lots of feedback and complaints from various channels.
After we fixed that, we never saw related problems again.

Back to this topic: have we ever seen a user complain about the current
CAST behavior? From my side, no.

To summarize:

+1 to introduce TRY_CAST to better prepare for the future.
-1 to modify the default behavior.
+0 to introduce a config option, but with the default value set to the
existing behavior. It's +0 because the option seems unnecessary if I'm -1
on changing the default behavior, and I also don't see an urgent need to
modify it.


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-162%3A+Consistent+Flink+SQL+time+function+behavior

Best,
Kurt


On Thu, Nov 18, 2021 at 4:26 PM Ingo Bürk <i...@ververica.com> wrote:

Hi,

first of all, thanks for the summary of both sides, and for bringing up
the discussion on this. I think it is obvious that this is not something
we can just "break", so the config option seems mandatory to me.

Overall I agree with Martijn and Till that throwing errors is the more
expected behavior. I mostly think this is valuable default behavior
because it allows developers to find mistakes early and diagnose them much
more easily, compared to having to "work backwards" and figure out that it
is the CAST that failed. It also means that pipelines using TRY_CAST are
self-documenting, because using it signals "we might receive broken data
here".


Best
Ingo

On Thu, Nov 18, 2021 at 9:11 AM Till Rohrmann <trohrm...@apache.org>
wrote:

Hi everyone,

personally I would also prefer the system telling me that something is
wrong instead of silently ignoring records. If there is a TRY_CAST
function that has the old behaviour, people can still get the old
behaviour. For backwards compatibility reasons it is a good idea to
introduce a switch to get back the old behaviour. By default we could set
it to the new behaviour, though. Of course, we should explicitly document
this new behaviour so that people are aware of it before running their
jobs for days and then encountering an invalid input.

Cheers,
Till

On Thu, Nov 18, 2021 at 9:02 AM Martijn Visser <mart...@ververica.com>
wrote:

Hi Caizhi,

Thanks for bringing this up for discussion. I think the important part is
what developers expect as the default behaviour of a CAST function when
casting fails. If I look at Postgres [1] or MSSQL [2], the default
behaviour of a failing CAST is to return an error, which would be the new
behaviour. Returning a value when a CAST fails can lead to users not
immediately understanding where that value comes from. So, I would be in
favor of the new behaviour by default, but including a configuration flag
to maintain the old behaviour, so that all these jobs do not need to be
rewritten.

Best regards,

Martijn

[1] https://www.postgresql.org/docs/current/sql-createcast.html
[2]
https://docs.microsoft.com/en-us/sql/t-sql/functions/try-cast-transact-sql?view=sql-server-ver15

On Thu, 18 Nov 2021 at 03:17, Caizhi Weng <tsreape...@gmail.com>
wrote:

Hi devs!

We're discussing the behavior of casting functions (including CAST,
TO_TIMESTAMP, TO_DATE, etc.) for invalid input in
https://issues.apache.org/jira/browse/FLINK-24924. As this topic is
crucial to compatibility and usability, we'd like to continue discussing
it publicly on the mailing list.

The main topic is whether casting functions should return null (keeping
the current behavior) or throw exceptions (introducing a new behavior).
I'll try to summarize the ideas on both sides. Correct me if I missed
something.

*From the devs who support throwing exceptions (new behavior):*

The main concern is that if we silently return null, then unexpected
results or exceptions (mainly NullPointerException) may be produced.
However, it will be hard for users to reason about these because there is
no detailed message. If we throw exceptions in the first place, it's much
easier to catch the exception with a nice, detailed message explaining
what went wrong. Especially in the case of DATE/TIME/TIMESTAMP it's very
helpful to have a detailed error and see where and why the parsing broke.

For compatibility concerns, we can provide a TRY_CAST function which
behaves exactly like the current CAST function, returning null for
invalid input.

*From the devs who support returning null (current behavior):*

The main concern is compatibility and usability.

On usability: the upstream system may occasionally produce invalid data,
and if we throw exceptions when seeing it, the job will fail again and
again even after restarts (because the invalid data is always there). A
streaming job is a long-running program, and users do not want it to fail
frequently and be unable to recover automatically. Most users are willing
to just skip that record and continue processing. Imagine an online job
running for a couple of weeks that suddenly fails due to some unexpected
dirty data. What choices do users have to quickly resume the job?

On compatibility: there are currently thousands of users and tens of
thousands of jobs relying on the current behavior to filter out invalid
input. If we change the behavior, it will be a disaster for users, as
they will have to rewrite and check their SQL very carefully.


What do you think? We're looking forward to your feedback.






