Hi,

first of all, thanks for the summary of both sides, and for bringing up the
discussion on this.
I think it is obvious that this is not something we can just "break", so
the config option seems mandatory to me.

Overall I agree with Martijn and Till that throwing errors is the more
expected behavior. I think this is valuable default behavior mainly
because it allows developers to find mistakes early and diagnose them
much more easily, compared to having to "work backwards" and figure out
that it was the CAST that failed. It also means that pipelines using
TRY_CAST are self-documenting, because its use signals "we might receive
broken data here".


Best
Ingo

On Thu, Nov 18, 2021 at 9:11 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi everyone,
>
> personally I would also prefer the system telling me that something is
> wrong instead of silently ignoring records. If there is a TRY_CAST
> function that has the old behaviour, people can still get the old
> behaviour. For backwards compatibility reasons it is a good idea to
> introduce a switch to get back the old behaviour; by default we could
> set it to the new behaviour, though. Of course, we should explicitly
> document this new behaviour so that people are aware of it before
> running their jobs for days and then encountering invalid input.
>
> Cheers,
> Till
>
> On Thu, Nov 18, 2021 at 9:02 AM Martijn Visser <mart...@ververica.com>
> wrote:
>
> > Hi Caizhi,
> >
> > Thanks for bringing this up for discussion. I think the important part
> > is what developers expect as the default behaviour of a CAST function
> > when casting fails. If I look at Postgres [1] or MSSQL [2], the default
> > behaviour of a failing CAST would be to return an error, which would be
> > the new behaviour. Returning a value when a CAST fails can lead to
> > users not understanding immediately where that value comes from. So, I
> > would be in favor of the new behaviour by default, but with a
> > configuration flag to maintain the old behaviour, to avoid having to
> > rewrite all these jobs.
> >
> > Best regards,
> >
> > Martijn
> >
> > [1] https://www.postgresql.org/docs/current/sql-createcast.html
> > [2]
> > https://docs.microsoft.com/en-us/sql/t-sql/functions/try-cast-transact-sql?view=sql-server-ver15
> >
> > On Thu, 18 Nov 2021 at 03:17, Caizhi Weng <tsreape...@gmail.com> wrote:
> >
> > > Hi devs!
> > >
> > > We're discussing the behavior of casting functions (including CAST,
> > > TO_TIMESTAMP, TO_DATE, etc.) for invalid input in
> > > https://issues.apache.org/jira/browse/FLINK-24924. As this topic is
> > > crucial to compatibility and usability, we'd like to continue the
> > > discussion publicly on the mailing list.
> > >
> > > The main topic is whether casting functions should return null
> > > (keeping the current behavior) or throw exceptions (introducing a new
> > > behavior). I'm trying to summarize the ideas from both sides. Correct
> > > me if I missed something.
> > >
> > > *From the devs who support throwing exceptions (new behavior):*
> > >
> > > The main concern is that if we silently return null, then unexpected
> > > results or exceptions (mainly NullPointerException) may be produced,
> > > and it will be hard for users to reason about them because there is
> > > no detailed message. If we throw an exception in the first place, it
> > > is much easier to catch it, with a detailed message explaining what
> > > went wrong. Especially for DATE/TIME/TIMESTAMP it is very helpful to
> > > have a detailed error showing where and why the parsing broke.
> > >
> > > For compatibility concerns, we can provide a TRY_CAST function which is
> > > exactly the same as the current CAST function by returning nulls for
> > > invalid input.
> > >
> > > *From the devs who support return null (current behavior):*
> > >
> > > The main concern is compatibility and usability.
> > >
> > > On usability: The upstream system may occasionally produce invalid
> > > data, and if we throw exceptions on seeing it, the job will fail
> > > again and again even after restarts (because the invalid data is
> > > always there). A streaming job is a long-running program, and users
> > > do not want it to fail frequently without recovering automatically.
> > > Most users are willing to just skip that record and continue
> > > processing. Imagine an online job that has been running for a couple
> > > of weeks and suddenly fails due to some unexpected dirty data. What
> > > choices do users have to quickly resume the job?
> > >
> > > On compatibility: There are currently thousands of users and tens of
> > > thousands of jobs relying on the current behavior to filter out
> > > invalid input. If we change the behavior it will be a disaster for
> > > users, as they will have to rewrite and check their SQL very
> > > carefully.
> > >
> > >
> > > What do you think? We're looking forward to your feedback.
> > >
> >
>
