Re: [DISCUSS] FLIP-440: User-defined SQL operators / ProcessTableFunction (PTF)

Timo Walther Fri, 01 Nov 2024 02:26:38 -0700

Hi Xuyang,

thanks for the good questions.


1. What happens if the TTLs for these different StateHints are not the same?

The eval() method fully determines the available state and its TTL. Helper methods such as onTimer() and finish() can reference a subset of the declared state. It is not necessary for the helper methods to declare all state properties again. The name should be sufficient, and we should forbid setting additional properties.
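
To make the rule concrete, here is a minimal plain-Java sketch of the check described above. It is only a model under stated assumptions: `StateDecl`, `resolve`, and the state names are hypothetical and not part of Flink or the FLIP. It assumes eval() is the single source of truth for TTLs, that helper methods may only name state already declared there, and that any property set outside eval() is rejected.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the rule; none of these types are Flink API. */
public class StateDeclarationCheck {

    /** Hypothetical stand-in for one state declaration; ttl == null means "no properties set". */
    static final class StateDecl {
        final String name;
        final Duration ttl;

        StateDecl(String name, Duration ttl) {
            this.name = name;
            this.ttl = ttl;
        }
    }

    /**
     * Returns the effective TTL per state name, taken from eval() only.
     * Helper declarations may only reference declared names and must not carry properties.
     */
    static Map<String, Duration> resolve(List<StateDecl> evalState, List<StateDecl> helperState) {
        Map<String, Duration> effective = new LinkedHashMap<>();
        for (StateDecl decl : evalState) {
            effective.put(decl.name, decl.ttl);
        }
        for (StateDecl ref : helperState) {
            if (!effective.containsKey(ref.name)) {
                throw new IllegalArgumentException(
                        "State '" + ref.name + "' is not declared in eval()");
            }
            if (ref.ttl != null) {
                throw new IllegalArgumentException(
                        "State '" + ref.name + "' must not set properties outside eval()");
            }
        }
        return effective;
    }

    public static void main(String[] args) {
        List<StateDecl> eval = List.of(
                new StateDecl("count", Duration.ofDays(1)),
                new StateDecl("lastSeen", Duration.ofHours(6)));
        // onTimer() references "count" by name only.
        List<StateDecl> onTimer = List.of(new StateDecl("count", null));
        System.out.println(resolve(eval, onTimer));
    }
}
```

Under this model, the question of conflicting TTLs cannot arise: a helper method that tries to set its own TTL fails validation instead of silently overriding the value from eval().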

2. I believe the named arguments introduced in FLIP-387[1] can also be applied to this ProcessTableFunction, right?

Absolutely, the PTF actually needs named arguments, especially for optional fields such as uid or on_time. For forward compatibility, I would even suggest that PTFs only support named arguments. Not sure if we can enforce that.


3. Will we expose the original RowKind in the eval method's row input?

Yes, although it's likely that only advanced users will make use of that. In that case, users have to work with Row/RowData. It's more likely that built-in functions will make use of this. The default changelog mode for both input and output is append.
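
As an illustration of what a function could do once the change kind is visible, here is a minimal plain-Java sketch. The `Kind` enum is a local stand-in that mirrors the four kinds of Flink's RowKind (+I, -U, +U, -D); `ChangeRow` and `apply` are hypothetical names, and the real PTF signature is whatever the FLIP ends up defining.

```java
import java.util.List;

/** Plain-Java model of changelog-aware counting; none of these types are Flink API. */
public class ChangelogCountSketch {

    /** Local stand-in mirroring the four change kinds: +I, -U, +U, -D. */
    enum Kind { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }

    /** Hypothetical input row: a change kind plus a key. */
    static final class ChangeRow {
        final Kind kind;
        final String key;

        ChangeRow(Kind kind, String key) {
            this.kind = kind;
            this.key = key;
        }
    }

    /** Additions (+I, +U) count up, retractions (-U, -D) count down. */
    static long apply(long count, ChangeRow row) {
        switch (row.kind) {
            case INSERT:
            case UPDATE_AFTER:
                return count + 1;
            case UPDATE_BEFORE:
            case DELETE:
                return count - 1;
            default:
                throw new IllegalStateException("Unknown kind: " + row.kind);
        }
    }

    public static void main(String[] args) {
        List<ChangeRow> changelog = List.of(
                new ChangeRow(Kind.INSERT, "a"),
                new ChangeRow(Kind.UPDATE_BEFORE, "a"),
                new ChangeRow(Kind.UPDATE_AFTER, "a"),
                new ChangeRow(Kind.INSERT, "b"),
                new ChangeRow(Kind.DELETE, "b"));
        long count = 0;
        for (ChangeRow row : changelog) {
            count = apply(count, row);
        }
        System.out.println(count); // prints 1: only "a" is still present
    }
}
```

This is the same split that accumulate and retract give a UDAF, folded into a single method that branches on the kind.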


4. Are we allowing users to define both styles simultaneously?

Yes. Context is optional, and so is state access in helper methods (finish/onTimer). This reduces the overhead in case a PTF runs in a container or another process.


I will update the FLIP to reflect these answers.

Thanks,
Timo



On 01.11.24 05:10, Xuyang wrote:

Hi, Timo.

Thank you for this great work! When I previously introduced the session window
TVF, I was contemplating how to enable users to define a PTF in SQL. I'm glad
to see this work being discussed and that it has improved the integration with
the DataStream API.

After reading the entire FLIP, I have a few questions that I hope you can
address.

1. I noticed that in the example, the same field (e.g., CountState) can declare
a StateHint in the eval, onTimer, and finish methods. What happens if the TTLs
for these different StateHints are not the same?

2. I believe the named arguments introduced in FLIP-387[1] can also be applied
to this ProcessTableFunction, right?

3. In our UDAFs, we expect users to provide accumulate and retract methods to
handle input data for +I/+U and -U/-D. However, in the eval method of a
ScalarFunction/UDTF, users do not have visibility into the input's RowKind. In
the new PTF, will we expose the original RowKind in the eval method's row
input, allowing users to determine the row's RowKind themselves?

4. I noticed that in the examples, the eval method sometimes includes the
Context, @StateHint fields, and the input data (Row input), while other times
it only consists of the input data. Are we allowing users to define both styles
simultaneously?

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-387%3A+Support+named+parameters+for+functions+and+call+procedures

Best!
Xuyang

At 2024-10-31 21:57:37, "Timo Walther" <[email protected]> wrote:

Hi everyone,

thanks for all the feedback I received so far. I had very healthy
discussions with various people both online and offline at Current and
Flink Forward Berlin. The general user responses were also very
positive. The FLIP should be ready to start a VOTE thread.

This is the last call for feedback. I would start a VOTE tomorrow if
there are no objections. Happy to take further feedback during
implementation as well.

Thanks,
Timo

On 30.10.24 14:34, Timo Walther wrote:

Hi Jim,

3. Multiple output tables

  > Does the target_table need to be specified in the SELECT clause?

No. Similar to reading from a regular table, the filter column does not need
to be part of the SELECT list.

  > It seems like the two target_table could have separate schemas defined.

That is true. The SELECT is responsible for transforming the columns into
the target table's schema. The output row of the PTF might be a union of
various columns in this case.
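
A minimal plain-Java sketch of that idea, assuming a hypothetical output row that carries a target_table discriminator plus the union of both targets' columns. `OutRow`, `selectFor`, and the `reason` column are illustrative names only and do not come from the FLIP. Each call models one INSERT INTO ... SELECT ... WHERE target_table = '...': it filters on the discriminator and projects only the columns the target needs.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Plain-Java model of per-target filtering; none of these types are Flink API. */
public class SideOutputFilterSketch {

    /** Hypothetical PTF output row: discriminator plus the union of both targets' columns. */
    static final class OutRow {
        final String targetTable;
        final Integer a;      // only set for "main" rows
        final String b;       // only set for "main" rows
        final String reason;  // only set for "side" rows

        OutRow(String targetTable, Integer a, String b, String reason) {
            this.targetTable = targetTable;
            this.a = a;
            this.b = b;
            this.reason = reason;
        }
    }

    /** Keeps rows for one target and projects them; the discriminator itself is not projected. */
    static <T> List<T> selectFor(List<OutRow> rows, String target, Function<OutRow, T> projection) {
        return rows.stream()
                .filter(row -> target.equals(row.targetTable))
                .map(projection)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<OutRow> ptfOutput = List.of(
                new OutRow("main", 1, "x", null),
                new OutRow("side", null, null, "late"),
                new OutRow("main", 2, "y", null));
        // Each target gets its own projection, so the two schemas can differ.
        System.out.println(selectFor(ptfOutput, "main", r -> r.a + ":" + r.b));
        System.out.println(selectFor(ptfOutput, "side", r -> r.reason));
    }
}
```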

10. Support for State TTL

  > I'd be strongly in favor of doing any interface / base work we need in
  > the initial implementation so that state size can be managed.

I agree, State TTL is crucial. I updated the FLIP and added interfaces
to StateTypeStrategy and @StateHint.

Cheers,
Timo

On 23.10.24 17:59, Jim Hughes wrote:

Hi Timo,

Thank you for the answers.  I have a few clarifications inlined.

On Mon, Oct 14, 2024 at 8:07 AM Timo Walther <[email protected]> wrote:

3. Change of interfaces for multiple output tables
Currently, I think using a STATEMENT SET should be enough for side
output semantics. I have added an example in section 5.2.3.2 for that.
We are still free to add more methods to Context, let the function
implement additional interfaces or use more code generation together
with @ArgumentHints.


Does the target_table need to be specified in the SELECT clause?  Or could
it read

EXECUTE STATEMENT SET BEGIN
     INSERT INTO main
       SELECT a, b
       FROM FunctionWithSideOutput(input => data, uid => 'only_once')
       WHERE target_table = 'main';
     INSERT INTO side
       SELECT a, b
       FROM FunctionWithSideOutput(input => data, uid => 'only_once')
       WHERE target_table = 'side';
END;

Separately, for clarity, it seems like the two target_table could have
separate schemas defined.

10. Support for State TTL
Supporting state TTL will be easy. We just need to add a parameter to
@StateHint and pass it through.


If PTFs can have state, I'd be strongly in favor of doing any interface /
base work we need in the initial implementation so that state size can be
managed.  If it is just sufficient to have hints in the interface, awesome!

Cheers,

Jim
