Re: [External] [DISCUSS] FLIP-292: Support configuring state TTL at operator level for Table API & SQL programs

Jane Chan Thu, 23 Mar 2023 05:11:29 -0700

Hi Yisha,

Thank you for the valuable feedback! I appreciate that you find this
proposal beneficial. I'd like to answer your questions and explain why we
prefer not to use SQL hints.

> 1. Hints may not only aim to inspire the planner. For example, we have
> dynamic options for tables and users can config parallelism for source/sink
> operators. State TTL is also kind of parameter which can be configured
> dynamically.

Our core reason for rejecting hints as the way to configure state TTL is
not that TTL is configured dynamically. In fact, configuring TTL through a
compiled plan is also a form of dynamic adjustment. The key point is that
state TTL affects the computed results. From the perspective of SQL
semantics, hints must not change the results of a query. Adjusting
parallelism, as in your example, is a good use case for hints because it
leaves the results unchanged.
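
To illustrate why TTL is a semantic matter, here is a toy simulation in
plain Python. It is not Flink code: the function name, the event layout,
and the expiry rule (state is dropped once it has been idle for at least
the TTL) are simplifying assumptions made only for this sketch. It computes
a running max per key, once without TTL and once with a short TTL.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[int, str, int]  # (timestamp, key, value); hypothetical layout


def running_max(events: List[Event], ttl: Optional[int]) -> List[Tuple[str, int]]:
    """Emit the running max per key, dropping state that was idle for >= ttl."""
    state: Dict[str, Tuple[int, int]] = {}  # key -> (current max, last update time)
    emitted = []
    for ts, key, value in events:
        entry = state.get(key)
        if entry is not None and ttl is not None and ts - entry[1] >= ttl:
            entry = None  # state expired, so the aggregate starts over
        current = value if entry is None else max(entry[0], value)
        state[key] = (current, ts)
        emitted.append((key, current))
    return emitted


events = [(0, "a", 10), (5, "a", 3), (100, "a", 4)]
without_ttl = running_max(events, ttl=None)
with_short_ttl = running_max(events, ttl=50)
print(without_ttl)
print(with_short_ttl)
```

The last emitted value differs between the two runs (10 versus 4), because
the earlier max was forgotten once the state expired. The same query over
the same input yields different results depending on the TTL alone.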

> 2. I agree that users need not to understand how SQL statements being
> translated to operators. And exactly for this reason, json plan is too
> complex for users to read. For example, if we enable
> 'table.optimizer.distinct-agg.split.enabled' and
> 'table.optimizer.incremental-agg-enabled', we get two Agg operators, I
> don't think all users know which operators are related to their queries and
> set TTL correctly.

Setting aside the first, semantic point, the second reason is that hints
are written in SQL but actually take effect on operators, so the scope of
a hint is not clear enough. For example, suppose a job first de-duplicates
a source, then joins it with another source, and finally aggregates the
result, and the user wants a different state TTL at each stage. In which
part of the SQL should the user write each hint, and which part of the SQL
should each hint act on? More importantly, some stateful operators are not
reflected in SQL at all, such as ChangelogNormalize and
SinkUpsertMaterializer, which the planner derives and adds implicitly.

You raise a good point that the compiled plan JSON is less readable, but it
is accurate enough. Coming back to the split distinct agg case, I think
users who enable this configuration are fully aware that their data is
skewed and needs a two-layer group aggregate.
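
As a rough sketch of what per-operator TTL on a plan could look like for
the de-duplicate / join / aggregate example, the Python below edits a
heavily simplified stand-in for a compiled plan. The dictionary layout, the
"state" entries with "index", "ttl" and "name", the state names, and the
helper set_state_ttl are all assumptions for illustration; they are not the
exact compiled plan format.

```python
import copy

# Hypothetical, simplified stand-in for a compiled plan (field names assumed).
PLAN = {
    "nodes": [
        {"id": 1, "type": "stream-exec-changelog-normalize_1",
         "state": [{"index": 0, "ttl": "0 ms", "name": "changelogNormalizeState"}]},
        {"id": 2, "type": "stream-exec-deduplicate_1",
         "state": [{"index": 0, "ttl": "0 ms", "name": "deduplicateState"}]},
        {"id": 3, "type": "stream-exec-join_1",
         "state": [{"index": 0, "ttl": "0 ms", "name": "leftState"},
                   {"index": 1, "ttl": "0 ms", "name": "rightState"}]},
        {"id": 4, "type": "stream-exec-group-aggregate_1",
         "state": [{"index": 0, "ttl": "0 ms", "name": "groupAggregateState"}]},
    ]
}


def set_state_ttl(plan, node_id, ttl, state_index=None):
    """Return a copy of plan with ttl set on one node's state entries.

    If state_index is None, every state entry of the node is updated.
    """
    updated = copy.deepcopy(plan)
    for node in updated["nodes"]:
        if node["id"] != node_id:
            continue
        for entry in node.get("state", []):
            if state_index is None or entry["index"] == state_index:
                entry["ttl"] = ttl
        return updated
    raise KeyError(f"no node with id {node_id}")


tuned = set_state_ttl(PLAN, 2, "1 h")                  # de-duplication
tuned = set_state_ttl(tuned, 3, "1 d", state_index=0)  # left join input
tuned = set_state_ttl(tuned, 3, "2 d", state_index=1)  # right join input
tuned = set_state_ttl(tuned, 4, "7 d")                 # aggregation
ttl_by_state = {(n["id"], s["name"]): s["ttl"]
                for n in tuned["nodes"] for s in n["state"]}
```

Each stateful node is addressed directly, including the implicitly added
ChangelogNormalize node and the two separate inputs of the join, which is
the scoping that a hint written on the SQL text cannot express clearly.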

Let's see what other people think.

Best,
Jane

On Thu, Mar 23, 2023 at 6:00 PM Yisha Zhou <zhouyi...@bytedance.com.invalid>
wrote:

> Hi Jane,
> Thanks for driving this; your solution is really creative. In my
> company, there are also scenarios in which users want to configure
> state TTL at operator level, especially for window operators that treat
> TTL as allowed lateness.
>
> To support these scenarios, we implemented a query hint like the one below:
>
> SELECT /*+ STATE_TTL('1D') */
>     id,
>     max(num) as num
> FROM source
> GROUP BY id
>
> Regarding the reasons for rejecting SQL hints that you mentioned in the
> FLIP, I have some different opinions.
> 1. Hints do not only serve to guide the planner. For example, we have
> dynamic options for tables, and users can configure parallelism for
> source/sink operators. State TTL is also a kind of parameter that can be
> configured dynamically.
>
> 2. I agree that users need not understand how SQL statements are
> translated to operators. And exactly for this reason, the JSON plan is too
> complex for users to read. For example, if we enable
> 'table.optimizer.distinct-agg.split.enabled' and
> 'table.optimizer.incremental-agg-enabled', we get two Agg operators. I
> don't think all users know which operators are related to their queries or
> how to set TTL on them correctly.
>
> Therefore, I personally prefer to support this with query hints, so that
> users can configure TTL for their 'group by' query instead of for several
> operators.
>
> Best regards,
> Yisha
>
> > On Mar 21, 2023, at 19:51, Jane Chan <qingyue....@gmail.com> wrote:
> >
> > Hi devs,
> >
> > I'd like to start a discussion on FLIP-292: Support configuring state TTL
> > at operator level for Table API & SQL programs [1].
> >
> > Currently, we only support job-level state TTL configuration via
> > 'table.exec.state.ttl'. However, users may expect fine-grained state TTL
> > control to optimize state usage. Hence we propose to serialize/deserialize
> > the state TTL as metadata of the operator's state to/from the compiled
> > JSON plan, so that different state TTLs can be specified when
> > transforming the exec nodes into stateful operators.
> >
> > Look forward to your opinions!
> >
> > [1]
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=240883951
> >
> > Best Regards,
> > Jane Chan
>
>
