Hi Yisha,

Thank you for the valuable feedback! I appreciate that you find this proposal beneficial. I'd like to answer your questions and explain why we prefer not to use SQL hints.

> 1. Hints may not only aim to inspire the planner. For example, we have
> dynamic options for tables and users can config parallelism for source/sink
> operators. State TTL is also kind of parameter which can be configured
> dynamically.

The core reason we reject using hints to configure state TTL is not its dynamic configuration nature. In fact, configuring TTL through a compiled plan is also a form of dynamic adjustment. The key point is that state TTL affects the computed results. From the perspective of SQL semantics, hints must not change the result of a query. Adjusting parallelism, as in your example, is a good use case for hints because it does not change the result.

> 2. I agree that users need not to understand how SQL statements being
> translated to operators. And exactly for this reason, json plan is too
> complex for users to read. For example, if we enable
> 'table.optimizer.distinct-agg.split.enabled' and
> 'table.optimizer.incremental-agg-enabled', we get two Agg operators, I
> don't think all users know which operators are related to their queries and
> set TTL correctly.

Leaving aside the semantic point above, the second reason is that hints are written on SQL but actually take effect on operators, so the scope of a hint is not clear enough. For example, suppose a job first de-duplicates a source, then joins it with another source, and finally aggregates the result, and the user wants a different state TTL at each stage. In which part of the SQL should the user write each hint, and which part of the SQL should each hint act on? More importantly, some stateful operators are not reflected in SQL at all, such as ChangelogNormalize and SinkUpsertMaterializer, which the planner derives and adds implicitly.

You raise a good point that the compiled plan JSON is less readable, but it is accurate: every stateful operator, including the implicit ones, appears in it explicitly.
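
To illustrate what "accurate" means here, below is a minimal sketch in Python. It is only an illustration under my own assumptions: the plan layout and the names `nodes`, `type`, `state`, `ttl`, as well as the helper `set_state_ttl`, are hypothetical simplifications and not the exact format proposed in the FLIP. The point is that each stateful operator of the de-duplicate/join/aggregate job, including the implicit ChangelogNormalize and SinkUpsertMaterializer, shows up as its own entry that can be addressed unambiguously.

```python
import copy

# Hypothetical, simplified stand-in for a compiled plan (NOT the real format).
# "0 ms" is used here as a placeholder for the unchanged default value.
plan = {
    "nodes": [
        {"type": "changelog-normalize",
         "state": [{"name": "changelogNormalizeState", "ttl": "0 ms"}]},
        {"type": "deduplicate",
         "state": [{"name": "deduplicateState", "ttl": "0 ms"}]},
        {"type": "join",
         "state": [{"name": "leftState", "ttl": "0 ms"},
                   {"name": "rightState", "ttl": "0 ms"}]},
        {"type": "group-aggregate",
         "state": [{"name": "groupAggregateState", "ttl": "0 ms"}]},
        {"type": "sink-upsert-materializer",
         "state": [{"name": "sinkMaterializeState", "ttl": "0 ms"}]},
    ]
}


def set_state_ttl(plan, node_type, ttl, state_name=None):
    """Return a copy of `plan` with `ttl` set on the states of `node_type` nodes.

    If `state_name` is given, only the state with that name is changed.
    Raises ValueError when nothing matches, so typos do not pass silently.
    """
    updated = copy.deepcopy(plan)
    matched = 0
    for node in updated["nodes"]:
        if node["type"] != node_type:
            continue
        for state in node.get("state", []):
            if state_name is None or state["name"] == state_name:
                state["ttl"] = ttl
                matched += 1
    if matched == 0:
        raise ValueError(f"no state matched {node_type!r} / {state_name!r}")
    return updated


# A different TTL per stage, including an operator that never appears in SQL.
tuned = set_state_ttl(plan, "deduplicate", "1 h")
tuned = set_state_ttl(tuned, "join", "1 d", state_name="leftState")
tuned = set_state_ttl(tuned, "changelog-normalize", "12 h")
```

With a hint written on the SQL text, there is no equally direct way to say "only the left input state of the join" or "the ChangelogNormalize that the planner added".
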

Back to the split distinct agg case: I think users who enable this configuration are fully aware that their data is skewed and needs a two-layer group aggregate.

Let's see what other people think.

Best,
Jane

On Thu, Mar 23, 2023 at 6:00 PM Yisha Zhou <zhouyi...@bytedance.com.invalid> wrote:

> Hi Jane,
> Thanks for driving this, and your solution is absolutely creative. In my
> company, there also exist some scenarios in which users want to config
> state TTL at operator level, especially for window operators which regard
> TTL as allow lateness.
>
> To support this scenarios, we implemented a query hint like below:
>
> SELECT /*+ STATE_TTL('1D') */
>     id,
>     max(num) as num
> FROM source
> GROUP BY id
>
> For reasons to reject SQL Hints you mentioned in the FLIP, I have some
> different opinions.
> 1. Hints may not only aim to inspire the planner. For example, we have
> dynamic options for tables and users can config parallelism for source/sink
> operators. State TTL is also kind of parameter which can be configured
> dynamically.
>
> 2. I agree that users need not to understand how SQL statements being
> translated to operators. And exactly for this reason, json plan is too
> complex for users to read. For example, if we enable
> 'table.optimizer.distinct-agg.split.enabled' and
> 'table.optimizer.incremental-agg-enabled', we get two Agg operators, I
> don't think all users know which operators are related to their queries and
> set TTL correctly.
>
> Therefore, I personally prefer to support this by query hints and users
> can config TTL for their 'group by' query instead of several operators.
>
> Best regards,
> Yisha
>
> > On Mar 21, 2023, at 19:51, Jane Chan <qingyue....@gmail.com> wrote:
> >
> > Hi devs,
> >
> > I'd like to start a discussion on FLIP-292: Support configuring state TTL
> > at operator level for Table API & SQL programs [1].
> >
> > Currently, we only support job-level state TTL configuration via
> > 'table.exec.state.ttl'.
> > However, users may expect a fine-grained state TTL
> > control to optimize state usage. Hence we propose to serialize/deserialize
> > the state TTL as metadata of the operator's state to/from the compiled JSON
> > plan, to achieve the goal that specifying different state TTL when
> > transforming the exec node to stateful operators.
> >
> > Look forward to your opinions!
> >
> > [1]
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=240883951
> >
> > Best Regards,
> > Jane Chan