Re: [DISCUSS]FLIP-113: Support SQL and planner hints

Bowen Li Wed, 11 Mar 2020 10:13:17 -0700

A quick summary that focus of the discussion now shifts to be whether
semantic params like kafka 'starting offset' should be table
hints/properties, and if so, in what form.


I strongly believe the action of setting offset should *not* be part of a
table, neither hints nor properties, for all the good reasons mentioned
above. It doesn't describe the physical table/storage. It's indeed a filter
describing what kind of data users want to query on top of all present,
physical data.

One way to think of that is 'offset' is a system column of Kafka, and Flink
queries can filter on this system column. E.g. Postgres has a column 'xmin'
to track commit timestamp automatically and users can filter on it [1].
Flink connectors should be able to define it's own system columns,
recognize them from filters pushed down by Flink SQL framework, and apply
accordingly.

[1]
https://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.5#Commit_timestamp_tracking



On Wed, Mar 11, 2020 at 6:20 AM Aljoscha Krettek <[email protected]>
wrote:

> Hi,
>
> I don't understand this discussion. Hints, as I understand them, should
> work like this:
>
> - hints are *optional* advice for the optimizer to try and help it to
> find a good execution strategy
> - hints should not change query semantics, i.e. they should not change
> connector properties executing a query with taking into account the
> hints *must* produce the same result as executing the query without
> taking into account the hints
>
>  From these simple requirements you can derive a solution that makes
> sense. I don't have a strong preference for the syntax but we should
> strive to be in line with prior work.
>
> Best,
> Aljoscha
>
> On 11.03.20 11:53, Danny Chan wrote:
> > Thanks Timo for summarize the 3 options ~
> >
> > I agree with Kurt that option2 is too complicated to use because:
> >
> > • As a Kafka topic consumer, the user must define both the virtual
> column for start offset and he must apply a special filter predicate after
> each query
> > • And for the internal implementation, the metadata column push down is
> another hard topic, each kind of message queue may have its offset
> attribute, we need to consider the expression type for different kind; the
> source also need to recognize the constant column as a config option(which
> is weird because usually what we pushed down is a table column)
> >
> > For option 1 and option3, I think there is no difference, option1 is
> also a hint syntax which is introduced in Sybase and referenced then
> deprecated by MS-SQL in 199X years because of the ambitiousness. Personally
> I prefer /*+ */ style table hint than WITH keyword for these reasons:
> >
> > • We do not break the standard SQL, the hints are nested in SQL comments
> > • We do not need to introduce additional WITH keyword which may appear
> in a query if we use that because a table can be referenced in all kinds of
> SQL contexts: INSERT/DELETE/FROM/JOIN …. That would make our sql query
> break too much of the SQL from standard
> > • We would have uniform syntax for hints as query hint, one syntax fits
> all and more easy to use
> >
> >
> > And here is the reason why we choose a uniform Oracle style query
> hint syntax which is addressed by Julian Hyde when we design the syntax
> from the Calcite community:
> >
> > I don’t much like the MSSQL-style syntax for table hints. It adds a new
> use of the WITH keyword that is unrelated to the use of WITH for
> common-table expressions.
> >
> > A historical note. Microsoft SQL Server inherited its hint syntax from
> Sybase a very long time ago. (See “Transact SQL Programming”[1], page 632,
> “Optimizer hints”. The book was written in 1999, and covers Microsoft SQL
> Server 6.5 / 7.0 and Sybase Adaptive Server 11.5, but the syntax very
> likely predates Sybase 4.3, from which Microsoft SQL Server was forked in
> 1993.)
> >
> > Microsoft later added the WITH keyword to make it less ambiguous, and
> has now deprecated the syntax that does not use WITH.
> >
> > They are forced to keep the syntax for backwards compatibility but that
> doesn’t mean that we should shoulder their burden.
> >
> > I think formatted comments are the right container for hints because it
> allows us to change the hint syntax without changing the SQL parser, and
> makes clear that we are at liberty to ignore hints entirely.
> >
> > Julian
> >
> > [1] https://www.amazon.com/s?k=9781565924017 <
> https://www.amazon.com/s?k=9781565924017>
> >
> > Best,
> > Danny Chan
> > 在 2020年3月11日 +0800 PM4:03，Timo Walther <[email protected]>，写道：
> >> Hi Danny,
> >>
> >> it is true that our DDL is not standard compliant by using the WITH
> >> clause. Nevertheless, we aim for not diverging too much and the LIKE
> >> clause is an example of that. It will solve things like overwriting
> >> WATERMARKs, add additional/modifying properties and inherit schema.
> >>
> >> Bowen is right that Flink's DDL is mixing 3 types definition together.
> >> We are not the first ones that try to solve this. There is also the SQL
> >> MED standard [1] that tried to tackle this problem. I think it was not
> >> considered when designing the current DDL.
> >>
> >> Currently, I see 3 options for handling Kafka offsets. I will give some
> >> examples and look forward to feedback here:
> >>
> >> *Option 1* Runtime and semantic parms as part of the query
> >>
> >> `SELECT * FROM MyTable('offset'=123)`
> >>
> >> Pros:
> >> - Easy to add
> >> - Parameters are part of the main query
> >> - No complicated hinting syntax
> >>
> >> Cons:
> >> - Not SQL compliant
> >>
> >> *Option 2* Use metadata in query
> >>
> >> `CREATE TABLE MyTable (id INT, offset AS SYSTEM_METADATA('offset'))`
> >>
> >> `SELECT * FROM MyTable WHERE offset > TIMESTAMP '2012-12-12 12:34:22'`
> >>
> >> Pros:
> >> - SQL compliant in the query
> >> - Access of metadata in the DDL which is required anyway
> >> - Regular pushdown rules apply
> >>
> >> Cons:
> >> - Users need to add an additional comlumn in the DDL
> >>
> >> *Option 3*: Use hints for properties
> >>
> >> `
> >> SELECT *
> >> FROM MyTable /*+ PROPERTIES('offset'=123) */
> >> `
> >>
> >> Pros:
> >> - Easy to add
> >>
> >> Cons:
> >> - Parameters are not part of the main query
> >> - Cryptic syntax for new users
> >> - Not standard compliant.
> >>
> >> If we go with this option, I would suggest to make it available in a
> >> separate map and don't mix it with statically defined properties. Such
> >> that the factory can decide which properties have the right to be
> >> overwritten by the hints:
> >> TableSourceFactory.Context.getQueryHints(): ReadableConfig
> >>
> >> Regards,
> >> Timo
> >>
> >> [1] https://en.wikipedia.org/wiki/SQL/MED
> >>
> >> Currently I see 3 options as a
> >>
> >>
> >> On 11.03.20 07:21, Danny Chan wrote:
> >>> Thanks Bowen ~
> >>>
> >>> I agree we should somehow categorize our connector parameters.
> >>>
> >>> For type1, I’m already preparing a solution like the Confluent schema
> registry + Avro schema inference thing, so this may not be a problem in the
> near future.
> >>>
> >>> For type3, I have some questions:
> >>>
> >>>> "SELECT * FROM mykafka WHERE offset > 12pm yesterday”
> >>>
> >>> Where does the offset column come from, a virtual column from the
> table schema, you said that
> >>>
> >>>> They change
> >>> almost every time a query starts and have nothing to do with metadata,
> thus
> >>> should not be part of table definition/DDL
> >>>
> >>> But why you can reference it in the query, I’m confused for that, can
> you elaborate a little ?
> >>>
> >>> Best,
> >>> Danny Chan
> >>> 在 2020年3月11日 +0800 PM12:52，Bowen Li <[email protected]>，写道：
> >>>> Thanks Danny for kicking off the effort
> >>>>
> >>>> The root cause of too much manual work is Flink DDL has mixed 3 types
> of
> >>>> params together and doesn't handle each of them very well. Below are
> how I
> >>>> categorize them and corresponding solutions in my mind:
> >>>>
> >>>> - type 1: Metadata of external data, like external endpoint/url,
> >>>> username/pwd, schemas, formats.
> >>>>
> >>>> Such metadata are mostly already accessible in external system as
> long as
> >>>> endpoints and credentials are provided. Flink can get it thru
> catalogs, but
> >>>> we haven't had many catalogs yet and thus Flink just hasn't been able
> to
> >>>> leverage that. So the solution should be building more catalogs. Such
> >>>> params should be part of a Flink table DDL/definition, and not
> overridable
> >>>> in any means.
> >>>>
> >>>>
> >>>> - type 2: Runtime params, like jdbc connector's fetch size,
> elasticsearch
> >>>> connector's bulk flush size.
> >>>>
> >>>> Such params don't affect query results, but affect how results are
> produced
> >>>> (eg. fast or slow, aka performance) - they are essentially execution
> and
> >>>> implementation details. They change often in exploration or
> development
> >>>> stages, but not quite frequently in well-defined long-running
> pipelines.
> >>>> They should always have default values and can be missing in query.
> They
> >>>> can be part of a table DDL/definition, but should also be replaceable
> in a
> >>>> query - *this is what table "hints" in FLIP-113 should cover*.
> >>>>
> >>>>
> >>>> - type 3: Semantic params, like kafka connector's start offset.
> >>>>
> >>>> Such params affect query results - the semantics. They'd better be as
> >>>> filter conditions in WHERE clause that can be pushed down. They change
> >>>> almost every time a query starts and have nothing to do with
> metadata, thus
> >>>> should not be part of table definition/DDL, nor be persisted in
> catalogs.
> >>>> If they will, users should create views to keep such params around
> (note
> >>>> this is different from variable substitution).
> >>>>
> >>>>
> >>>> Take Flink-Kafka as an example. Once we get these params right,
> here're the
> >>>> steps users need to do to develop and run a Flink job:
> >>>> - configure a Flink ConfluentSchemaRegistry with url, username, and
> password
> >>>> - run "SELECT * FROM mykafka WHERE offset > 12pm yesterday"
> (simplified
> >>>> timestamp) in SQL CLI, Flink automatically retrieves all metadata of
> >>>> schema, file format, etc and start the job
> >>>> - users want to make the job read Kafka topic faster, so it goes as
> "SELECT
> >>>> * FROM mykafka /* faster_read_key=value*/ WHERE offset > 12pm
> yesterday"
> >>>> - done and satisfied, users submit it to production
> >>>>
> >>>>
> >>>> Regarding "CREATE TABLE t LIKE with (k1=v1, k2=v2), I think it's a
> >>>> nice-to-have feature, but not a strategically critical, long-term
> solution,
> >>>> because
> >>>> 1) It may seem promising at the current stage to solve the
> >>>> too-much-manual-work problem, but that's only because Flink hasn't
> >>>> leveraged catalogs well and handled the 3 types of params above
> properly.
> >>>> Once we get the params types right, the LIKE syntax won't be that
> >>>> important, and will be just an easier way to create tables without
> retyping
> >>>> long fields like username and pwd.
> >>>> 2) Note that only some rare type of catalog can store k-v property
> pair, so
> >>>> table created this way often cannot be persisted. In the foreseeable
> >>>> future, such catalog will only be HiveCatalog, and not everyone has a
> Hive
> >>>> metastore. To be honest, without persistence, recreating tables every
> time
> >>>> this way is still a lot of keyboard typing.
> >>>>
> >>>> Cheers,
> >>>> Bowen
> >>>>
> >>>> On Tue, Mar 10, 2020 at 8:07 PM Kurt Young <[email protected]> wrote:
> >>>>
> >>>>> If a specific connector want to have such parameter and read if out
> of
> >>>>> configuration, then that's fine.
> >>>>> If we are talking about a configuration for all kinds of sources, I
> would
> >>>>> be super careful about that.
> >>>>> It's true it can solve maybe 80% cases, but it will also make the
> left 20%
> >>>>> feels weird.
> >>>>>
> >>>>> Best,
> >>>>> Kurt
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 11, 2020 at 11:00 AM Jark Wu <[email protected]> wrote:
> >>>>>
> >>>>>> Hi Kurt,
> >>>>>>
> >>>>>> #3 Regarding to global offset:
> >>>>>> I'm not saying to use the global configuration to override connector
> >>>>>> properties by the planner.
> >>>>>> But the connector should take this configuration and translate into
> their
> >>>>>> client API.
> >>>>>> AFAIK, almost all the message queues support eariliest and latest
> and a
> >>>>>> timestamp value as start point.
> >>>>>> So we can support 3 options for this configuration: "eariliest",
> "latest"
> >>>>>> and a timestamp string value.
> >>>>>> Of course, this can't solve 100% cases, but I guess can sovle 80%
> or 90%
> >>>>>> cases.
> >>>>>> And the remaining cases can be resolved by LIKE syntax which I
> guess is
> >>>>> not
> >>>>>> very common cases.
> >>>>>>
> >>>>>> Best,
> >>>>>> Jark
> >>>>>>
> >>>>>>
> >>>>>> On Wed, 11 Mar 2020 at 10:33, Kurt Young <[email protected]> wrote:
> >>>>>>
> >>>>>>> Good to have such lovely discussions. I also want to share some of
> my
> >>>>>>> opinions.
> >>>>>>>
> >>>>>>> #1 Regarding to error handling: I also think ignore invalid hints
> would
> >>>>>> be
> >>>>>>> dangerous, maybe
> >>>>>>> the simplest solution is just throw an exception.
> >>>>>>>
> >>>>>>> #2 Regarding to property replacement: I don't think we should
> >>>>> constraint
> >>>>>>> ourself to
> >>>>>>> the meaning of the word "hint", and forbidden it modifying any
> >>>>> properties
> >>>>>>> which can effect
> >>>>>>> query results. IMO `PROPERTIES` is one of the table hints, and a
> >>>>> powerful
> >>>>>>> one. It can
> >>>>>>> modify properties located in DDL's WITH block. But I also see the
> harm
> >>>>>> that
> >>>>>>> if we make it
> >>>>>>> too flexible like change the kafka topic name with a hint. Such use
> >>>>> case
> >>>>>> is
> >>>>>>> not common and
> >>>>>>> sounds very dangerous to me. I would propose we have a map of
> hintable
> >>>>>>> properties for each
> >>>>>>> connector, and should validate all passed in properties are
> actually
> >>>>>>> hintable. And combining with
> >>>>>>> #1 error handling, we can throw an exception once received invalid
> >>>>>>> property.
> >>>>>>>
> >>>>>>> #3 Regarding to global offset: I'm not sure it's feasible.
> Different
> >>>>>>> connectors will have totally
> >>>>>>> different properties to represent offset, some might be timestamps,
> >>>>> some
> >>>>>>> might be string literals
> >>>>>>> like "earliest", and others might be just integers.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Kurt
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Mar 10, 2020 at 11:46 PM Jark Wu <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> I want to jump in the discussion about the "dynamic start offset"
> >>>>>>> problem.
> >>>>>>>> First of all, I share the same concern with Timo and Fabian, that
> the
> >>>>>>>> "start offset" affects the query semantics, i.e. the query result.
> >>>>>>>> But "hints" is just used for optimization which should affect the
> >>>>>> result?
> >>>>>>>>
> >>>>>>>> I think the "dynamic start offset" is an very important usability
> >>>>>> problem
> >>>>>>>> which will be faced by many streaming platforms.
> >>>>>>>> I also agree "CREATE TEMPORARY TABLE Temp (LIKE t) WITH
> >>>>>>>> ('connector.startup-timestamp-millis' = '1578538374471')" is
> verbose,
> >>>>>>> what
> >>>>>>>> if we have 10 tables to join?
> >>>>>>>>
> >>>>>>>> However, what I want to propose (should be another thread) is a
> >>>>> global
> >>>>>>>> configuration to reset start offsets of all the source connectors
> >>>>>>>> in the query session, e.g. "table.sources.start-offset". This is
> >>>>>> possible
> >>>>>>>> now because `TableSourceFactory.Context` has `getConfiguration`
> >>>>>>>> method to get the session configuration, and use it to create an
> >>>>>> adapted
> >>>>>>>> TableSource.
> >>>>>>>> Then we can also expose to SQL CLI via SET command, e.g. `SET
> >>>>>>>> 'table.sources.start-offset'='earliest';`, which is pretty simple
> and
> >>>>>>>> straightforward.
> >>>>>>>>
> >>>>>>>> This is very similar to KSQL's `SET
> 'auto.offset.reset'='earliest'`
> >>>>>> which
> >>>>>>>> is very helpful IMO.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jark
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, 10 Mar 2020 at 22:29, Timo Walther <[email protected]>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Danny,
> >>>>>>>>>
> >>>>>>>>> compared to the hints, FLIP-110 is fully compliant to the SQL
> >>>>>> standard.
> >>>>>>>>>
> >>>>>>>>> I don't think that `CREATE TEMPORARY TABLE Temp (LIKE t) WITH
> >>>>> (k=v)`
> >>>>>> is
> >>>>>>>>> too verbose or awkward for the power of basically changing the
> >>>>> entire
> >>>>>>>>> connector. Usually, this statement would just precede the query
> in
> >>>>> a
> >>>>>>>>> multiline file. So it can be change "in-place" like the hints you
> >>>>>>>> proposed.
> >>>>>>>>>
> >>>>>>>>> Many companies have a well-defined set of tables that should be
> >>>>> used.
> >>>>>>> It
> >>>>>>>>> would be dangerous if users can change the path or topic in a
> hint.
> >>>>>> The
> >>>>>>>>> catalog/catalog manager should be the entity that controls which
> >>>>>> tables
> >>>>>>>>> exist and how they can be accessed.
> >>>>>>>>>
> >>>>>>>>>> what’s the problem there if we user the table hints to support
> >>>>>>> “start
> >>>>>>>>> offset”?
> >>>>>>>>>
> >>>>>>>>> IMHO it violates the meaning of a hint. According to the
> >>>>> dictionary,
> >>>>>> a
> >>>>>>>>> hint is "a statement that expresses indirectly what one prefers
> not
> >>>>>> to
> >>>>>>>>> say explicitly". But offsets are a property that are very
> explicit.
> >>>>>>>>>
> >>>>>>>>> If we go with the hint approach, it should be expressible in the
> >>>>>>>>> TableSourceFactory which properties are supported for hinting. Or
> >>>>> do
> >>>>>>> you
> >>>>>>>>> plan to offer those hints in a separate Map<String, String> that
> >>>>>> cannot
> >>>>>>>>> overwrite existing properties? I think this would be a different
> >>>>>>> story...
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Timo
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 10.03.20 10:34, Danny Chan wrote:
> >>>>>>>>>> Thanks Timo ~
> >>>>>>>>>>
> >>>>>>>>>> Personally I would say that offset > 0 and start offset = 10
> does
> >>>>>> not
> >>>>>>>>> have the same semantic, so from the SQL aspect, we can not
> >>>>> implement
> >>>>>> a
> >>>>>>>>> “starting offset” hint for query with such a syntax.
> >>>>>>>>>>
> >>>>>>>>>> And the CREATE TABLE LIKE syntax is a DDL which is just verbose
> >>>>> for
> >>>>>>>>> defining such dynamic parameters even if it could do that, shall
> we
> >>>>>>> force
> >>>>>>>>> users to define a temporal table for each query with dynamic
> >>>>> params,
> >>>>>> I
> >>>>>>>>> would say it’s an awkward solution.
> >>>>>>>>>>
> >>>>>>>>>> "Hints should give "hints" but not affect the actual produced
> >>>>>>> result.”
> >>>>>>>>> You mentioned that multiple times and could we give a reason,
> >>>>> what’s
> >>>>>>> the
> >>>>>>>>> problem there if we user the table hints to support “start
> offset”
> >>>>> ?
> >>>>>>> From
> >>>>>>>>> my side I saw some benefits for that:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> • It’s very convent to set up these parameters, the syntax is
> >>>>> very
> >>>>>>> much
> >>>>>>>>> like the DDL definition
> >>>>>>>>>> • It’s scope is very clear, right on the table it attathed
> >>>>>>>>>> • It does not affect the table schema, which means in order to
> >>>>>>> specify
> >>>>>>>>> the offset, there is no need to define an offset column which is
> >>>>>> weird
> >>>>>>>>> actually, offset should never be a column, it’s more like a
> >>>>> metadata
> >>>>>>> or a
> >>>>>>>>> start option.
> >>>>>>>>>>
> >>>>>>>>>> So in total, FLIP-110 uses the offset more like a Hive partition
> >>>>>>> prune,
> >>>>>>>>> we can do that if we have an offset column, but most of the case
> we
> >>>>>> do
> >>>>>>>> not
> >>>>>>>>> define that, so there is actually no conflict or overlap.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Danny Chan
> >>>>>>>>>> 在 2020年3月10日 +0800 PM4:28，Timo Walther <[email protected]>，写道：
> >>>>>>>>>>> Hi Danny,
> >>>>>>>>>>>
> >>>>>>>>>>> shouldn't FLIP-110[1] solve most of the problems we have around
> >>>>>>>> defining
> >>>>>>>>>>> table properties more dynamically without manual schema work?
> >>>>> Also
> >>>>>>>>>>> offset definition is easier with such a syntax. They must not
> be
> >>>>>>>> defined
> >>>>>>>>>>> in catalog but could be temporary tables that extend from the
> >>>>>>> original
> >>>>>>>>>>> table.
> >>>>>>>>>>>
> >>>>>>>>>>> In general, we should aim to keep the syntax concise and don't
> >>>>>>> provide
> >>>>>>>>>>> too many ways of doing the same thing. Hints should give
> "hints"
> >>>>>> but
> >>>>>>>> not
> >>>>>>>>>>> affect the actual produced result.
> >>>>>>>>>>>
> >>>>>>>>>>> Some connector properties might also change the plan or schema
> >>>>> in
> >>>>>>> the
> >>>>>>>>>>> future. E.g. they might also define whether a table source
> >>>>>> supports
> >>>>>>>>>>> certain push-downs (e.g. predicate push-down).
> >>>>>>>>>>>
> >>>>>>>>>>> Dawid is currently working a draft that might makes it possible
> >>>>> to
> >>>>>>>>>>> expose a Kafka offset via the schema such that `SELECT * FROM
> >>>>>> Topic
> >>>>>>>>>>> WHERE offset > 10` would become possible and could be pushed
> >>>>> down.
> >>>>>>> But
> >>>>>>>>>>> this is of course, not planned initially.
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Timo
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> [1]
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-110%3A+Support+LIKE+clause+in+CREATE+TABLE
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 10.03.20 08:34, Danny Chan wrote:
> >>>>>>>>>>>> Thanks Wenlong ~
> >>>>>>>>>>>>
> >>>>>>>>>>>> For PROPERTIES Hint Error handling
> >>>>>>>>>>>>
> >>>>>>>>>>>> Actually we have no way to figure out whether a error prone
> >>>>> hint
> >>>>>>> is a
> >>>>>>>>> PROPERTIES hint, for example, if use writes a hint like
> >>>>> ‘PROPERTIAS’,
> >>>>>>> we
> >>>>>>>> do
> >>>>>>>>> not know if this hint is a PROPERTIES hint, what we know is that
> >>>>> the
> >>>>>>> hint
> >>>>>>>>> name was not registered in our Flink.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If the user writes the hint name correctly (i.e. PROPERTIES),
> >>>>> we
> >>>>>>> did
> >>>>>>>>> can enforce the validation of the hint options though the
> pluggable
> >>>>>>>>> HintOptionChecker.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For PROPERTIES Hint Option Format
> >>>>>>>>>>>>
> >>>>>>>>>>>> For a key value style hint option, the key can be either a
> >>>>> simple
> >>>>>>>>> identifier or a string literal, which means that it’s compatible
> >>>>> with
> >>>>>>> our
> >>>>>>>>> DDL syntax. We support simple identifier because many other hints
> >>>>> do
> >>>>>>> not
> >>>>>>>>> have the component complex keys like the table properties, and we
> >>>>>> want
> >>>>>>> to
> >>>>>>>>> unify the parse block.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Danny Chan
> >>>>>>>>>>>> 在 2020年3月10日 +0800 PM3:19，wenlong.lwl <
> [email protected]
> >>>>>>>> ，写道：
> >>>>>>>>>>>>> Hi Danny, thanks for the proposal. +1 for adding table hints,
> >>>>> it
> >>>>>>> is
> >>>>>>>>> really
> >>>>>>>>>>>>> a necessary feature for flink sql to integrate with a
> catalog.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For error handling, I think it would be more natural to throw
> >>>>> an
> >>>>>>>>>>>>> exception when error table hint provided, because the
> >>>>> properties
> >>>>>>> in
> >>>>>>>>> hint
> >>>>>>>>>>>>> will be merged and used to find the table factory which would
> >>>>>>> cause
> >>>>>>>> an
> >>>>>>>>>>>>> exception when error properties provided, right? On the other
> >>>>>>> hand,
> >>>>>>>>> unlike
> >>>>>>>>>>>>> other hints which just affect the way to execute the query,
> >>>>> the
> >>>>>>>>> property
> >>>>>>>>>>>>> table hint actually affects the result of the query, we
> should
> >>>>>>> never
> >>>>>>>>> ignore
> >>>>>>>>>>>>> the given property hints.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For the format of property hints, currently, in sql client,
> we
> >>>>>>>> accept
> >>>>>>>>>>>>> properties in format of string only in DDL:
> >>>>>>>> 'connector.type'='kafka',
> >>>>>>>>> I
> >>>>>>>>>>>>> think the format of properties in hint should be the same as
> >>>>> the
> >>>>>>>>> format we
> >>>>>>>>>>>>> defined in ddl. What do you think?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Bests,
> >>>>>>>>>>>>> Wenlong Lyu
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, 10 Mar 2020 at 14:22, Danny Chan <
> >>>>> [email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> To Weike: About the Error Handing
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> To be consistent with other SQL vendors, the default is to
> >>>>> log
> >>>>>>>>> warnings
> >>>>>>>>>>>>>> and if there is any error (invalid hint name or options),
> the
> >>>>>>> hint
> >>>>>>>>> is just
> >>>>>>>>>>>>>> ignored. I have already addressed in the wiki.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> To Timo: About the PROPERTIES Table Hint
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> • The properties hints is also optional, user can pass in an
> >>>>>>> option
> >>>>>>>>> to
> >>>>>>>>>>>>>> override the table properties but this does not mean it is
> >>>>>>>> required.
> >>>>>>>>>>>>>> • They should not include semantics: does the properties
> >>>>> belong
> >>>>>>> to
> >>>>>>>>>>>>>> semantic ? I don't think so, the plan does not change right
> ?
> >>>>>> The
> >>>>>>>>> result
> >>>>>>>>>>>>>> set may be affected, but there are already some hints do so,
> >>>>>> for
> >>>>>>>>> example,
> >>>>>>>>>>>>>> MS-SQL MAXRECURSION and SNAPSHOT hint [1]
> >>>>>>>>>>>>>> • `SELECT * FROM t(k=v, k=v)`: this grammar breaks the SQL
> >>>>>>> standard
> >>>>>>>>>>>>>> compared to the hints way(which is included in comments)
> >>>>>>>>>>>>>> • I actually didn't found any vendors to support such
> >>>>> grammar,
> >>>>>>> and
> >>>>>>>>> there
> >>>>>>>>>>>>>> is no way to override table level properties dynamically.
> For
> >>>>>>>> normal
> >>>>>>>>> RDBMS,
> >>>>>>>>>>>>>> I think there are no requests for such dynamic parameters
> >>>>>> because
> >>>>>>>>> all the
> >>>>>>>>>>>>>> table have the same storage and computation and they are
> >>>>> almost
> >>>>>>> all
> >>>>>>>>> batch
> >>>>>>>>>>>>>> tables.
> >>>>>>>>>>>>>> • While Flink as a computation engine has many connectors,
> >>>>>>>>> especially for
> >>>>>>>>>>>>>> some message queue like Kafka, we would have a start_offset
> >>>>>> which
> >>>>>>>> is
> >>>>>>>>>>>>>> different each time we start the query, such parameters can
> >>>>> not
> >>>>>>> be
> >>>>>>>>>>>>>> persisted to catalog, because it’s not static, this is
> >>>>> actually
> >>>>>>> the
> >>>>>>>>>>>>>> background we propose the table hints to indicate such
> >>>>>> properties
> >>>>>>>>>>>>>> dynamically.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> To Jark and Jinsong: I have removed the query hints part and
> >>>>>>> change
> >>>>>>>>> the
> >>>>>>>>>>>>>> title.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-query?view=sql-server-ver15
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Danny Chan
> >>>>>>>>>>>>>> 在 2020年3月9日 +0800 PM5:46，Timo Walther <[email protected]
> >>>>>> ，写道：
> >>>>>>>>>>>>>>> Hi Danny,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> thanks for the proposal. I agree with Jark and Jingsong.
> >>>>>> Planner
> >>>>>>>>> hints
> >>>>>>>>>>>>>>> and table hints are orthogonal topics that should be
> >>>>> discussed
> >>>>>>>>>>>>>> separately.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I share Jingsong's opinion that we should not use planner
> >>>>>> hints
> >>>>>>>> for
> >>>>>>>>>>>>>>> passing connector properties. Planner hints should be
> >>>>> optional
> >>>>>>> at
> >>>>>>>>> any
> >>>>>>>>>>>>>>> time. They should not include semantics but only affect
> >>>>>>> execution
> >>>>>>>>> time.
> >>>>>>>>>>>>>>> Connector properties are an important part of the query
> >>>>>> itself.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Have you thought about options such as `SELECT * FROM
> t(k=v,
> >>>>>>>> k=v)`?
> >>>>>>>>> How
> >>>>>>>>>>>>>>> are other vendors deal with this problem?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>> Timo
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 09.03.20 10:37, Jingsong Li wrote:
> >>>>>>>>>>>>>>>> Hi Danny, +1 for table hints, thanks for driving.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I took a look to FLIP, most of content are talking about
> >>>>>> query
> >>>>>>>>> hints.
> >>>>>>>>>>>>>> It is
> >>>>>>>>>>>>>>>> hard to discussion and voting. So +1 to split it as Jark
> >>>>>> said.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Another thing is configuration that suitable to config
> with
> >>>>>>> table
> >>>>>>>>>>>>>> hints:
> >>>>>>>>>>>>>>>> "connector.path" and "connector.topic", Are they really
> >>>>>>> suitable
> >>>>>>>>> for
> >>>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>> hints? Looks weird to me. Because I think these properties
> >>>>>> are
> >>>>>>>> the
> >>>>>>>>>>>>>> core of
> >>>>>>>>>>>>>>>> table.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>> Jingsong Lee
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Mon, Mar 9, 2020 at 5:30 PM Jark Wu <[email protected]>
> >>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks Danny for starting the discussion.
> >>>>>>>>>>>>>>>>> +1 for this feature.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> If we just focus on the table hints not the query hints
> in
> >>>>>>> this
> >>>>>>>>>>>>>> release,
> >>>>>>>>>>>>>>>>> could you split the FLIP into two FLIPs?
> >>>>>>>>>>>>>>>>> Because it's hard to vote on partial part of a FLIP. You
> >>>>> can
> >>>>>>>> keep
> >>>>>>>>>>>>>> the table
> >>>>>>>>>>>>>>>>> hints proposal in FLIP-113 and move query hints into
> >>>>> another
> >>>>>>>> FLIP.
> >>>>>>>>>>>>>>>>> So that we can focuse on the table hints in the FLIP.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Mon, 9 Mar 2020 at 17:14, DONG, Weike <
> >>>>>>>> [email protected]
> >>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Danny,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> This is a nice feature, +1.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> One thing I am interested in but not mentioned in the
> >>>>>>> proposal
> >>>>>>>> is
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> error
> >>>>>>>>>>>>>>>>>> handling, as it is quite common for users to write
> >>>>>>>> inappropriate
> >>>>>>>>>>>>>> hints in
> >>>>>>>>>>>>>>>>>> SQL code, if illegal or "bad" hints are given, would the
> >>>>>>> system
> >>>>>>>>>>>>>> simply
> >>>>>>>>>>>>>>>>>> ignore them or throw exceptions?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks : )
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>> Weike
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Mon, Mar 9, 2020 at 5:02 PM Danny Chan <
> >>>>>>>> [email protected]>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Note:
> >>>>>>>>>>>>>>>>>>> we only plan to support table hints in Flink release
> >>>>> 1.11,
> >>>>>>> so
> >>>>>>>>>>>>>> please
> >>>>>>>>>>>>>>>>>> focus
> >>>>>>>>>>>>>>>>>>> mainly on the table hints part and just ignore the
> >>>>> planner
> >>>>>>>>>>>>>> hints, sorry
> >>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>> that mistake ~
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>> Danny Chan
> >>>>>>>>>>>>>>>>>>> 在 2020年3月9日 +0800 PM4:36，Danny Chan <
> >>>>> [email protected]
> >>>>>>>>> ，写道：
> >>>>>>>>>>>>>>>>>>>> Hi, fellows ~
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I would like to propose the supports for SQL hints for
> >>>>>> our
> >>>>>>>>>>>>>> Flink SQL.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> We would support hints syntax as following:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> select /*+ NO_HASH_JOIN, RESOURCE(mem='128mb',
> >>>>>>>>>>>>>> parallelism='24') */
> >>>>>>>>>>>>>>>>>>>> from
> >>>>>>>>>>>>>>>>>>>> emp /*+ INDEX(idx1, idx2) */
> >>>>>>>>>>>>>>>>>>>> join
> >>>>>>>>>>>>>>>>>>>> dept /*+ PROPERTIES(k1='v1', k2='v2') */
> >>>>>>>>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>>>>>> emp.deptno = dept.deptno
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Basically we would support both query hints(after the
> >>>>>>> SELECT
> >>>>>>>>>>>>>> keyword)
> >>>>>>>>>>>>>>>>>>> and table hints(after the referenced table name), for
> >>>>>> 1.11,
> >>>>>>> we
> >>>>>>>>>>>>>> plan to
> >>>>>>>>>>>>>>>>>> only
> >>>>>>>>>>>>>>>>>>> support table hints with a hint probably named
> >>>>> PROPERTIES:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> table_name /*+ PROPERTIES(k1='v1', k2='v2') *+/
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I am looking forward to your comments.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> You can access the FLIP here:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-113%3A+SQL+and+Planner+Hints
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>> Danny Chan
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
> >
>
>

Re: [DISCUSS]FLIP-113: Support SQL and planner hints

Reply via email to