Re: [DISCUSS]FLIP-113: Support SQL and planner hints

Benchao Li Wed, 11 Mar 2020 19:23:14 -0700

Hi all,

Thanks Danny for bring up this great discussion, generally hints is a great
feature for SQL.
And the discussions are very insightful, I'd like to share some ideas here.


About error handling,
+1 to throw exception by default. IMHO, hints are parts of the query, which
should be validated before
execution. If we just log errors, maybe it's hard for users to debug their
job when the behavior of the
job differs from what they expects. An early validation and error throwing
can enforce users to make sure
their query is valid for both SQL and hints.

About whether hints can affect results,
Before the disscution, I thought that hints maybe useful to do many things
we cannot do with SQL standard,
e.g.
- Fast Emit for WindowAggregate
- State retention time for general group by and join
- allow lateness for WindowAggregate
- keyby before temporal join
- mini batch config
- local-global aggregate
Most of these things are configured globally, which is very inconvenient to
use in complex scenario.
Some of these can affect results, and IMHO it's ok to do so, because
Steaming SQL is a relative new
concept which differs in many aspects from Batch SQL.

About whether allow to add connector properties in hints,
IMHO, it's good to keep the grammar concise. If we can do that already with
connector properties, and
LIKE feature recently proposed, maybe we don't need to add another way to
do that. A clean and concise
grammar can benefit users a lot especially for users like inf team in each
compony. They maybe face
a lot internal users, and the variety of ways to do the same thing can
bring much burden to them.

About global `set xxx=xxx`,
it's a good idea to have this ability. IMHO, it's useful for some global
job configurations,
e.g.
- global parallelism
- checkpoint config
- auto watermark interval
- time zone
- serializers
However, I don't think it's a good idea to have some connector configs like
'auto.offset.reset'.
KSQL does that because it only deals with Kafka, and 'auto.offset.reset' is
indeed a global config
for that scenario. Flink is a general streaming (and batch) engine, which
IMHO we cannot assume all
connectors have some common configs like that. It's good to keep connectors
config in each table source
definition, even we need to set it multiple times.

Bowen Li <[email protected]> 于2020年3月12日周四 上午1:12写道：

> A quick summary that focus of the discussion now shifts to be whether
> semantic params like kafka 'starting offset' should be table
> hints/properties, and if so, in what form.
>
> I strongly believe the action of setting offset should *not* be part of a
> table, neither hints nor properties, for all the good reasons mentioned
> above. It doesn't describe the physical table/storage. It's indeed a filter
> describing what kind of data users want to query on top of all present,
> physical data.
>
> One way to think of that is 'offset' is a system column of Kafka, and Flink
> queries can filter on this system column. E.g. Postgres has a column 'xmin'
> to track commit timestamp automatically and users can filter on it [1].
> Flink connectors should be able to define it's own system columns,
> recognize them from filters pushed down by Flink SQL framework, and apply
> accordingly.
>
> [1]
>
> https://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.5#Commit_timestamp_tracking
>
>
>
> On Wed, Mar 11, 2020 at 6:20 AM Aljoscha Krettek <[email protected]>
> wrote:
>
> > Hi,
> >
> > I don't understand this discussion. Hints, as I understand them, should
> > work like this:
> >
> > - hints are *optional* advice for the optimizer to try and help it to
> > find a good execution strategy
> > - hints should not change query semantics, i.e. they should not change
> > connector properties executing a query with taking into account the
> > hints *must* produce the same result as executing the query without
> > taking into account the hints
> >
> >  From these simple requirements you can derive a solution that makes
> > sense. I don't have a strong preference for the syntax but we should
> > strive to be in line with prior work.
> >
> > Best,
> > Aljoscha
> >
> > On 11.03.20 11:53, Danny Chan wrote:
> > > Thanks Timo for summarize the 3 options ~
> > >
> > > I agree with Kurt that option2 is too complicated to use because:
> > >
> > > • As a Kafka topic consumer, the user must define both the virtual
> > column for start offset and he must apply a special filter predicate
> after
> > each query
> > > • And for the internal implementation, the metadata column push down is
> > another hard topic, each kind of message queue may have its offset
> > attribute, we need to consider the expression type for different kind;
> the
> > source also need to recognize the constant column as a config
> option(which
> > is weird because usually what we pushed down is a table column)
> > >
> > > For option 1 and option3, I think there is no difference, option1 is
> > also a hint syntax which is introduced in Sybase and referenced then
> > deprecated by MS-SQL in 199X years because of the ambitiousness.
> Personally
> > I prefer /*+ */ style table hint than WITH keyword for these reasons:
> > >
> > > • We do not break the standard SQL, the hints are nested in SQL
> comments
> > > • We do not need to introduce additional WITH keyword which may appear
> > in a query if we use that because a table can be referenced in all kinds
> of
> > SQL contexts: INSERT/DELETE/FROM/JOIN …. That would make our sql query
> > break too much of the SQL from standard
> > > • We would have uniform syntax for hints as query hint, one syntax fits
> > all and more easy to use
> > >
> > >
> > > And here is the reason why we choose a uniform Oracle style query
> > hint syntax which is addressed by Julian Hyde when we design the syntax
> > from the Calcite community:
> > >
> > > I don’t much like the MSSQL-style syntax for table hints. It adds a new
> > use of the WITH keyword that is unrelated to the use of WITH for
> > common-table expressions.
> > >
> > > A historical note. Microsoft SQL Server inherited its hint syntax from
> > Sybase a very long time ago. (See “Transact SQL Programming”[1], page
> 632,
> > “Optimizer hints”. The book was written in 1999, and covers Microsoft SQL
> > Server 6.5 / 7.0 and Sybase Adaptive Server 11.5, but the syntax very
> > likely predates Sybase 4.3, from which Microsoft SQL Server was forked in
> > 1993.)
> > >
> > > Microsoft later added the WITH keyword to make it less ambiguous, and
> > has now deprecated the syntax that does not use WITH.
> > >
> > > They are forced to keep the syntax for backwards compatibility but that
> > doesn’t mean that we should shoulder their burden.
> > >
> > > I think formatted comments are the right container for hints because it
> > allows us to change the hint syntax without changing the SQL parser, and
> > makes clear that we are at liberty to ignore hints entirely.
> > >
> > > Julian
> > >
> > > [1] https://www.amazon.com/s?k=9781565924017 <
> > https://www.amazon.com/s?k=9781565924017>
> > >
> > > Best,
> > > Danny Chan
> > > 在 2020年3月11日 +0800 PM4:03，Timo Walther <[email protected]>，写道：
> > >> Hi Danny,
> > >>
> > >> it is true that our DDL is not standard compliant by using the WITH
> > >> clause. Nevertheless, we aim for not diverging too much and the LIKE
> > >> clause is an example of that. It will solve things like overwriting
> > >> WATERMARKs, add additional/modifying properties and inherit schema.
> > >>
> > >> Bowen is right that Flink's DDL is mixing 3 types definition together.
> > >> We are not the first ones that try to solve this. There is also the
> SQL
> > >> MED standard [1] that tried to tackle this problem. I think it was not
> > >> considered when designing the current DDL.
> > >>
> > >> Currently, I see 3 options for handling Kafka offsets. I will give
> some
> > >> examples and look forward to feedback here:
> > >>
> > >> *Option 1* Runtime and semantic parms as part of the query
> > >>
> > >> `SELECT * FROM MyTable('offset'=123)`
> > >>
> > >> Pros:
> > >> - Easy to add
> > >> - Parameters are part of the main query
> > >> - No complicated hinting syntax
> > >>
> > >> Cons:
> > >> - Not SQL compliant
> > >>
> > >> *Option 2* Use metadata in query
> > >>
> > >> `CREATE TABLE MyTable (id INT, offset AS SYSTEM_METADATA('offset'))`
> > >>
> > >> `SELECT * FROM MyTable WHERE offset > TIMESTAMP '2012-12-12 12:34:22'`
> > >>
> > >> Pros:
> > >> - SQL compliant in the query
> > >> - Access of metadata in the DDL which is required anyway
> > >> - Regular pushdown rules apply
> > >>
> > >> Cons:
> > >> - Users need to add an additional comlumn in the DDL
> > >>
> > >> *Option 3*: Use hints for properties
> > >>
> > >> `
> > >> SELECT *
> > >> FROM MyTable /*+ PROPERTIES('offset'=123) */
> > >> `
> > >>
> > >> Pros:
> > >> - Easy to add
> > >>
> > >> Cons:
> > >> - Parameters are not part of the main query
> > >> - Cryptic syntax for new users
> > >> - Not standard compliant.
> > >>
> > >> If we go with this option, I would suggest to make it available in a
> > >> separate map and don't mix it with statically defined properties. Such
> > >> that the factory can decide which properties have the right to be
> > >> overwritten by the hints:
> > >> TableSourceFactory.Context.getQueryHints(): ReadableConfig
> > >>
> > >> Regards,
> > >> Timo
> > >>
> > >> [1] https://en.wikipedia.org/wiki/SQL/MED
> > >>
> > >> Currently I see 3 options as a
> > >>
> > >>
> > >> On 11.03.20 07:21, Danny Chan wrote:
> > >>> Thanks Bowen ~
> > >>>
> > >>> I agree we should somehow categorize our connector parameters.
> > >>>
> > >>> For type1, I’m already preparing a solution like the Confluent schema
> > registry + Avro schema inference thing, so this may not be a problem in
> the
> > near future.
> > >>>
> > >>> For type3, I have some questions:
> > >>>
> > >>>> "SELECT * FROM mykafka WHERE offset > 12pm yesterday”
> > >>>
> > >>> Where does the offset column come from, a virtual column from the
> > table schema, you said that
> > >>>
> > >>>> They change
> > >>> almost every time a query starts and have nothing to do with
> metadata,
> > thus
> > >>> should not be part of table definition/DDL
> > >>>
> > >>> But why you can reference it in the query, I’m confused for that, can
> > you elaborate a little ?
> > >>>
> > >>> Best,
> > >>> Danny Chan
> > >>> 在 2020年3月11日 +0800 PM12:52，Bowen Li <[email protected]>，写道：
> > >>>> Thanks Danny for kicking off the effort
> > >>>>
> > >>>> The root cause of too much manual work is Flink DDL has mixed 3
> types
> > of
> > >>>> params together and doesn't handle each of them very well. Below are
> > how I
> > >>>> categorize them and corresponding solutions in my mind:
> > >>>>
> > >>>> - type 1: Metadata of external data, like external endpoint/url,
> > >>>> username/pwd, schemas, formats.
> > >>>>
> > >>>> Such metadata are mostly already accessible in external system as
> > long as
> > >>>> endpoints and credentials are provided. Flink can get it thru
> > catalogs, but
> > >>>> we haven't had many catalogs yet and thus Flink just hasn't been
> able
> > to
> > >>>> leverage that. So the solution should be building more catalogs.
> Such
> > >>>> params should be part of a Flink table DDL/definition, and not
> > overridable
> > >>>> in any means.
> > >>>>
> > >>>>
> > >>>> - type 2: Runtime params, like jdbc connector's fetch size,
> > elasticsearch
> > >>>> connector's bulk flush size.
> > >>>>
> > >>>> Such params don't affect query results, but affect how results are
> > produced
> > >>>> (eg. fast or slow, aka performance) - they are essentially execution
> > and
> > >>>> implementation details. They change often in exploration or
> > development
> > >>>> stages, but not quite frequently in well-defined long-running
> > pipelines.
> > >>>> They should always have default values and can be missing in query.
> > They
> > >>>> can be part of a table DDL/definition, but should also be
> replaceable
> > in a
> > >>>> query - *this is what table "hints" in FLIP-113 should cover*.
> > >>>>
> > >>>>
> > >>>> - type 3: Semantic params, like kafka connector's start offset.
> > >>>>
> > >>>> Such params affect query results - the semantics. They'd better be
> as
> > >>>> filter conditions in WHERE clause that can be pushed down. They
> change
> > >>>> almost every time a query starts and have nothing to do with
> > metadata, thus
> > >>>> should not be part of table definition/DDL, nor be persisted in
> > catalogs.
> > >>>> If they will, users should create views to keep such params around
> > (note
> > >>>> this is different from variable substitution).
> > >>>>
> > >>>>
> > >>>> Take Flink-Kafka as an example. Once we get these params right,
> > here're the
> > >>>> steps users need to do to develop and run a Flink job:
> > >>>> - configure a Flink ConfluentSchemaRegistry with url, username, and
> > password
> > >>>> - run "SELECT * FROM mykafka WHERE offset > 12pm yesterday"
> > (simplified
> > >>>> timestamp) in SQL CLI, Flink automatically retrieves all metadata of
> > >>>> schema, file format, etc and start the job
> > >>>> - users want to make the job read Kafka topic faster, so it goes as
> > "SELECT
> > >>>> * FROM mykafka /* faster_read_key=value*/ WHERE offset > 12pm
> > yesterday"
> > >>>> - done and satisfied, users submit it to production
> > >>>>
> > >>>>
> > >>>> Regarding "CREATE TABLE t LIKE with (k1=v1, k2=v2), I think it's a
> > >>>> nice-to-have feature, but not a strategically critical, long-term
> > solution,
> > >>>> because
> > >>>> 1) It may seem promising at the current stage to solve the
> > >>>> too-much-manual-work problem, but that's only because Flink hasn't
> > >>>> leveraged catalogs well and handled the 3 types of params above
> > properly.
> > >>>> Once we get the params types right, the LIKE syntax won't be that
> > >>>> important, and will be just an easier way to create tables without
> > retyping
> > >>>> long fields like username and pwd.
> > >>>> 2) Note that only some rare type of catalog can store k-v property
> > pair, so
> > >>>> table created this way often cannot be persisted. In the foreseeable
> > >>>> future, such catalog will only be HiveCatalog, and not everyone has
> a
> > Hive
> > >>>> metastore. To be honest, without persistence, recreating tables
> every
> > time
> > >>>> this way is still a lot of keyboard typing.
> > >>>>
> > >>>> Cheers,
> > >>>> Bowen
> > >>>>
> > >>>> On Tue, Mar 10, 2020 at 8:07 PM Kurt Young <[email protected]>
> wrote:
> > >>>>
> > >>>>> If a specific connector want to have such parameter and read if out
> > of
> > >>>>> configuration, then that's fine.
> > >>>>> If we are talking about a configuration for all kinds of sources, I
> > would
> > >>>>> be super careful about that.
> > >>>>> It's true it can solve maybe 80% cases, but it will also make the
> > left 20%
> > >>>>> feels weird.
> > >>>>>
> > >>>>> Best,
> > >>>>> Kurt
> > >>>>>
> > >>>>>
> > >>>>> On Wed, Mar 11, 2020 at 11:00 AM Jark Wu <[email protected]> wrote:
> > >>>>>
> > >>>>>> Hi Kurt,
> > >>>>>>
> > >>>>>> #3 Regarding to global offset:
> > >>>>>> I'm not saying to use the global configuration to override
> connector
> > >>>>>> properties by the planner.
> > >>>>>> But the connector should take this configuration and translate
> into
> > their
> > >>>>>> client API.
> > >>>>>> AFAIK, almost all the message queues support eariliest and latest
> > and a
> > >>>>>> timestamp value as start point.
> > >>>>>> So we can support 3 options for this configuration: "eariliest",
> > "latest"
> > >>>>>> and a timestamp string value.
> > >>>>>> Of course, this can't solve 100% cases, but I guess can sovle 80%
> > or 90%
> > >>>>>> cases.
> > >>>>>> And the remaining cases can be resolved by LIKE syntax which I
> > guess is
> > >>>>> not
> > >>>>>> very common cases.
> > >>>>>>
> > >>>>>> Best,
> > >>>>>> Jark
> > >>>>>>
> > >>>>>>
> > >>>>>> On Wed, 11 Mar 2020 at 10:33, Kurt Young <[email protected]>
> wrote:
> > >>>>>>
> > >>>>>>> Good to have such lovely discussions. I also want to share some
> of
> > my
> > >>>>>>> opinions.
> > >>>>>>>
> > >>>>>>> #1 Regarding to error handling: I also think ignore invalid hints
> > would
> > >>>>>> be
> > >>>>>>> dangerous, maybe
> > >>>>>>> the simplest solution is just throw an exception.
> > >>>>>>>
> > >>>>>>> #2 Regarding to property replacement: I don't think we should
> > >>>>> constraint
> > >>>>>>> ourself to
> > >>>>>>> the meaning of the word "hint", and forbidden it modifying any
> > >>>>> properties
> > >>>>>>> which can effect
> > >>>>>>> query results. IMO `PROPERTIES` is one of the table hints, and a
> > >>>>> powerful
> > >>>>>>> one. It can
> > >>>>>>> modify properties located in DDL's WITH block. But I also see the
> > harm
> > >>>>>> that
> > >>>>>>> if we make it
> > >>>>>>> too flexible like change the kafka topic name with a hint. Such
> use
> > >>>>> case
> > >>>>>> is
> > >>>>>>> not common and
> > >>>>>>> sounds very dangerous to me. I would propose we have a map of
> > hintable
> > >>>>>>> properties for each
> > >>>>>>> connector, and should validate all passed in properties are
> > actually
> > >>>>>>> hintable. And combining with
> > >>>>>>> #1 error handling, we can throw an exception once received
> invalid
> > >>>>>>> property.
> > >>>>>>>
> > >>>>>>> #3 Regarding to global offset: I'm not sure it's feasible.
> > Different
> > >>>>>>> connectors will have totally
> > >>>>>>> different properties to represent offset, some might be
> timestamps,
> > >>>>> some
> > >>>>>>> might be string literals
> > >>>>>>> like "earliest", and others might be just integers.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Kurt
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Tue, Mar 10, 2020 at 11:46 PM Jark Wu <[email protected]>
> wrote:
> > >>>>>>>
> > >>>>>>>> Hi everyone,
> > >>>>>>>>
> > >>>>>>>> I want to jump in the discussion about the "dynamic start
> offset"
> > >>>>>>> problem.
> > >>>>>>>> First of all, I share the same concern with Timo and Fabian,
> that
> > the
> > >>>>>>>> "start offset" affects the query semantics, i.e. the query
> result.
> > >>>>>>>> But "hints" is just used for optimization which should affect
> the
> > >>>>>> result?
> > >>>>>>>>
> > >>>>>>>> I think the "dynamic start offset" is an very important
> usability
> > >>>>>> problem
> > >>>>>>>> which will be faced by many streaming platforms.
> > >>>>>>>> I also agree "CREATE TEMPORARY TABLE Temp (LIKE t) WITH
> > >>>>>>>> ('connector.startup-timestamp-millis' = '1578538374471')" is
> > verbose,
> > >>>>>>> what
> > >>>>>>>> if we have 10 tables to join?
> > >>>>>>>>
> > >>>>>>>> However, what I want to propose (should be another thread) is a
> > >>>>> global
> > >>>>>>>> configuration to reset start offsets of all the source
> connectors
> > >>>>>>>> in the query session, e.g. "table.sources.start-offset". This is
> > >>>>>> possible
> > >>>>>>>> now because `TableSourceFactory.Context` has `getConfiguration`
> > >>>>>>>> method to get the session configuration, and use it to create an
> > >>>>>> adapted
> > >>>>>>>> TableSource.
> > >>>>>>>> Then we can also expose to SQL CLI via SET command, e.g. `SET
> > >>>>>>>> 'table.sources.start-offset'='earliest';`, which is pretty
> simple
> > and
> > >>>>>>>> straightforward.
> > >>>>>>>>
> > >>>>>>>> This is very similar to KSQL's `SET
> > 'auto.offset.reset'='earliest'`
> > >>>>>> which
> > >>>>>>>> is very helpful IMO.
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>> Jark
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Tue, 10 Mar 2020 at 22:29, Timo Walther <[email protected]>
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Danny,
> > >>>>>>>>>
> > >>>>>>>>> compared to the hints, FLIP-110 is fully compliant to the SQL
> > >>>>>> standard.
> > >>>>>>>>>
> > >>>>>>>>> I don't think that `CREATE TEMPORARY TABLE Temp (LIKE t) WITH
> > >>>>> (k=v)`
> > >>>>>> is
> > >>>>>>>>> too verbose or awkward for the power of basically changing the
> > >>>>> entire
> > >>>>>>>>> connector. Usually, this statement would just precede the query
> > in
> > >>>>> a
> > >>>>>>>>> multiline file. So it can be change "in-place" like the hints
> you
> > >>>>>>>> proposed.
> > >>>>>>>>>
> > >>>>>>>>> Many companies have a well-defined set of tables that should be
> > >>>>> used.
> > >>>>>>> It
> > >>>>>>>>> would be dangerous if users can change the path or topic in a
> > hint.
> > >>>>>> The
> > >>>>>>>>> catalog/catalog manager should be the entity that controls
> which
> > >>>>>> tables
> > >>>>>>>>> exist and how they can be accessed.
> > >>>>>>>>>
> > >>>>>>>>>> what’s the problem there if we user the table hints to support
> > >>>>>>> “start
> > >>>>>>>>> offset”?
> > >>>>>>>>>
> > >>>>>>>>> IMHO it violates the meaning of a hint. According to the
> > >>>>> dictionary,
> > >>>>>> a
> > >>>>>>>>> hint is "a statement that expresses indirectly what one prefers
> > not
> > >>>>>> to
> > >>>>>>>>> say explicitly". But offsets are a property that are very
> > explicit.
> > >>>>>>>>>
> > >>>>>>>>> If we go with the hint approach, it should be expressible in
> the
> > >>>>>>>>> TableSourceFactory which properties are supported for hinting.
> Or
> > >>>>> do
> > >>>>>>> you
> > >>>>>>>>> plan to offer those hints in a separate Map<String, String>
> that
> > >>>>>> cannot
> > >>>>>>>>> overwrite existing properties? I think this would be a
> different
> > >>>>>>> story...
> > >>>>>>>>>
> > >>>>>>>>> Regards,
> > >>>>>>>>> Timo
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On 10.03.20 10:34, Danny Chan wrote:
> > >>>>>>>>>> Thanks Timo ~
> > >>>>>>>>>>
> > >>>>>>>>>> Personally I would say that offset > 0 and start offset = 10
> > does
> > >>>>>> not
> > >>>>>>>>> have the same semantic, so from the SQL aspect, we can not
> > >>>>> implement
> > >>>>>> a
> > >>>>>>>>> “starting offset” hint for query with such a syntax.
> > >>>>>>>>>>
> > >>>>>>>>>> And the CREATE TABLE LIKE syntax is a DDL which is just
> verbose
> > >>>>> for
> > >>>>>>>>> defining such dynamic parameters even if it could do that,
> shall
> > we
> > >>>>>>> force
> > >>>>>>>>> users to define a temporal table for each query with dynamic
> > >>>>> params,
> > >>>>>> I
> > >>>>>>>>> would say it’s an awkward solution.
> > >>>>>>>>>>
> > >>>>>>>>>> "Hints should give "hints" but not affect the actual produced
> > >>>>>>> result.”
> > >>>>>>>>> You mentioned that multiple times and could we give a reason,
> > >>>>> what’s
> > >>>>>>> the
> > >>>>>>>>> problem there if we user the table hints to support “start
> > offset”
> > >>>>> ?
> > >>>>>>> From
> > >>>>>>>>> my side I saw some benefits for that:
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> • It’s very convent to set up these parameters, the syntax is
> > >>>>> very
> > >>>>>>> much
> > >>>>>>>>> like the DDL definition
> > >>>>>>>>>> • It’s scope is very clear, right on the table it attathed
> > >>>>>>>>>> • It does not affect the table schema, which means in order to
> > >>>>>>> specify
> > >>>>>>>>> the offset, there is no need to define an offset column which
> is
> > >>>>>> weird
> > >>>>>>>>> actually, offset should never be a column, it’s more like a
> > >>>>> metadata
> > >>>>>>> or a
> > >>>>>>>>> start option.
> > >>>>>>>>>>
> > >>>>>>>>>> So in total, FLIP-110 uses the offset more like a Hive
> partition
> > >>>>>>> prune,
> > >>>>>>>>> we can do that if we have an offset column, but most of the
> case
> > we
> > >>>>>> do
> > >>>>>>>> not
> > >>>>>>>>> define that, so there is actually no conflict or overlap.
> > >>>>>>>>>>
> > >>>>>>>>>> Best,
> > >>>>>>>>>> Danny Chan
> > >>>>>>>>>> 在 2020年3月10日 +0800 PM4:28，Timo Walther <[email protected]
> >，写道：
> > >>>>>>>>>>> Hi Danny,
> > >>>>>>>>>>>
> > >>>>>>>>>>> shouldn't FLIP-110[1] solve most of the problems we have
> around
> > >>>>>>>> defining
> > >>>>>>>>>>> table properties more dynamically without manual schema work?
> > >>>>> Also
> > >>>>>>>>>>> offset definition is easier with such a syntax. They must not
> > be
> > >>>>>>>> defined
> > >>>>>>>>>>> in catalog but could be temporary tables that extend from the
> > >>>>>>> original
> > >>>>>>>>>>> table.
> > >>>>>>>>>>>
> > >>>>>>>>>>> In general, we should aim to keep the syntax concise and
> don't
> > >>>>>>> provide
> > >>>>>>>>>>> too many ways of doing the same thing. Hints should give
> > "hints"
> > >>>>>> but
> > >>>>>>>> not
> > >>>>>>>>>>> affect the actual produced result.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Some connector properties might also change the plan or
> schema
> > >>>>> in
> > >>>>>>> the
> > >>>>>>>>>>> future. E.g. they might also define whether a table source
> > >>>>>> supports
> > >>>>>>>>>>> certain push-downs (e.g. predicate push-down).
> > >>>>>>>>>>>
> > >>>>>>>>>>> Dawid is currently working a draft that might makes it
> possible
> > >>>>> to
> > >>>>>>>>>>> expose a Kafka offset via the schema such that `SELECT * FROM
> > >>>>>> Topic
> > >>>>>>>>>>> WHERE offset > 10` would become possible and could be pushed
> > >>>>> down.
> > >>>>>>> But
> > >>>>>>>>>>> this is of course, not planned initially.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regards,
> > >>>>>>>>>>> Timo
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> [1]
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-110%3A+Support+LIKE+clause+in+CREATE+TABLE
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 10.03.20 08:34, Danny Chan wrote:
> > >>>>>>>>>>>> Thanks Wenlong ~
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For PROPERTIES Hint Error handling
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Actually we have no way to figure out whether a error prone
> > >>>>> hint
> > >>>>>>> is a
> > >>>>>>>>> PROPERTIES hint, for example, if use writes a hint like
> > >>>>> ‘PROPERTIAS’,
> > >>>>>>> we
> > >>>>>>>> do
> > >>>>>>>>> not know if this hint is a PROPERTIES hint, what we know is
> that
> > >>>>> the
> > >>>>>>> hint
> > >>>>>>>>> name was not registered in our Flink.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> If the user writes the hint name correctly (i.e.
> PROPERTIES),
> > >>>>> we
> > >>>>>>> did
> > >>>>>>>>> can enforce the validation of the hint options though the
> > pluggable
> > >>>>>>>>> HintOptionChecker.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For PROPERTIES Hint Option Format
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For a key value style hint option, the key can be either a
> > >>>>> simple
> > >>>>>>>>> identifier or a string literal, which means that it’s
> compatible
> > >>>>> with
> > >>>>>>> our
> > >>>>>>>>> DDL syntax. We support simple identifier because many other
> hints
> > >>>>> do
> > >>>>>>> not
> > >>>>>>>>> have the component complex keys like the table properties, and
> we
> > >>>>>> want
> > >>>>>>> to
> > >>>>>>>>> unify the parse block.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Danny Chan
> > >>>>>>>>>>>> 在 2020年3月10日 +0800 PM3:19，wenlong.lwl <
> > [email protected]
> > >>>>>>>> ，写道：
> > >>>>>>>>>>>>> Hi Danny, thanks for the proposal. +1 for adding table
> hints,
> > >>>>> it
> > >>>>>>> is
> > >>>>>>>>> really
> > >>>>>>>>>>>>> a necessary feature for flink sql to integrate with a
> > catalog.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> For error handling, I think it would be more natural to
> throw
> > >>>>> an
> > >>>>>>>>>>>>> exception when error table hint provided, because the
> > >>>>> properties
> > >>>>>>> in
> > >>>>>>>>> hint
> > >>>>>>>>>>>>> will be merged and used to find the table factory which
> would
> > >>>>>>> cause
> > >>>>>>>> an
> > >>>>>>>>>>>>> exception when error properties provided, right? On the
> other
> > >>>>>>> hand,
> > >>>>>>>>> unlike
> > >>>>>>>>>>>>> other hints which just affect the way to execute the query,
> > >>>>> the
> > >>>>>>>>> property
> > >>>>>>>>>>>>> table hint actually affects the result of the query, we
> > should
> > >>>>>>> never
> > >>>>>>>>> ignore
> > >>>>>>>>>>>>> the given property hints.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> For the format of property hints, currently, in sql client,
> > we
> > >>>>>>>> accept
> > >>>>>>>>>>>>> properties in format of string only in DDL:
> > >>>>>>>> 'connector.type'='kafka',
> > >>>>>>>>> I
> > >>>>>>>>>>>>> think the format of properties in hint should be the same
> as
> > >>>>> the
> > >>>>>>>>> format we
> > >>>>>>>>>>>>> defined in ddl. What do you think?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Bests,
> > >>>>>>>>>>>>> Wenlong Lyu
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Tue, 10 Mar 2020 at 14:22, Danny Chan <
> > >>>>> [email protected]>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> To Weike: About the Error Handing
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> To be consistent with other SQL vendors, the default is to
> > >>>>> log
> > >>>>>>>>> warnings
> > >>>>>>>>>>>>>> and if there is any error (invalid hint name or options),
> > the
> > >>>>>>> hint
> > >>>>>>>>> is just
> > >>>>>>>>>>>>>> ignored. I have already addressed in the wiki.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> To Timo: About the PROPERTIES Table Hint
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> • The properties hints is also optional, user can pass in
> an
> > >>>>>>> option
> > >>>>>>>>> to
> > >>>>>>>>>>>>>> override the table properties but this does not mean it is
> > >>>>>>>> required.
> > >>>>>>>>>>>>>> • They should not include semantics: does the properties
> > >>>>> belong
> > >>>>>>> to
> > >>>>>>>>>>>>>> semantic ? I don't think so, the plan does not change
> right
> > ?
> > >>>>>> The
> > >>>>>>>>> result
> > >>>>>>>>>>>>>> set may be affected, but there are already some hints do
> so,
> > >>>>>> for
> > >>>>>>>>> example,
> > >>>>>>>>>>>>>> MS-SQL MAXRECURSION and SNAPSHOT hint [1]
> > >>>>>>>>>>>>>> • `SELECT * FROM t(k=v, k=v)`: this grammar breaks the SQL
> > >>>>>>> standard
> > >>>>>>>>>>>>>> compared to the hints way(which is included in comments)
> > >>>>>>>>>>>>>> • I actually didn't found any vendors to support such
> > >>>>> grammar,
> > >>>>>>> and
> > >>>>>>>>> there
> > >>>>>>>>>>>>>> is no way to override table level properties dynamically.
> > For
> > >>>>>>>> normal
> > >>>>>>>>> RDBMS,
> > >>>>>>>>>>>>>> I think there are no requests for such dynamic parameters
> > >>>>>> because
> > >>>>>>>>> all the
> > >>>>>>>>>>>>>> table have the same storage and computation and they are
> > >>>>> almost
> > >>>>>>> all
> > >>>>>>>>> batch
> > >>>>>>>>>>>>>> tables.
> > >>>>>>>>>>>>>> • While Flink as a computation engine has many connectors,
> > >>>>>>>>> especially for
> > >>>>>>>>>>>>>> some message queue like Kafka, we would have a
> start_offset
> > >>>>>> which
> > >>>>>>>> is
> > >>>>>>>>>>>>>> different each time we start the query, such parameters
> can
> > >>>>> not
> > >>>>>>> be
> > >>>>>>>>>>>>>> persisted to catalog, because it’s not static, this is
> > >>>>> actually
> > >>>>>>> the
> > >>>>>>>>>>>>>> background we propose the table hints to indicate such
> > >>>>>> properties
> > >>>>>>>>>>>>>> dynamically.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> To Jark and Jinsong: I have removed the query hints part
> and
> > >>>>>>> change
> > >>>>>>>>> the
> > >>>>>>>>>>>>>> title.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> >
> https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-query?view=sql-server-ver15
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>> Danny Chan
> > >>>>>>>>>>>>>> 在 2020年3月9日 +0800 PM5:46，Timo Walther <[email protected]
> > >>>>>> ，写道：
> > >>>>>>>>>>>>>>> Hi Danny,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> thanks for the proposal. I agree with Jark and Jingsong.
> > >>>>>> Planner
> > >>>>>>>>> hints
> > >>>>>>>>>>>>>>> and table hints are orthogonal topics that should be
> > >>>>> discussed
> > >>>>>>>>>>>>>> separately.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I share Jingsong's opinion that we should not use planner
> > >>>>>> hints
> > >>>>>>>> for
> > >>>>>>>>>>>>>>> passing connector properties. Planner hints should be
> > >>>>> optional
> > >>>>>>> at
> > >>>>>>>>> any
> > >>>>>>>>>>>>>>> time. They should not include semantics but only affect
> > >>>>>>> execution
> > >>>>>>>>> time.
> > >>>>>>>>>>>>>>> Connector properties are an important part of the query
> > >>>>>> itself.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Have you thought about options such as `SELECT * FROM
> > t(k=v,
> > >>>>>>>> k=v)`?
> > >>>>>>>>> How
> > >>>>>>>>>>>>>>> are other vendors deal with this problem?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>> Timo
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On 09.03.20 10:37, Jingsong Li wrote:
> > >>>>>>>>>>>>>>>> Hi Danny, +1 for table hints, thanks for driving.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I took a look to FLIP, most of content are talking about
> > >>>>>> query
> > >>>>>>>>> hints.
> > >>>>>>>>>>>>>> It is
> > >>>>>>>>>>>>>>>> hard to discussion and voting. So +1 to split it as Jark
> > >>>>>> said.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Another thing is configuration that suitable to config
> > with
> > >>>>>>> table
> > >>>>>>>>>>>>>> hints:
> > >>>>>>>>>>>>>>>> "connector.path" and "connector.topic", Are they really
> > >>>>>>> suitable
> > >>>>>>>>> for
> > >>>>>>>>>>>>>> table
> > >>>>>>>>>>>>>>>> hints? Looks weird to me. Because I think these
> properties
> > >>>>>> are
> > >>>>>>>> the
> > >>>>>>>>>>>>>> core of
> > >>>>>>>>>>>>>>>> table.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>> Jingsong Lee
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Mon, Mar 9, 2020 at 5:30 PM Jark Wu <
> [email protected]>
> > >>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks Danny for starting the discussion.
> > >>>>>>>>>>>>>>>>> +1 for this feature.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> If we just focus on the table hints not the query hints
> > in
> > >>>>>>> this
> > >>>>>>>>>>>>>> release,
> > >>>>>>>>>>>>>>>>> could you split the FLIP into two FLIPs?
> > >>>>>>>>>>>>>>>>> Because it's hard to vote on partial part of a FLIP.
> You
> > >>>>> can
> > >>>>>>>> keep
> > >>>>>>>>>>>>>> the table
> > >>>>>>>>>>>>>>>>> hints proposal in FLIP-113 and move query hints into
> > >>>>> another
> > >>>>>>>> FLIP.
> > >>>>>>>>>>>>>>>>> So that we can focuse on the table hints in the FLIP.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>> Jark
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Mon, 9 Mar 2020 at 17:14, DONG, Weike <
> > >>>>>>>> [email protected]
> > >>>>>>>>>>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Hi Danny,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> This is a nice feature, +1.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> One thing I am interested in but not mentioned in the
> > >>>>>>> proposal
> > >>>>>>>> is
> > >>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>> error
> > >>>>>>>>>>>>>>>>>> handling, as it is quite common for users to write
> > >>>>>>>> inappropriate
> > >>>>>>>>>>>>>> hints in
> > >>>>>>>>>>>>>>>>>> SQL code, if illegal or "bad" hints are given, would
> the
> > >>>>>>> system
> > >>>>>>>>>>>>>> simply
> > >>>>>>>>>>>>>>>>>> ignore them or throw exceptions?
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Thanks : )
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>> Weike
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Mon, Mar 9, 2020 at 5:02 PM Danny Chan <
> > >>>>>>>> [email protected]>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Note:
> > >>>>>>>>>>>>>>>>>>> we only plan to support table hints in Flink release
> > >>>>> 1.11,
> > >>>>>>> so
> > >>>>>>>>>>>>>> please
> > >>>>>>>>>>>>>>>>>> focus
> > >>>>>>>>>>>>>>>>>>> mainly on the table hints part and just ignore the
> > >>>>> planner
> > >>>>>>>>>>>>>> hints, sorry
> > >>>>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>> that mistake ~
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>> Danny Chan
> > >>>>>>>>>>>>>>>>>>> 在 2020年3月9日 +0800 PM4:36，Danny Chan <
> > >>>>> [email protected]
> > >>>>>>>>> ，写道：
> > >>>>>>>>>>>>>>>>>>>> Hi, fellows ~
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> I would like to propose the supports for SQL hints
> for
> > >>>>>> our
> > >>>>>>>>>>>>>> Flink SQL.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> We would support hints syntax as following:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> select /*+ NO_HASH_JOIN, RESOURCE(mem='128mb',
> > >>>>>>>>>>>>>> parallelism='24') */
> > >>>>>>>>>>>>>>>>>>>> from
> > >>>>>>>>>>>>>>>>>>>> emp /*+ INDEX(idx1, idx2) */
> > >>>>>>>>>>>>>>>>>>>> join
> > >>>>>>>>>>>>>>>>>>>> dept /*+ PROPERTIES(k1='v1', k2='v2') */
> > >>>>>>>>>>>>>>>>>>>> on
> > >>>>>>>>>>>>>>>>>>>> emp.deptno = dept.deptno
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Basically we would support both query hints(after
> the
> > >>>>>>> SELECT
> > >>>>>>>>>>>>>> keyword)
> > >>>>>>>>>>>>>>>>>>> and table hints(after the referenced table name), for
> > >>>>>> 1.11,
> > >>>>>>> we
> > >>>>>>>>>>>>>> plan to
> > >>>>>>>>>>>>>>>>>> only
> > >>>>>>>>>>>>>>>>>>> support table hints with a hint probably named
> > >>>>> PROPERTIES:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> table_name /*+ PROPERTIES(k1='v1', k2='v2') *+/
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> I am looking forward to your comments.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> You can access the FLIP here:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-113%3A+SQL+and+Planner+Hints
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>> Danny Chan
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> > >
> >
> >
>


-- 

Benchao Li
School of Electronics Engineering and Computer Science, Peking University
Tel:+86-15650713730
Email: [email protected]; [email protected]

Re: [DISCUSS]FLIP-113: Support SQL and planner hints

Reply via email to