Hi everyone,
+1 to Kurt's suggestion. Let's just have it in source and sink factories
for now. We can still move this method up in the future. Currently, I
don't see a need for catalogs or formats. Because how would you target a
format in the query?
@Danny: Can you send a link to your PoC? I'm very skeptical about
creating a new CatalogTable in planner. Actually CatalogTable should be
immutable between Catalog and Factory. Because a catalog can return its
own factory and fully control the instantiation. Depending on the
implementation, that means it can be possible that the catalog has
encoded more information in a concrete subclass implementing the
interface. I vote for separating the concerns of catalog information and
hints in the factory explicitly.
Regards,
Timo
On 18.03.20 05:41, Jingsong Li wrote:
Hi,
I am thinking we can provide hints to *table* related instances.
- TableFormatFactory: of cause we need hints support, there are many format
options in DDL too.
- catalog and module: I don't know, maybe in future we can provide some
hints for them.
Best,
Jingsong Lee
On Wed, Mar 18, 2020 at 12:28 PM Danny Chan <yuzhao....@gmail.com> wrote:
Yes, I think we should move the `supportedHintOptions` from TableFactory
to TableSourceFactory, and we also need to add the interface to
TableSinkFactory though because sink target table may also have hints
attached.
Best,
Danny Chan
在 2020年3月18日 +0800 AM11:08,Kurt Young <ykt...@gmail.com>,写道:
Have one question for adding `supportedHintOptions` method to
`TableFactory`. It seems
`TableFactory` is a base factory interface for all *table module* related
instances, such as
catalog, module, format and so on. It's not created only for *table*. Is
it
possible to move it
to `TableSourceFactory`?
Best,
Kurt
On Wed, Mar 18, 2020 at 10:59 AM Danny Chan <yuzhao....@gmail.com>
wrote:
Thanks Timo ~
For the naming itself, I also think the PROPERTIES is not that
concise, so
+1 for OPTIONS (I had thought about that, but there are many codes in
current Flink called it properties, i.e. the DescriptorProperties,
#getSupportedProperties), let’s use OPTIONS if this is our new
preference.
+1 to `Set<ConfigOption> supportedHintOptions()` because the
ConfigOption
can take more info. AFAIK, Spark also call their table options instead
of
properties. [1]
In my local POC, I did create a new CatalogTable, and it works for
current
connectors well, all the DDL tables would finally yield a CatalogTable
instance and we can apply the options to that(in the CatalogSourceTable
when we generating the TableSource), the pros is that we do not need to
modify the codes of connectors itself. If we split the options from
CatalogTable, we may need to add some additional logic in each
connector
factories in order to merge these properties (and the logic are almost
the
same), what do you think about this?
[1]
https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html
Best,
Danny Chan
在 2020年3月17日 +0800 PM10:10,Timo Walther <twal...@apache.org>,写道:
Hi Danny,
thanks for updating the FLIP. I think your current design is
sufficient
to separate hints from result-related properties.
One remark to the naming itself: I would vote for calling the hints
around table scan `OPTIONS('k'='v')`. We used the term "properties"
in
the past but since we want to unify the Flink configuration
experience,
we should use consistent naming and classes around `ConfigOptions`.
It would be nice to use `Set<ConfigOption> supportedHintOptions();`
to
start using config options instead of pure string properties. This
will
also allow us to generate documentation in the future around
supported
data types, ranges, etc. for options. At some point we would also
like
to drop `DescriptorProperties` class. "Options" is also used in the
documentation [1] and in the SQL/MED standard [2].
Furthermore, I would still vote for separating CatalogTable and hint
options. Otherwise the planner would need to create a new
CatalogTable
instance which might not always be easy. We should offer them via:
org.apache.flink.table.factories.TableSourceFactory.Context#getHints:
ReadableConfig
What do you think?
Regards,
Timo
[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/create.html#create-table
[2] https://wiki.postgresql.org/wiki/SQL/MED
On 12.03.20 15:06, Stephan Ewen wrote:
@Danny sounds good.
Maybe it is worth listing all the classes of problems that you
want to
address and then look at each class and see if hints are a good
default
solution or a good optional way of simplifying things?
The discussion has grown a lot and it is starting to be hard to
distinguish
the parts where everyone agrees from the parts were there are
concerns.
On Thu, Mar 12, 2020 at 2:31 PM Danny Chan <danny0...@apache.org>
wrote:
Thanks Stephan ~
We can remove the support for properties that may change the
semantics of
query if you think that is a trouble.
How about we support the /*+ properties() */ hint only for those
optimize
parameters, such as the fetch size of source or something like
that,
does
that make sense?
Stephan Ewen <se...@apache.org>于2020年3月12日 周四下午7:45写道:
I think Bowen has actually put it very well.
(1) Hints that change semantics looks like trouble waiting to
happen. For
example Kafka offset handling should be in filters. The Kafka
source
should
support predicate pushdown.
(2) Hints should not be a workaround for current shortcomings.
A
lot of
the
suggested above sounds exactly like that. Working around
catalog/DDL
shortcomings, missing exposure of metadata (offsets), missing
predicate
pushdown in Kafka. Abusing a feature like hints now as a quick
fix
for
these issues, rather than fixing the root causes, will much
likely
bite
us
back badly in the future.
Best,
Stephan
On Thu, Mar 12, 2020 at 10:43 AM Kurt Young <ykt...@gmail.com>
wrote:
It seems this FLIP's name is somewhat misleading. From my
understanding,
this FLIP is trying to
address the dynamic parameter issue, and table hints is the
way
we wan
to
choose. I think we should
be focus on "what's the right way to solve dynamic property"
instead of
discussing "whether table
hints can affect query semantics".
For now, there are two proposed ways to achieve dynamic
property:
1. FLIP-110: create temporary table xx like xx with (xxx)
2. use custom "from t with (xxx)" syntax
3. "Borrow" the table hints to have a special PROPERTIES
hint.
The first one didn't break anything, but the only problem i
see
is a
little
more verbose than the table hint
approach. I can imagine when someone using SQL CLI to have a
sql
experience, it's quite often that
he will modify the table property, some use cases i can
think of:
1. the source contains some corrupted data, i want to turn
on the
"ignore-error" flag for certain formats.
2. I have a kafka table and want to see some sample data
from the
beginning, so i change the offset
to "earliest", and then I want to observe the latest data
which
keeps
coming in. I would write another query
to select from the latest table.
3. I want to my jdbc sink flush data more eagerly then i can
observe
the
data from database side.
Most of such use cases are quite ad-hoc. If every time I
want to
have a
different experience, i need to create
a temporary table and then also modify my query, it doesn't
feel
smooth.
Embed such dynamic property into
query would have better user experience.
Both 2 & 3 can make this happen. The cons of #2 is breaking
SQL
compliant,
and for #3, it only breaks some
unwritten rules, but we can have an explanation on that. And
I
really
doubt
whether user would complain about
this when they actually have flexible and good experience
using
this.
My tendency would be #3 > #1 > #2, what do you think?
Best,
Kurt
On Thu, Mar 12, 2020 at 1:11 PM Danny Chan <
yuzhao....@gmail.com
wrote:
Thanks Aljoscha ~
I agree for most of the query hints, they are optional as
an
optimizer
instruction, especially for the traditional RDBMS.
But, just like BenChao said, Flink as a computation engine
has
many
different kind of data sources, thus, dynamic parameters
like
start_offest
can only bind to each table scope, we can not set a session
config
like
KSQL because they are all about Kafka:
SET ‘auto.offset.reset’=‘earliest’;
Thus the most flexible way to set up these dynamic params
is
to bind
to
the table scope in the query when we want to override
something, so
we
have
these solutions above (with pros and cons from my side):
• 1. Select * from t(offset=123) (from Timo)
Pros:
- Easy to add
- Parameters are part of the main query
Cons:
- Not SQL compliant
• 2. Select * from t /*+ PROPERTIES(offset=123) */ (from
me)
Pros:
- Easy to add
- SQL compliant because it is nested in the comments
Cons:
- Parameters are not part of the main query
- Cryptic syntax for new users
The biggest problem for hints way may be the “if hints
must be
optional”,
actually we have though about 1 for a while but aborted
because it
breaks
the SQL standard too much. And we replace it with 2,
because
the
hints
syntax do not break SQL standard(nested in comments).
What if we have the special /*+ PROPERTIES */ hint that
allows
override
some properties of table dynamically, it does not break
anything, at
lease
for current Flink use cases.
Planner hints are optional just because they are naturally
enforcers
of
the planner, most of them aim to instruct the optimizer,
but,
the
table
hints is a little different, table hints can specify the
table
meta
like
index column, and it is very convenient to specify table
properties.
Or shall we not call /*+ PROPERTIES(offset=123) */ table
hint,
we
can
call it table dynamic parameters.
Best,
Danny Chan
在 2020年3月11日 +0800 PM9:20,Aljoscha Krettek <
aljos...@apache.org>,写道:
Hi,
I don't understand this discussion. Hints, as I
understand
them,
should
work like this:
- hints are *optional* advice for the optimizer to try
and
help it
to
find a good execution strategy
- hints should not change query semantics, i.e. they
should
not
change
connector properties executing a query with taking into
account the
hints *must* produce the same result as executing the
query
without
taking into account the hints
From these simple requirements you can derive a solution
that makes
sense. I don't have a strong preference for the syntax
but we
should
strive to be in line with prior work.
Best,
Aljoscha
On 11.03.20 11:53, Danny Chan wrote:
Thanks Timo for summarize the 3 options ~
I agree with Kurt that option2 is too complicated to
use
because:
• As a Kafka topic consumer, the user must define both
the
virtual
column for start offset and he must apply a special filter
predicate
after
each query
• And for the internal implementation, the metadata
column
push
down
is another hard topic, each kind of message queue may have
its
offset
attribute, we need to consider the expression type for
different
kind;
the
source also need to recognize the constant column as a
config
option(which
is weird because usually what we pushed down is a table
column)
For option 1 and option3, I think there is no
difference,
option1
is
also a hint syntax which is introduced in Sybase and
referenced then
deprecated by MS-SQL in 199X years because of the
ambitiousness.
Personally
I prefer /*+ */ style table hint than WITH keyword for
these
reasons:
• We do not break the standard SQL, the hints are
nested
in SQL
comments
• We do not need to introduce additional WITH keyword
which may
appear
in a query if we use that because a table can be
referenced in
all
kinds
of
SQL contexts: INSERT/DELETE/FROM/JOIN …. That would make
our
sql
query
break too much of the SQL from standard
• We would have uniform syntax for hints as query
hint, one
syntax
fits all and more easy to use
And here is the reason why we choose a uniform Oracle
style query
hint syntax which is addressed by Julian Hyde when we
design
the
syntax
from the Calcite community:
I don’t much like the MSSQL-style syntax for table
hints.
It
adds a
new use of the WITH keyword that is unrelated to the use of
WITH for
common-table expressions.
A historical note. Microsoft SQL Server inherited its
hint
syntax
from
Sybase a very long time ago. (See “Transact SQL
Programming”[1], page
632,
“Optimizer hints”. The book was written in 1999, and covers
Microsoft
SQL
Server 6.5 / 7.0 and Sybase Adaptive Server 11.5, but the
syntax very
likely predates Sybase 4.3, from which Microsoft SQL
Server was
forked
in
1993.)
Microsoft later added the WITH keyword to make it less
ambiguous,
and
has now deprecated the syntax that does not use WITH.
They are forced to keep the syntax for backwards
compatibility
but
that doesn’t mean that we should shoulder their burden.
I think formatted comments are the right container for
hints
because
it allows us to change the hint syntax without changing
the SQL
parser,
and
makes clear that we are at liberty to ignore hints
entirely.
Julian
[1] https://www.amazon.com/s?k=9781565924017 <
https://www.amazon.com/s?k=9781565924017>
Best,
Danny Chan
在 2020年3月11日 +0800 PM4:03,Timo Walther <
twal...@apache.org
,写道:
Hi Danny,
it is true that our DDL is not standard compliant by
using the
WITH
clause. Nevertheless, we aim for not diverging too
much
and the
LIKE
clause is an example of that. It will solve things
like
overwriting
WATERMARKs, add additional/modifying properties and
inherit
schema.
Bowen is right that Flink's DDL is mixing 3 types
definition
together.
We are not the first ones that try to solve this.
There
is also
the
SQL
MED standard [1] that tried to tackle this problem. I
think it
was
not
considered when designing the current DDL.
Currently, I see 3 options for handling Kafka
offsets. I
will
give
some
examples and look forward to feedback here:
*Option 1* Runtime and semantic parms as part of the
query
`SELECT * FROM MyTable('offset'=123)`
Pros:
- Easy to add
- Parameters are part of the main query
- No complicated hinting syntax
Cons:
- Not SQL compliant
*Option 2* Use metadata in query
`CREATE TABLE MyTable (id INT, offset AS
SYSTEM_METADATA('offset'))`
`SELECT * FROM MyTable WHERE offset > TIMESTAMP
'2012-12-12
12:34:22'`
Pros:
- SQL compliant in the query
- Access of metadata in the DDL which is required
anyway
- Regular pushdown rules apply
Cons:
- Users need to add an additional comlumn in the DDL
*Option 3*: Use hints for properties
`
SELECT *
FROM MyTable /*+ PROPERTIES('offset'=123) */
`
Pros:
- Easy to add
Cons:
- Parameters are not part of the main query
- Cryptic syntax for new users
- Not standard compliant.
If we go with this option, I would suggest to make it
available
in
a
separate map and don't mix it with statically defined
properties.
Such
that the factory can decide which properties have the
right to
be
overwritten by the hints:
TableSourceFactory.Context.getQueryHints():
ReadableConfig
Regards,
Timo
[1] https://en.wikipedia.org/wiki/SQL/MED
Currently I see 3 options as a
On 11.03.20 07:21, Danny Chan wrote:
Thanks Bowen ~
I agree we should somehow categorize our connector
parameters.
For type1, I’m already preparing a solution like
the
Confluent
schema registry + Avro schema inference thing, so this may
not
be a
problem
in the near future.
For type3, I have some questions:
"SELECT * FROM mykafka WHERE offset > 12pm
yesterday”
Where does the offset column come from, a virtual
column from
the
table schema, you said that
They change
almost every time a query starts and have nothing
to
do with
metadata, thus
should not be part of table definition/DDL
But why you can reference it in the query, I’m
confused for
that,
can you elaborate a little ?
Best,
Danny Chan
在 2020年3月11日 +0800 PM12:52,Bowen Li <
bowenl...@gmail.com
,写道:
Thanks Danny for kicking off the effort
The root cause of too much manual work is Flink
DDL
has
mixed 3
types of
params together and doesn't handle each of them
very
well.
Below
are how I
categorize them and corresponding solutions in my
mind:
- type 1: Metadata of external data, like
external
endpoint/url,
username/pwd, schemas, formats.
Such metadata are mostly already accessible in
external
system
as long as
endpoints and credentials are provided. Flink can
get it
thru
catalogs, but
we haven't had many catalogs yet and thus Flink
just
hasn't
been
able to
leverage that. So the solution should be building
more
catalogs.
Such
params should be part of a Flink table
DDL/definition, and
not
overridable
in any means.
- type 2: Runtime params, like jdbc connector's
fetch size,
elasticsearch
connector's bulk flush size.
Such params don't affect query results, but
affect
how
results
are produced
(eg. fast or slow, aka performance) - they are
essentially
execution and
implementation details. They change often in
exploration or
development
stages, but not quite frequently in well-defined
long-running
pipelines.
They should always have default values and can be
missing
in
query. They
can be part of a table DDL/definition, but should
also be
replaceable in a
query - *this is what table "hints" in FLIP-113
should
cover*.
- type 3: Semantic params, like kafka connector's
start
offset.
Such params affect query results - the semantics.
They'd
better
be as
filter conditions in WHERE clause that can be
pushed
down.
They
change
almost every time a query starts and have
nothing to
do
with
metadata, thus
should not be part of table definition/DDL, nor
be
persisted
in
catalogs.
If they will, users should create views to keep
such
params
around (note
this is different from variable substitution).
Take Flink-Kafka as an example. Once we get these
params
right,
here're the
steps users need to do to develop and run a Flink
job:
- configure a Flink ConfluentSchemaRegistry with
url,
username,
and password
- run "SELECT * FROM mykafka WHERE offset > 12pm
yesterday"
(simplified
timestamp) in SQL CLI, Flink automatically
retrieves
all
metadata of
schema, file format, etc and start the job
- users want to make the job read Kafka topic
faster, so it
goes
as "SELECT
* FROM mykafka /* faster_read_key=value*/ WHERE
offset >
12pm
yesterday"
- done and satisfied, users submit it to
production
Regarding "CREATE TABLE t LIKE with (k1=v1,
k2=v2),
I think
it's
a
nice-to-have feature, but not a strategically
critical,
long-term solution,
because
1) It may seem promising at the current stage to
solve the
too-much-manual-work problem, but that's only
because Flink
hasn't
leveraged catalogs well and handled the 3 types
of
params
above
properly.
Once we get the params types right, the LIKE
syntax
won't
be
that
important, and will be just an easier way to
create
tables
without retyping
long fields like username and pwd.
2) Note that only some rare type of catalog can
store k-v
property pair, so
table created this way often cannot be
persisted. In
the
foreseeable
future, such catalog will only be HiveCatalog,
and
not
everyone
has a Hive
metastore. To be honest, without persistence,
recreating
tables
every time
this way is still a lot of keyboard typing.
Cheers,
Bowen
On Tue, Mar 10, 2020 at 8:07 PM Kurt Young <
ykt...@gmail.com
wrote:
If a specific connector want to have such
parameter and
read
if out of
configuration, then that's fine.
If we are talking about a configuration for all
kinds of
sources, I would
be super careful about that.
It's true it can solve maybe 80% cases, but it
will also
make
the left 20%
feels weird.
Best,
Kurt
On Wed, Mar 11, 2020 at 11:00 AM Jark Wu <
imj...@gmail.com
wrote:
Hi Kurt,
#3 Regarding to global offset:
I'm not saying to use the global
configuration to
override
connector
properties by the planner.
But the connector should take this
configuration
and
translate into their
client API.
AFAIK, almost all the message queues support
eariliest
and
latest and a
timestamp value as start point.
So we can support 3 options for this
configuration:
"eariliest", "latest"
and a timestamp string value.
Of course, this can't solve 100% cases, but I
guess can
sovle 80% or 90%
cases.
And the remaining cases can be resolved by
LIKE
syntax
which
I guess is
not
very common cases.
Best,
Jark
On Wed, 11 Mar 2020 at 10:33, Kurt Young <
ykt...@gmail.com
wrote:
Good to have such lovely discussions. I
also
want to
share
some of my
opinions.
#1 Regarding to error handling: I also
think
ignore
invalid hints would
be
dangerous, maybe
the simplest solution is just throw an
exception.
#2 Regarding to property replacement: I
don't
think
we
should
constraint
ourself to
the meaning of the word "hint", and
forbidden
it
modifying
any
properties
which can effect
query results. IMO `PROPERTIES` is one of
the
table
hints,
and a
powerful
one. It can
modify properties located in DDL's WITH
block.
But I
also
see the harm
that
if we make it
too flexible like change the kafka topic
name
with a
hint.
Such use
case
is
not common and
sounds very dangerous to me. I would
propose
we have
a
map
of hintable
properties for each
connector, and should validate all passed
in
properties
are actually
hintable. And combining with
#1 error handling, we can throw an
exception
once
received
invalid
property.
#3 Regarding to global offset: I'm not sure
it's
feasible.
Different
connectors will have totally
different properties to represent offset,
some
might
be
timestamps,
some
might be string literals
like "earliest", and others might be just
integers.
Best,
Kurt
On Tue, Mar 10, 2020 at 11:46 PM Jark Wu <
imj...@gmail.com>
wrote:
Hi everyone,
I want to jump in the discussion about
the
"dynamic
start offset"
problem.
First of all, I share the same concern
with
Timo
and
Fabian, that the
"start offset" affects the query
semantics,
i.e.
the
query result.
But "hints" is just used for optimization
which
should
affect the
result?
I think the "dynamic start offset" is an
very
important
usability
problem
which will be faced by many streaming
platforms.
I also agree "CREATE TEMPORARY TABLE Temp
(LIKE t)
WITH
('connector.startup-timestamp-millis' =
'1578538374471')" is verbose,
what
if we have 10 tables to join?
However, what I want to propose (should
be
another
thread) is a
global
configuration to reset start offsets of
all
the
source
connectors
in the query session, e.g.
"table.sources.start-offset".
This is
possible
now because `TableSourceFactory.Context`
has
`getConfiguration`
method to get the session configuration,
and
use it
to
create an
adapted
TableSource.
Then we can also expose to SQL CLI via
SET
command,
e.g.
`SET
'table.sources.start-offset'='earliest';`,
which is
pretty simple and
straightforward.
This is very similar to KSQL's `SET
'auto.offset.reset'='earliest'`
which
is very helpful IMO.
Best,
Jark
On Tue, 10 Mar 2020 at 22:29, Timo
Walther <
twal...@apache.org>
wrote:
Hi Danny,
compared to the hints, FLIP-110 is
fully
compliant
to
the SQL
standard.
I don't think that `CREATE TEMPORARY
TABLE
Temp
(LIKE
t) WITH
(k=v)`
is
too verbose or awkward for the power of
basically
changing the
entire
connector. Usually, this statement
would
just
precede
the query in
a
multiline file. So it can be change
"in-place"
like
the hints you
proposed.
Many companies have a well-defined set
of
tables
that
should be
used.
It
would be dangerous if users can change
the
path
or
topic in a hint.
The
catalog/catalog manager should be the
entity that
controls which
tables
exist and how they can be accessed.
what’s the problem there if we user
the
table
hints
to support
“start
offset”?
IMHO it violates the meaning of a hint.
According
to
the
dictionary,
a
hint is "a statement that expresses
indirectly
what
one prefers not
to
say explicitly". But offsets are a
property that
are
very explicit.
If we go with the hint approach, it
should
be
expressible in the
TableSourceFactory which properties are
supported
for
hinting. Or
do
you
plan to offer those hints in a separate
Map<String,
String> that
cannot
overwrite existing properties? I think
this would
be
a
different
story...
Regards,
Timo
On 10.03.20 10:34, Danny Chan wrote:
Thanks Timo ~
Personally I would say that offset >
0
and
start
offset = 10 does
not
have the same semantic, so from the SQL
aspect,
we
can
not
implement
a
“starting offset” hint for query with
such
a
syntax.
And the CREATE TABLE LIKE syntax is a
DDL which
is
just verbose
for
defining such dynamic parameters even
if
it could
do
that, shall we
force
users to define a temporal table for
each
query
with
dynamic
params,
I
would say it’s an awkward solution.
"Hints should give "hints" but not
affect the
actual
produced
result.”
You mentioned that multiple times and
could we
give a
reason,
what’s
the
problem there if we user the table
hints to
support
“start offset”
?
From
my side I saw some benefits for that:
• It’s very convent to set up these
parameters,
the
syntax is
very
much
like the DDL definition
• It’s scope is very clear, right on
the
table
it
attathed
• It does not affect the table
schema,
which
means
in order to
specify
the offset, there is no need to define
an
offset
column which is
weird
actually, offset should never be a
column,
it’s
more
like a
metadata
or a
start option.
So in total, FLIP-110 uses the offset
more
like a
Hive partition
prune,
we can do that if we have an offset
column, but
most
of the case we
do
not
define that, so there is actually no
conflict or
overlap.
Best,
Danny Chan
在 2020年3月10日 +0800 PM4:28,Timo
Walther <
twal...@apache.org>,写道:
Hi Danny,
shouldn't FLIP-110[1] solve most
of the
problems
we have around
defining
table properties more dynamically
without
manual
schema work?
Also
offset definition is easier with
such a
syntax.
They must not be
defined
in catalog but could be temporary
tables that
extend from the
original
table.
In general, we should aim to keep
the
syntax
concise and don't
provide
too many ways of doing the same
thing.
Hints
should give "hints"
but
not
affect the actual produced result.
Some connector properties might
also
change
the
plan or schema
in
the
future. E.g. they might also define
whether a
table source
supports
certain push-downs (e.g. predicate
push-down).
Dawid is currently working a draft
that might
makes it possible
to
expose a Kafka offset via the
schema
such
that
`SELECT * FROM
Topic
WHERE offset > 10` would become
possible and
could
be pushed
down.
But
this is of course, not planned
initially.
Regards,
Timo
[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-110%3A+Support+LIKE+clause+in+CREATE+TABLE
On 10.03.20 08:34, Danny Chan
wrote:
Thanks Wenlong ~
For PROPERTIES Hint Error
handling
Actually we have no way to
figure out
whether a
error prone
hint
is a
PROPERTIES hint, for example, if use
writes a
hint
like
‘PROPERTIAS’,
we
do
not know if this hint is a PROPERTIES
hint, what
we
know is that
the
hint
name was not registered in our Flink.
If the user writes the hint name
correctly
(i.e.
PROPERTIES),
we
did
can enforce the validation of the hint
options
though
the pluggable
HintOptionChecker.
For PROPERTIES Hint Option Format
For a key value style hint
option,
the key
can
be either a
simple
identifier or a string literal, which
means that
it’s
compatible
with
our
DDL syntax. We support simple
identifier
because
many
other hints
do
not
have the component complex keys like
the
table
properties, and we
want
to
unify the parse block.
Best,
Danny Chan
在 2020年3月10日 +0800
PM3:19,wenlong.lwl <
wenlong88....@gmail.com
,写道:
Hi Danny, thanks for the
proposal.
+1 for
adding table hints,
it
is
really
a necessary feature for flink
sql
to
integrate
with a catalog.
For error handling, I think it
would be
more
natural to throw
an
exception when error table hint
provided,
because the
properties
in
hint
will be merged and used to find
the table
factory which would
cause
an
exception when error properties
provided,
right? On the other
hand,
unlike
other hints which just affect
the
way to
execute the query,
the
property
table hint actually affects the
result of
the
query, we should
never
ignore
the given property hints.
For the format of property
hints,
currently,
in sql client, we
accept
properties in format of string
only in
DDL:
'connector.type'='kafka',
I
think the format of properties
in
hint
should
be the same as
the
format we
defined in ddl. What do you
think?
Bests,
Wenlong Lyu
On Tue, 10 Mar 2020 at 14:22,
Danny Chan
<
yuzhao....@gmail.com>
wrote:
To Weike: About the Error
Handing
To be consistent with other
SQL
vendors,
the
default is to
log
warnings
and if there is any error
(invalid hint
name
or options), the
hint
is just
ignored. I have already
addressed in
the
wiki.
To Timo: About the PROPERTIES
Table
Hint
• The properties hints is
also
optional,
user can pass in an
option
to
override the table properties
but this
does
not mean it is
required.
• They should not include
semantics:
does
the properties
belong
to
semantic ? I don't think so,
the
plan
does
not change right ?
The
result
set may be affected, but
there
are
already
some hints do so,
for
example,
MS-SQL MAXRECURSION and
SNAPSHOT
hint
[1]
• `SELECT * FROM t(k=v,
k=v)`:
this
grammar
breaks the SQL
standard
compared to the hints
way(which
is
included
in comments)
• I actually didn't found any
vendors
to
support such
grammar,
and
there
is no way to override table
level
properties
dynamically. For
normal
RDBMS,
I think there are no requests
for such
dynamic parameters
because
all the
table have the same storage
and
computation
and they are
almost
all
batch
tables.
• While Flink as a
computation
engine
has
many connectors,
especially for
some message queue like
Kafka,
we would
have
a start_offset
which
is
different each time we start
the
query,
such
parameters can
not
be
persisted to catalog, because
it’s not
static, this is
actually
the
background we propose the
table
hints
to
indicate such
properties
dynamically.
To Jark and Jinsong: I have
removed the
query hints part and
change
the
title.
[1]
https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-query?view=sql-server-ver15
Best,
Danny Chan
在 2020年3月9日 +0800 PM5:46,Timo
Walther <
twal...@apache.org
,写道:
Hi Danny,
thanks for the proposal. I
agree with
Jark
and Jingsong.
Planner
hints
and table hints are
orthogonal
topics
that
should be
discussed
separately.
I share Jingsong's opinion
that we
should
not use planner
hints
for
passing connector
properties.
Planner
hints should be
optional
at
any
time. They should not
include
semantics
but only affect
execution
time.
Connector properties are an
important
part
of the query
itself.
Have you thought about
options
such
as
`SELECT * FROM t(k=v,
k=v)`?
How
are other vendors deal with
this
problem?
Regards,
Timo
On 09.03.20 10:37,
Jingsong Li
wrote:
Hi Danny, +1 for table
hints,
thanks
for
driving.
I took a look to FLIP,
most
of
content
are talking about
query
hints.
It is
hard to discussion and
voting. So
+1
to
split it as Jark
said.
Another thing is
configuration that
suitable to config with
table
hints:
"connector.path" and
"connector.topic",
Are they really
suitable
for
table
hints? Looks weird to me.
Because I
think these properties
are
the
core of
table.
Best,
Jingsong Lee
On Mon, Mar 9, 2020 at
5:30
PM Jark
Wu
<
imj...@gmail.com>
wrote:
Thanks Danny for
starting
the
discussion.
+1 for this feature.
If we just focus on the
table
hints
not the query hints in
this
release,
could you split the
FLIP
into two
FLIPs?
Because it's hard to
vote
on
partial
part of a FLIP. You
can
keep
the table
hints proposal in
FLIP-113
and
move
query hints into
another
FLIP.
So that we can focuse
on
the
table
hints in the FLIP.
Thanks,
Jark
On Mon, 9 Mar 2020 at
17:14,
DONG,
Weike <
kyled...@connect.hku.hk
wrote:
Hi Danny,
This is a nice
feature,
+1.
One thing I am
interested in
but
not
mentioned in the
proposal
is
the
error
handling, as it is
quite
common
for
users to write
inappropriate
hints in
SQL code, if illegal
or
"bad"
hints
are given, would the
system
simply
ignore them or throw
exceptions?
Thanks : )
Best,
Weike
On Mon, Mar 9, 2020
at
5:02 PM
Danny
Chan <
yuzhao....@gmail.com>
wrote:
Note:
we only plan to
support table
hints in Flink release
1.11,
so
please
focus
mainly on the table
hints
part
and
just ignore the
planner
hints, sorry
for
that mistake ~
Best,
Danny Chan
在 2020年3月9日 +0800
PM4:36,Danny
Chan <
yuzhao....@gmail.com
,写道:
Hi, fellows ~
I would like to
propose the
supports for SQL hints for
our
Flink SQL.
We would support
hints
syntax
as
following:
select /*+
NO_HASH_JOIN,
RESOURCE(mem='128mb',
parallelism='24') */
from
emp /*+
INDEX(idx1,
idx2)
*/
join
dept /*+
PROPERTIES(k1='v1',
k2='v2') */
on
emp.deptno =
dept.deptno
Basically we
would
support
both
query hints(after the
SELECT
keyword)
and table
hints(after
the
referenced table name), for
1.11,
we
plan to
only
support table hints
with a
hint
probably named
PROPERTIES:
table_name /*+
PROPERTIES(k1='v1', k2='v2') *+/
I am looking
forward
to
your
comments.
You can access
the
FLIP
here:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-113%3A+SQL+and+Planner+Hints
Best,
Danny Chan