Hi Bowen,
thanks for your proposal. Here are some thoughts:
1) We should not have the restriction "hive built-in functions can only
be used when current catalog is hive catalog". Switching a catalog
should only have implications on the cat.db.object resolution but not
functions. It would be quite convinient for users to use Hive built-ins
even if they use a Confluent schema registry or just the in-memory catalog.
2) I would propose to have separate concepts for catalog and built-in
functions. In particular it would be nice to modularize built-in
functions. Some built-in functions are very crucial (like AS, CAST,
MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe
we add more experimental functions in the future or function for some
special application area (Geo functions, ML functions). A data platform
team might not want to make every built-in function available. Or a
function module like ML functions is in a different Maven module.
3) Following the suggestion above, we can have a separate discovery
mechanism for built-in functions. Instead of just going through a static
list like in BuiltInFunctionDefinitions, a platform team should be able
to select function modules like
catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
HiveFunctions) or via service discovery;
3) Dawid and I discussed the resulution order again. I agree with Kurt
that we should unify built-in function (external or internal) under a
common layer. However, the resolution order should be:
1. built-in functions
2. temporary functions
3. regular catalog resolution logic
Otherwise a temporary function could cause clashes with Flink's built-in
functions. If you take a look at other vendors, like SQL Server they
also do not allow to overwrite built-in functions.
Regards,
Timo
On 03.09.19 10:35, JingsongLee wrote:
Thanks Bowen:
+1 for this. And +1 to Kurt's suggestion. My other points are:
1.Hive built-in functions is an intermediate solution. So we should
not introduce interfaces to influence the framework. To make
Flink itself more powerful, we should implement the functions
we need to add.
2.Non-flink built-in functions are easy for users to change their
behavior. If we support some flink built-in functions in the
future but act differently from non-flink built-in, this will lead to
changes in user behavior.
3.Fallback to Non-flink built-in functions is a bad choice to
performance. Without flink internal codegen and data format,
and bring data format conversion, the performance is not so
good.
We need to support more complete hive jobs now, we need to
have this fallback strategy. But it's not worth adding this
concept at the catalog interface level, and it's not worth
encouraging other catalogs to do so.
Another question is, does this fallback include all
hive built-in functions? As far as I know, some hive functions
have some hacky. If possible, can we start with a white list?
Once we implement some functions to flink built-in, we can
also update the whitelist.
Best,
Jingsong Lee
------------------------------------------------------------------
From:Kurt Young <ykt...@gmail.com>
Send Time:2019年9月3日(星期二) 15:41
To:dev <dev@flink.apache.org>
Subject:Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog
Thanks Bowen for driving this.
+1 for the general idea. It makes the function resolved behavior more
clear and deterministic. Besides, the user can use all hive built-in
functions, which is a great feature.
I only have one comment, but maybe it may touch your design so I think
it would make sense to reply this mail instead of comment on google doc.
Regarding to the classfication of functions, you currently have 4 types
of functions, which are:
1. temporary functions
2. Flink built-in functions
3. Hive built-in functions (or generalized as external built-in functions)
4. catalog functions
What I want to propose is we can merge #3 and #4, make them both under
"catalog" concept, by extending catalog function to make it have ability to
have built-in catalog functions. Some benefits I can see from this approach:
1. We don't have to introduce new concept like external built-in functions.
Actually
I don't see a full story about how to treat a built-in functions, and it
seems a little
bit disrupt with catalog. As a result, you have to make some restriction
like "hive
built-in functions can only be used when current catalog is hive catalog".
2. It makes us easier to adopt another system's built-in functions to
Flink, such as
MySQL. If we don't treat uniformly with "external built-in functions" and
"external
catalog function", things like user set current catalog to hive but want to
use MySQL's
built-in function will happen.
One more thing, follow this approach, it's clear for your question about
how to support
external built-in functions, which is "add a getBuiltInFunction to current
Catalog API".
What do you think?
Best,
Kurt
On Fri, Aug 30, 2019 at 7:14 AM Bowen Li <bowenl...@gmail.com> wrote:
Thanks everyone for the feedback.
I have updated the document accordingly. Here're the summary of changes:
- clarify the concept of temporary functions, to facilitate deciding
function resolution order
- provide two options to support Hive built-in functions, with the 2nd one
being preferred
- add detailed prototype code for FunctionCatalog#lookupFunction(name)
- move the section of ”rename existing FunctionCatalog APIs in favor of
temporary functions“ out of the scope of the FLIP
- add another reasonable limitation for function resolution, to not
consider resolving overloaded functions - those with the same name but
different params. (It's still valid to have a single function with
overloaded eval() methods)
Please take another look.
Thanks,
Bowen
On Tue, Aug 27, 2019 at 11:49 AM Bowen Li <bowenl...@gmail.com> wrote:
Hi folks,
I'd like to kick off a discussion on reworking Flink's FunctionCatalog.
It's critically helpful to improve function usability in SQL.
https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
In short, it:
- adds support for precise function reference with fully/partially
qualified name
- redefines function resolution order for ambiguous function reference
- adds support for Hive's rich built-in functions (support for Hive user
defined functions was already added in 1.9.0)
- clarifies the concept of temporary functions
Would love to hear your thoughts.
Bowen