Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Timo Walther Tue, 03 Sep 2019 02:22:59 -0700

Hi Bowen,

thanks for your proposal. Here are some thoughts:

1) We should not have the restriction "hive built-in functions can onlybe used when current catalog is hive catalog". Switching a catalogshould only have implications on the cat.db.object resolution but notfunctions. It would be quite convinient for users to use Hive built-inseven if they use a Confluent schema registry or just the in-memory catalog.

2) I would propose to have separate concepts for catalog and built-infunctions. In particular it would be nice to modularize built-infunctions. Some built-in functions are very crucial (like AS, CAST,MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybewe add more experimental functions in the future or function for somespecial application area (Geo functions, ML functions). A data platformteam might not want to make every built-in function available. Or afunction module like ML functions is in a different Maven module.

3) Following the suggestion above, we can have a separate discoverymechanism for built-in functions. Instead of just going through a staticlist like in BuiltInFunctionDefinitions, a platform team should be ableto select function modules likecatalogManager.setFunctionModules(CoreFunctions, GeoFunctions,HiveFunctions) or via service discovery;

3) Dawid and I discussed the resulution order again. I agree with Kurtthat we should unify built-in function (external or internal) under acommon layer. However, the resolution order should be:

  1. built-in functions
  2. temporary functions
  3. regular catalog resolution logic

Otherwise a temporary function could cause clashes with Flink's built-infunctions. If you take a look at other vendors, like SQL Server theyalso do not allow to overwrite built-in functions.


Regards,
Timo


On 03.09.19 10:35, JingsongLee wrote:

Thanks Bowen:

+1 for this. And +1 to Kurt's suggestion. My other points are:

1.Hive built-in functions is an intermediate solution. So we should
  not introduce interfaces to influence the framework. To make
  Flink itself more powerful, we should implement the functions
  we need to add.

2.Non-flink built-in functions are easy for users to change their
behavior. If we support some flink built-in functions in the
  future but act differently from non-flink built-in, this will lead to
  changes in user behavior.

3.Fallback to Non-flink built-in functions is a bad choice to
  performance. Without flink internal codegen and data format,
  and bring data format conversion, the performance is not so
  good.

We need to support more complete hive jobs now, we need to
  have this fallback strategy. But it's not worth adding this
  concept at the catalog interface level, and it's not worth
  encouraging other catalogs to do so.

Another question is, does this fallback include all
  hive built-in functions? As far as I know, some hive functions
  have some hacky. If possible, can we start with a white list?
Once we implement some functions to flink built-in, we can
also update the whitelist.

Best,
Jingsong Lee


------------------------------------------------------------------
From:Kurt Young <ykt...@gmail.com>
Send Time:2019年9月3日(星期二) 15:41
To:dev <dev@flink.apache.org>
Subject:Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Thanks Bowen for driving this.

+1 for the general idea. It makes the function resolved behavior more
clear and deterministic. Besides, the user can use all hive built-in
functions, which is a great feature.

I only have one comment, but maybe it may touch your design so I think
it would make sense to reply this mail instead of comment on google doc.
Regarding to the classfication of functions, you currently have 4 types
of functions, which are:
1. temporary functions
2. Flink built-in functions
3. Hive built-in functions (or generalized as external built-in functions)
4. catalog functions

What I want to propose is we can merge #3 and #4, make them both under
"catalog" concept, by extending catalog function to make it have ability to
have built-in catalog functions. Some benefits I can see from this approach:
1. We don't have to introduce new concept like external built-in functions.
Actually
I don't see a full story about how to treat a built-in functions, and it
seems a little
bit disrupt with catalog. As a result, you have to make some restriction
like "hive
built-in functions can only be used when current catalog is hive catalog".

2. It makes us easier to adopt another system's built-in functions to
Flink, such as
MySQL. If we don't treat uniformly with  "external built-in functions" and
"external
catalog function", things like user set current catalog to hive but want to
use MySQL's
built-in function will happen.

One more thing, follow this approach, it's clear for your question about
how to support
external built-in functions, which is "add a  getBuiltInFunction to current
Catalog API".

What do you think?

Best,
Kurt


On Fri, Aug 30, 2019 at 7:14 AM Bowen Li <bowenl...@gmail.com> wrote:

Thanks everyone for the feedback.

I have updated the document accordingly. Here're the summary of changes:

- clarify the concept of temporary functions, to facilitate deciding
function resolution order
- provide two options to support Hive built-in functions, with the 2nd one
being preferred
- add detailed prototype code for FunctionCatalog#lookupFunction(name)
- move the section of ”rename existing FunctionCatalog APIs in favor of
temporary functions“ out of the scope of the FLIP
- add another reasonable limitation for function resolution, to not
consider resolving overloaded functions - those with the same name but
different params. (It's still valid to have a single function with
overloaded eval() methods)

Please take another look.

Thanks,
Bowen

On Tue, Aug 27, 2019 at 11:49 AM Bowen Li <bowenl...@gmail.com> wrote:

Hi folks,

I'd like to kick off a discussion on reworking Flink's FunctionCatalog.
It's critically helpful to improve function usability in SQL.

https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing

In short, it:
- adds support for precise function reference with fully/partially
qualified name
- redefines function resolution order for ambiguous function reference
- adds support for Hive's rich built-in functions (support for Hive user
defined functions was already added in 1.9.0)
- clarifies the concept of temporary functions

Would love to hear your thoughts.

Bowen

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Reply via email to