Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Jark Wu Wed, 04 Sep 2019 04:45:07 -0700

Hi all,

Regarding #1 temp function <> built-in function and naming.
I'm fine with temp functions should precede built-in function and can
override built-in functions (we already support to override built-in
function in 1.9).
If we don't allow the same name as a built-in function, I'm afraid we will
have compatibility issues in the future.
Say users register a user defined function named "explode" in 1.9, and we
support a built-in "explode" function in 1.10.
Then the user's jobs which call the registered "explode" function in 1.9
will all fail in 1.10 because of naming conflict.


Regarding #2 "External" built-in functions.
I think if we store external built-in functions in catalog, then
"hive1::sqrt" is a good way to go.
However, I would prefer to support a discovery mechanism (e.g. SPI) for
built-in functions as Timo suggested above.
This gives us the flexibility to add Hive or MySQL or Geo or whatever
function set as built-in functions in an easy way.

Best,
Jark

On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[email protected]> wrote:

> Hi David,
>
> Thank you for sharing your findings. It seems to me that there is no SQL
> standard regarding temporary functions. There are few systems that support
> it. Here are what I have found:
>
> 1. Hive: no DB qualifier allowed. Can overwrite built-in.
> 2. Spark: basically follows Hive (
>
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html
> )
> 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting
> behavior. (
> http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html)
>
> Because of lack of standard, it's perfectly fine for Flink to define
> whatever it sees appropriate. Thus, your proposal (no overwriting and must
> have DB as holder) is one option. The advantage is simplicity, The downside
> is the deviation from Hive, which is popular and de facto standard in big
> data world.
>
> However, I don't think we have to follow Hive. More importantly, we need a
> consensus. I have no objection if your proposal is generally agreed upon.
>
> Thanks,
> Xuefu
>
> On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <[email protected]>
> wrote:
>
> > Hi all,
> >
> > Just an opinion on the built-in <> temporary functions resolution and
> > NAMING issue. I think we should not allow overriding the built-in
> > functions, as this may pose serious issues and to be honest is rather
> > not feasible and would require major rework. What happens if a user
> > wants to override CAST? Calls to that function are generated at
> > different layers of the stack that unfortunately does not always go
> > through the Catalog API (at least yet). Moreover from what I've checked
> > no other systems allow overriding the built-in functions. All the
> > systems I've checked so far register temporary functions in a
> > database/schema (either special database for temporary functions, or
> > just current database). What I would suggest is to always register
> > temporary functions with a 3 part identifier. The same way as tables,
> > views etc. This effectively means you cannot override built-in
> > functions. With such approach it is natural that the temporary functions
> > end up a step lower in the resolution order:
> >
> > 1. built-in functions (1 part, maybe 2? - this is still under discussion)
> >
> > 2. temporary functions (always 3 part path)
> >
> > 3. catalog functions (always 3 part path)
> >
> > Let me know what do you think.
> >
> > Best,
> >
> > Dawid
> >
> > On 04/09/2019 06:13, Bowen Li wrote:
> > > Hi,
> > >
> > > I agree with Xuefu that the main controversial points are mainly the
> two
> > > places. My thoughts on them:
> > >
> > > 1) Determinism of referencing Hive built-in functions. We can either
> > remove
> > > Hive built-in functions from ambiguous function resolution and require
> > > users to use special syntax for their qualified names, or add a config
> > flag
> > > to catalog constructor/yaml for turning on and off Hive built-in
> > functions
> > > with the flag set to 'false' by default and proper doc added to help
> > users
> > > make their decisions.
> > >
> > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous
> > function
> > > resolution order. We believe Flink temp functions should precede Flink
> > > built-in functions, and I have presented my reasons. Just in case if we
> > > cannot reach an agreement, I propose forbid users registering temp
> > > functions in the same name as a built-in function, like MySQL's
> approach,
> > > for the moment. It won't have any performance concern, since built-in
> > > functions are all in memory and thus cost of a name check will be
> really
> > > trivial.
> > >
> > >
> > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[email protected]> wrote:
> > >
> > >> From what I have seen, there are a couple of focal disagreements:
> > >>
> > >> 1. Resolution order: temp function --> flink built-in function -->
> > catalog
> > >> function vs flink built-in function --> temp function -> catalog
> > function.
> > >> 2. "External" built-in functions: how to treat built-in functions in
> > >> external system and how users reference them
> > >>
> > >> For #1, I agree with Bowen that temp function needs to be at the
> highest
> > >> priority because that's how a user might overwrite a built-in function
> > >> without referencing a persistent, overwriting catalog function with a
> > fully
> > >> qualified name. Putting built-in functions at the highest priority
> > >> eliminates that usage.
> > >>
> > >> For #2, I saw a general agreement on referencing "external" built-in
> > >> functions such as those in Hive needs to be explicit and deterministic
> > even
> > >> though different approaches are proposed. To limit the scope and
> simply
> > the
> > >> usage, it seems making sense to me to introduce special syntax for
> > user  to
> > >> explicitly reference an external built-in function such as hive1::sqrt
> > or
> > >> hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API
> > call
> > >> hive1.getFunction(ObjectPath functionName) where the database name is
> > >> absent for bulit-in functions available in that catalog hive1. I
> > understand
> > >> that Bowen's original proposal was trying to avoid this, but this
> could
> > >> turn out to be a clean and simple solution.
> > >>
> > >> (Timo's modular approach is great way to "expand" Flink's built-in
> > function
> > >> set, which seems orthogonal and complementary to this, which could be
> > >> tackled in further future work.)
> > >>
> > >> I'd be happy to hear further thoughts on the two points.
> > >>
> > >> Thanks,
> > >> Xuefu
> > >>
> > >> On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[email protected]> wrote:
> > >>
> > >>> Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is
> > the
> > >>> same
> > >>> as Bowen's. But after thinking about it, I'm currently lean to Timo's
> > >>> suggestion.
> > >>>
> > >>> The reason is backward compatibility. If we follow Bowen's approach,
> > >> let's
> > >>> say we
> > >>> first find function in Flink's built-in functions, and then hive's
> > >>> built-in. For example, `foo`
> > >>> is not supported by Flink, but hive has such built-in function. So
> user
> > >>> will have hive's
> > >>> behavior for function `foo`. And in next release, Flink realize this
> > is a
> > >>> very popular function
> > >>> and add it into Flink's built-in functions, but with different
> behavior
> > >> as
> > >>> hive's. So in next
> > >>> release, the behavior changes.
> > >>>
> > >>> With Timo's approach, IIUC user have to tell the framework explicitly
> > >> what
> > >>> kind of
> > >>> built-in functions he would like to use. He can just tell framework
> to
> > >>> abandon Flink's built-in
> > >>> functions, and use hive's instead. User can only choose between them,
> > but
> > >>> not use
> > >>> them at the same time. I think this approach is more predictable.
> > >>>
> > >>> Best,
> > >>> Kurt
> > >>>
> > >>>
> > >>> On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[email protected]> wrote:
> > >>>
> > >>>> Hi all,
> > >>>>
> > >>>> Thanks for the feedback. Just a kindly reminder that the [Proposal]
> > >>> section
> > >>>> in the google doc was updated, please take a look first and let me
> > know
> > >>> if
> > >>>> you have more questions.
> > >>>>
> > >>>> On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[email protected]>
> wrote:
> > >>>>
> > >>>>> Hi Timo,
> > >>>>>
> > >>>>> Re> 1) We should not have the restriction "hive built-in functions
> > >> can
> > >>>>> only
> > >>>>>> be used when current catalog is hive catalog". Switching a catalog
> > >>>>>> should only have implications on the cat.db.object resolution but
> > >> not
> > >>>>>> functions. It would be quite convinient for users to use Hive
> > >>> built-ins
> > >>>>>> even if they use a Confluent schema registry or just the in-memory
> > >>>>> catalog.
> > >>>>>
> > >>>>> There might be a misunderstanding here.
> > >>>>>
> > >>>>> First of all, Hive built-in functions are not part of Flink
> built-in
> > >>>>> functions, they are catalog functions, thus if the current catalog
> is
> > >>>> not a
> > >>>>> HiveCatalog but, say, a schema registry catalog, ambiguous
> functions
> > >>>>> reference just shouldn't be resolved to a different catalog.
> > >>>>>
> > >>>>> Second, Hive built-in functions can potentially be referenced
> across
> > >>>>> catalog, but it doesn't have db namespace and we currently just
> don't
> > >>>> have
> > >>>>> a SQL syntax for it. It can be enabled when such a SQL syntax is
> > >>> defined,
> > >>>>> e.g. "catalog::function", but it's out of scope of this FLIP.
> > >>>>>
> > >>>>> 2) I would propose to have separate concepts for catalog and
> built-in
> > >>>>> functions. In particular it would be nice to modularize built-in
> > >>>>> functions. Some built-in functions are very crucial (like AS, CAST,
> > >>>>> MINUS), others are more optional but stable (MD5, CONCAT_WS), and
> > >> maybe
> > >>>>> we add more experimental functions in the future or function for
> some
> > >>>>> special application area (Geo functions, ML functions). A data
> > >> platform
> > >>>>> team might not want to make every built-in function available. Or a
> > >>>>> function module like ML functions is in a different Maven module.
> > >>>>>
> > >>>>> I think this is orthogonal to this FLIP, especially we don't have
> the
> > >>>>> "external built-in functions" anymore and currently the built-in
> > >>> function
> > >>>>> category remains untouched.
> > >>>>>
> > >>>>> But just to share some thoughts on the proposal, I'm not sure about
> > >> it:
> > >>>>> - I don't know if any other databases handle built-in functions
> like
> > >>>> that.
> > >>>>> Maybe you can give some examples? IMHO, built-in functions are
> system
> > >>>> info
> > >>>>> and should be deterministic, not depending on loaded libraries. Geo
> > >>>>> functions should be either built-in already or just libraries
> > >>> functions,
> > >>>>> and library functions can be adapted to catalog APIs or of some
> other
> > >>>>> syntax to use
> > >>>>> - I don't know if all use cases stand, and many can be achieved by
> > >>> other
> > >>>>> approaches too. E.g. experimental functions can be taken good care
> of
> > >>> by
> > >>>>> documentations, annotations, etc
> > >>>>> - the proposal basically introduces some concept like a pluggable
> > >>>> built-in
> > >>>>> function catalog, despite the already existing catalog APIs
> > >>>>> - it brings in even more complicated scenarios to the design. E.g.
> > >> how
> > >>> do
> > >>>>> you handle built-in functions in different modules but different
> > >> names?
> > >>>>> In short, I'm not sure if it really stands and it looks like an
> > >>> overkill
> > >>>>> to me. I'd rather not go to that route. Related discussion can be
> on
> > >>> its
> > >>>>> own thread.
> > >>>>>
> > >>>>> 3) Following the suggestion above, we can have a separate discovery
> > >>>>> mechanism for built-in functions. Instead of just going through a
> > >>> static
> > >>>>> list like in BuiltInFunctionDefinitions, a platform team should be
> > >> able
> > >>>>> to select function modules like
> > >>>>> catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
> > >>>>> HiveFunctions) or via service discovery;
> > >>>>>
> > >>>>> Same as above. I'll leave it to its own thread.
> > >>>>>
> > >>>>> re > 3) Dawid and I discussed the resulution order again. I agree
> > >> with
> > >>>>> Kurt
> > >>>>>> that we should unify built-in function (external or internal)
> > >> under a
> > >>>>>> common layer. However, the resolution order should be:
> > >>>>>>   1. built-in functions
> > >>>>>>   2. temporary functions
> > >>>>>>   3. regular catalog resolution logic
> > >>>>>> Otherwise a temporary function could cause clashes with Flink's
> > >>>> built-in
> > >>>>>> functions. If you take a look at other vendors, like SQL Server
> > >> they
> > >>>>>> also do not allow to overwrite built-in functions.
> > >>>>> ”I agree with Kurt that we should unify built-in function (external
> > >> or
> > >>>>> internal) under a common layer.“ <- I don't think this is what Kurt
> > >>>> means.
> > >>>>> Kurt and I are in favor of unifying built-in functions of external
> > >>>> systems
> > >>>>> and catalog functions. Did you type a mistake?
> > >>>>>
> > >>>>> Besides, I'm not sure about the resolution order you proposed.
> > >>> Temporary
> > >>>>> functions have a lifespan over a session and are only visible to
> the
> > >>>>> session owner, they are unique to each user, and users create them
> on
> > >>>>> purpose to be the highest priority in order to overwrite system
> info
> > >>>>> (built-in functions in this case).
> > >>>>>
> > >>>>> In your case, why would users name a temporary function the same
> as a
> > >>>>> built-in function then? Since using that name in ambiguous function
> > >>>>> reference will always be resolved to built-in functions, creating a
> > >>>>> same-named temp function would be meaningless in the end.
> > >>>>>
> > >>>>>
> > >>>>> On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[email protected]>
> wrote:
> > >>>>>
> > >>>>>> Hi Jingsong,
> > >>>>>>
> > >>>>>> Re> 1.Hive built-in functions is an intermediate solution. So we
> > >>> should
> > >>>>>>> not introduce interfaces to influence the framework. To make
> > >>>>>>> Flink itself more powerful, we should implement the functions
> > >>>>>>> we need to add.
> > >>>>>> Yes, please see the doc.
> > >>>>>>
> > >>>>>> Re> 2.Non-flink built-in functions are easy for users to change
> > >> their
> > >>>>>>> behavior. If we support some flink built-in functions in the
> > >>>>>>> future but act differently from non-flink built-in, this will
> lead
> > >>> to
> > >>>>>>> changes in user behavior.
> > >>>>>> There's no such concept as "external built-in functions" any more.
> > >>>>>> Built-in functions of external systems will be treated as special
> > >>>> catalog
> > >>>>>> functions.
> > >>>>>>
> > >>>>>> Re> Another question is, does this fallback include all
> > >>>>>>> hive built-in functions? As far as I know, some hive functions
> > >>>>>>> have some hacky. If possible, can we start with a white list?
> > >>>>>>> Once we implement some functions to flink built-in, we can
> > >>>>>>> also update the whitelist.
> > >>>>>> Yes, that's something we thought of too. I don't think it's super
> > >>>>>> critical to the scope of this FLIP, thus I'd like to leave it to
> > >>> future
> > >>>>>> efforts as a nice-to-have feature.
> > >>>>>>
> > >>>>>>
> > >>>>>> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[email protected]>
> > >> wrote:
> > >>>>>>> Hi Kurt,
> > >>>>>>>
> > >>>>>>> Re: > What I want to propose is we can merge #3 and #4, make them
> > >>> both
> > >>>>>>> under
> > >>>>>>>> "catalog" concept, by extending catalog function to make it have
> > >>>>>>> ability to
> > >>>>>>>> have built-in catalog functions. Some benefits I can see from
> this
> > >>>>>>> approach:
> > >>>>>>>> 1. We don't have to introduce new concept like external built-in
> > >>>>>>> functions.
> > >>>>>>>> Actually I don't see a full story about how to treat a built-in
> > >>>>>>> functions, and it
> > >>>>>>>> seems a little bit disrupt with catalog. As a result, you have
> to
> > >>> make
> > >>>>>>> some restriction
> > >>>>>>>> like "hive built-in functions can only be used when current
> > >> catalog
> > >>> is
> > >>>>>>> hive catalog".
> > >>>>>>>
> > >>>>>>> Yes, I've unified #3 and #4 but it seems I didn't update some
> part
> > >> of
> > >>>>>>> the doc. I've modified those sections, and they are up to date
> now.
> > >>>>>>>
> > >>>>>>> In short, now built-in function of external systems are defined
> as
> > >> a
> > >>>>>>> special kind of catalog function in Flink, and handled by Flink
> as
> > >>>>>>> following:
> > >>>>>>> - An external built-in function must be associated with a catalog
> > >> for
> > >>>>>>> the purpose of decoupling flink-table and external systems.
> > >>>>>>> - It always resides in front of catalog functions in ambiguous
> > >>> function
> > >>>>>>> reference order, just like in its own external system
> > >>>>>>> - It is a special catalog function that doesn’t have a
> > >>> schema/database
> > >>>>>>> namespace
> > >>>>>>> - It goes thru the same instantiation logic as other user defined
> > >>>>>>> catalog functions in the external system
> > >>>>>>>
> > >>>>>>> Please take another look at the doc, and let me know if you have
> > >> more
> > >>>>>>> questions.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[email protected]>
> > >>>> wrote:
> > >>>>>>>> Hi Kurt,
> > >>>>>>>>
> > >>>>>>>> it should not affect the functions and operations we currently
> > >> have
> > >>> in
> > >>>>>>>> SQL. It just categorizes the available built-in functions. It is
> > >>> kind
> > >>>>>>>> of
> > >>>>>>>> an orthogonal concept to the catalog API but built-in functions
> > >>>> deserve
> > >>>>>>>> this special kind of treatment. CatalogFunction still fits
> > >> perfectly
> > >>>> in
> > >>>>>>>> there because the regular catalog object resolution logic is not
> > >>>>>>>> affected. So tables and functions are resolved in the same way
> but
> > >>>> with
> > >>>>>>>> built-in functions that have priority as in the original design.
> > >>>>>>>>
> > >>>>>>>> Regards,
> > >>>>>>>> Timo
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On 03.09.19 15:26, Kurt Young wrote:
> > >>>>>>>>> Does this only affect the functions and operations we currently
> > >>> have
> > >>>>>>>> in SQL
> > >>>>>>>>> and
> > >>>>>>>>> have no effect on tables, right? Looks like this is an
> > >> orthogonal
> > >>>>>>>> concept
> > >>>>>>>>> with Catalog?
> > >>>>>>>>> If the answer are both yes, then the catalog function will be a
> > >>>> weird
> > >>>>>>>>> concept?
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Kurt
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <
> [email protected]
> > >>>>>>>> wrote:
> > >>>>>>>>>> The way you proposed are basically the same as what Calcite
> > >>> does, I
> > >>>>>>>> think
> > >>>>>>>>>> we are in the same line.
> > >>>>>>>>>>
> > >>>>>>>>>> Best,
> > >>>>>>>>>> Danny Chan
> > >>>>>>>>>> 在 2019年9月3日 +0800 PM7:57，Timo Walther <[email protected]
> >，写道：
> > >>>>>>>>>>> This sounds exactly as the module approach I mentioned, no?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regards,
> > >>>>>>>>>>> Timo
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 03.09.19 13:42, Danny Chan wrote:
> > >>>>>>>>>>>> Thanks Bowen for bring up this topic, I think it’s a useful
> > >>>>>>>>>> refactoring to make our function usage more user friendly.
> > >>>>>>>>>>>> For the topic of how to organize the builtin operators and
> > >>>>>>>> operators
> > >>>>>>>>>> of Hive, here is a solution from Apache Calcite, the Calcite
> > >> way
> > >>> is
> > >>>>>>>> to make
> > >>>>>>>>>> every dialect operators a “Library”, user can specify which
> > >>>>>>>> libraries they
> > >>>>>>>>>> want to use for a sql query. The builtin operators always
> comes
> > >>> as
> > >>>>>>>> the
> > >>>>>>>>>> first class objects and the others are used from the order
> they
> > >>>>>>>> appears.
> > >>>>>>>>>> Maybe you can take a reference.
> > >>>>>>>>>>>> [1]
> > >>
> >
> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Danny Chan
> > >>>>>>>>>>>> 在 2019年8月28日 +0800 AM2:50，Bowen Li <[email protected]
> >，写道：
> > >>>>>>>>>>>>> Hi folks,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I'd like to kick off a discussion on reworking Flink's
> > >>>>>>>>>> FunctionCatalog.
> > >>>>>>>>>>>>> It's critically helpful to improve function usability in
> > >> SQL.
> > >>>>>>>>>>>>>
> > >>
> >
> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
> > >>>>>>>>>>>>> In short, it:
> > >>>>>>>>>>>>> - adds support for precise function reference with
> > >>>> fully/partially
> > >>>>>>>>>>>>> qualified name
> > >>>>>>>>>>>>> - redefines function resolution order for ambiguous
> function
> > >>>>>>>>>> reference
> > >>>>>>>>>>>>> - adds support for Hive's rich built-in functions (support
> > >> for
> > >>>>>>>> Hive
> > >>>>>>>>>> user
> > >>>>>>>>>>>>> defined functions was already added in 1.9.0)
> > >>>>>>>>>>>>> - clarifies the concept of temporary functions
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Would love to hear your thoughts.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Bowen
> > >>>>>>>>
> > >>
> > >> --
> > >> Xuefu Zhang
> > >>
> > >> "In Honey We Trust!"
> > >>
> >
> >
>
> --
> Xuefu Zhang
>
> "In Honey We Trust!"
>

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Reply via email to