Hi all, Regarding #1 temp function <> built-in function and naming. I'm fine with temp functions should precede built-in function and can override built-in functions (we already support to override built-in function in 1.9). If we don't allow the same name as a built-in function, I'm afraid we will have compatibility issues in the future. Say users register a user defined function named "explode" in 1.9, and we support a built-in "explode" function in 1.10. Then the user's jobs which call the registered "explode" function in 1.9 will all fail in 1.10 because of naming conflict.
Regarding #2 "External" built-in functions. I think if we store external built-in functions in catalog, then "hive1::sqrt" is a good way to go. However, I would prefer to support a discovery mechanism (e.g. SPI) for built-in functions as Timo suggested above. This gives us the flexibility to add Hive or MySQL or Geo or whatever function set as built-in functions in an easy way. Best, Jark On Wed, 4 Sep 2019 at 17:47, Xuefu Z <usxu...@gmail.com> wrote: > Hi David, > > Thank you for sharing your findings. It seems to me that there is no SQL > standard regarding temporary functions. There are few systems that support > it. Here are what I have found: > > 1. Hive: no DB qualifier allowed. Can overwrite built-in. > 2. Spark: basically follows Hive ( > > https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html > ) > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting > behavior. ( > http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html) > > Because of lack of standard, it's perfectly fine for Flink to define > whatever it sees appropriate. Thus, your proposal (no overwriting and must > have DB as holder) is one option. The advantage is simplicity, The downside > is the deviation from Hive, which is popular and de facto standard in big > data world. > > However, I don't think we have to follow Hive. More importantly, we need a > consensus. I have no objection if your proposal is generally agreed upon. > > Thanks, > Xuefu > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <dwysakow...@apache.org> > wrote: > > > Hi all, > > > > Just an opinion on the built-in <> temporary functions resolution and > > NAMING issue. I think we should not allow overriding the built-in > > functions, as this may pose serious issues and to be honest is rather > > not feasible and would require major rework. What happens if a user > > wants to override CAST? Calls to that function are generated at > > different layers of the stack that unfortunately does not always go > > through the Catalog API (at least yet). Moreover from what I've checked > > no other systems allow overriding the built-in functions. All the > > systems I've checked so far register temporary functions in a > > database/schema (either special database for temporary functions, or > > just current database). What I would suggest is to always register > > temporary functions with a 3 part identifier. The same way as tables, > > views etc. This effectively means you cannot override built-in > > functions. With such approach it is natural that the temporary functions > > end up a step lower in the resolution order: > > > > 1. built-in functions (1 part, maybe 2? - this is still under discussion) > > > > 2. temporary functions (always 3 part path) > > > > 3. catalog functions (always 3 part path) > > > > Let me know what do you think. > > > > Best, > > > > Dawid > > > > On 04/09/2019 06:13, Bowen Li wrote: > > > Hi, > > > > > > I agree with Xuefu that the main controversial points are mainly the > two > > > places. My thoughts on them: > > > > > > 1) Determinism of referencing Hive built-in functions. We can either > > remove > > > Hive built-in functions from ambiguous function resolution and require > > > users to use special syntax for their qualified names, or add a config > > flag > > > to catalog constructor/yaml for turning on and off Hive built-in > > functions > > > with the flag set to 'false' by default and proper doc added to help > > users > > > make their decisions. > > > > > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous > > function > > > resolution order. We believe Flink temp functions should precede Flink > > > built-in functions, and I have presented my reasons. Just in case if we > > > cannot reach an agreement, I propose forbid users registering temp > > > functions in the same name as a built-in function, like MySQL's > approach, > > > for the moment. It won't have any performance concern, since built-in > > > functions are all in memory and thus cost of a name check will be > really > > > trivial. > > > > > > > > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <usxu...@gmail.com> wrote: > > > > > >> From what I have seen, there are a couple of focal disagreements: > > >> > > >> 1. Resolution order: temp function --> flink built-in function --> > > catalog > > >> function vs flink built-in function --> temp function -> catalog > > function. > > >> 2. "External" built-in functions: how to treat built-in functions in > > >> external system and how users reference them > > >> > > >> For #1, I agree with Bowen that temp function needs to be at the > highest > > >> priority because that's how a user might overwrite a built-in function > > >> without referencing a persistent, overwriting catalog function with a > > fully > > >> qualified name. Putting built-in functions at the highest priority > > >> eliminates that usage. > > >> > > >> For #2, I saw a general agreement on referencing "external" built-in > > >> functions such as those in Hive needs to be explicit and deterministic > > even > > >> though different approaches are proposed. To limit the scope and > simply > > the > > >> usage, it seems making sense to me to introduce special syntax for > > user to > > >> explicitly reference an external built-in function such as hive1::sqrt > > or > > >> hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API > > call > > >> hive1.getFunction(ObjectPath functionName) where the database name is > > >> absent for bulit-in functions available in that catalog hive1. I > > understand > > >> that Bowen's original proposal was trying to avoid this, but this > could > > >> turn out to be a clean and simple solution. > > >> > > >> (Timo's modular approach is great way to "expand" Flink's built-in > > function > > >> set, which seems orthogonal and complementary to this, which could be > > >> tackled in further future work.) > > >> > > >> I'd be happy to hear further thoughts on the two points. > > >> > > >> Thanks, > > >> Xuefu > > >> > > >> On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <ykt...@gmail.com> wrote: > > >> > > >>> Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is > > the > > >>> same > > >>> as Bowen's. But after thinking about it, I'm currently lean to Timo's > > >>> suggestion. > > >>> > > >>> The reason is backward compatibility. If we follow Bowen's approach, > > >> let's > > >>> say we > > >>> first find function in Flink's built-in functions, and then hive's > > >>> built-in. For example, `foo` > > >>> is not supported by Flink, but hive has such built-in function. So > user > > >>> will have hive's > > >>> behavior for function `foo`. And in next release, Flink realize this > > is a > > >>> very popular function > > >>> and add it into Flink's built-in functions, but with different > behavior > > >> as > > >>> hive's. So in next > > >>> release, the behavior changes. > > >>> > > >>> With Timo's approach, IIUC user have to tell the framework explicitly > > >> what > > >>> kind of > > >>> built-in functions he would like to use. He can just tell framework > to > > >>> abandon Flink's built-in > > >>> functions, and use hive's instead. User can only choose between them, > > but > > >>> not use > > >>> them at the same time. I think this approach is more predictable. > > >>> > > >>> Best, > > >>> Kurt > > >>> > > >>> > > >>> On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <bowenl...@gmail.com> wrote: > > >>> > > >>>> Hi all, > > >>>> > > >>>> Thanks for the feedback. Just a kindly reminder that the [Proposal] > > >>> section > > >>>> in the google doc was updated, please take a look first and let me > > know > > >>> if > > >>>> you have more questions. > > >>>> > > >>>> On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <bowenl...@gmail.com> > wrote: > > >>>> > > >>>>> Hi Timo, > > >>>>> > > >>>>> Re> 1) We should not have the restriction "hive built-in functions > > >> can > > >>>>> only > > >>>>>> be used when current catalog is hive catalog". Switching a catalog > > >>>>>> should only have implications on the cat.db.object resolution but > > >> not > > >>>>>> functions. It would be quite convinient for users to use Hive > > >>> built-ins > > >>>>>> even if they use a Confluent schema registry or just the in-memory > > >>>>> catalog. > > >>>>> > > >>>>> There might be a misunderstanding here. > > >>>>> > > >>>>> First of all, Hive built-in functions are not part of Flink > built-in > > >>>>> functions, they are catalog functions, thus if the current catalog > is > > >>>> not a > > >>>>> HiveCatalog but, say, a schema registry catalog, ambiguous > functions > > >>>>> reference just shouldn't be resolved to a different catalog. > > >>>>> > > >>>>> Second, Hive built-in functions can potentially be referenced > across > > >>>>> catalog, but it doesn't have db namespace and we currently just > don't > > >>>> have > > >>>>> a SQL syntax for it. It can be enabled when such a SQL syntax is > > >>> defined, > > >>>>> e.g. "catalog::function", but it's out of scope of this FLIP. > > >>>>> > > >>>>> 2) I would propose to have separate concepts for catalog and > built-in > > >>>>> functions. In particular it would be nice to modularize built-in > > >>>>> functions. Some built-in functions are very crucial (like AS, CAST, > > >>>>> MINUS), others are more optional but stable (MD5, CONCAT_WS), and > > >> maybe > > >>>>> we add more experimental functions in the future or function for > some > > >>>>> special application area (Geo functions, ML functions). A data > > >> platform > > >>>>> team might not want to make every built-in function available. Or a > > >>>>> function module like ML functions is in a different Maven module. > > >>>>> > > >>>>> I think this is orthogonal to this FLIP, especially we don't have > the > > >>>>> "external built-in functions" anymore and currently the built-in > > >>> function > > >>>>> category remains untouched. > > >>>>> > > >>>>> But just to share some thoughts on the proposal, I'm not sure about > > >> it: > > >>>>> - I don't know if any other databases handle built-in functions > like > > >>>> that. > > >>>>> Maybe you can give some examples? IMHO, built-in functions are > system > > >>>> info > > >>>>> and should be deterministic, not depending on loaded libraries. Geo > > >>>>> functions should be either built-in already or just libraries > > >>> functions, > > >>>>> and library functions can be adapted to catalog APIs or of some > other > > >>>>> syntax to use > > >>>>> - I don't know if all use cases stand, and many can be achieved by > > >>> other > > >>>>> approaches too. E.g. experimental functions can be taken good care > of > > >>> by > > >>>>> documentations, annotations, etc > > >>>>> - the proposal basically introduces some concept like a pluggable > > >>>> built-in > > >>>>> function catalog, despite the already existing catalog APIs > > >>>>> - it brings in even more complicated scenarios to the design. E.g. > > >> how > > >>> do > > >>>>> you handle built-in functions in different modules but different > > >> names? > > >>>>> In short, I'm not sure if it really stands and it looks like an > > >>> overkill > > >>>>> to me. I'd rather not go to that route. Related discussion can be > on > > >>> its > > >>>>> own thread. > > >>>>> > > >>>>> 3) Following the suggestion above, we can have a separate discovery > > >>>>> mechanism for built-in functions. Instead of just going through a > > >>> static > > >>>>> list like in BuiltInFunctionDefinitions, a platform team should be > > >> able > > >>>>> to select function modules like > > >>>>> catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > > >>>>> HiveFunctions) or via service discovery; > > >>>>> > > >>>>> Same as above. I'll leave it to its own thread. > > >>>>> > > >>>>> re > 3) Dawid and I discussed the resulution order again. I agree > > >> with > > >>>>> Kurt > > >>>>>> that we should unify built-in function (external or internal) > > >> under a > > >>>>>> common layer. However, the resolution order should be: > > >>>>>> 1. built-in functions > > >>>>>> 2. temporary functions > > >>>>>> 3. regular catalog resolution logic > > >>>>>> Otherwise a temporary function could cause clashes with Flink's > > >>>> built-in > > >>>>>> functions. If you take a look at other vendors, like SQL Server > > >> they > > >>>>>> also do not allow to overwrite built-in functions. > > >>>>> ”I agree with Kurt that we should unify built-in function (external > > >> or > > >>>>> internal) under a common layer.“ <- I don't think this is what Kurt > > >>>> means. > > >>>>> Kurt and I are in favor of unifying built-in functions of external > > >>>> systems > > >>>>> and catalog functions. Did you type a mistake? > > >>>>> > > >>>>> Besides, I'm not sure about the resolution order you proposed. > > >>> Temporary > > >>>>> functions have a lifespan over a session and are only visible to > the > > >>>>> session owner, they are unique to each user, and users create them > on > > >>>>> purpose to be the highest priority in order to overwrite system > info > > >>>>> (built-in functions in this case). > > >>>>> > > >>>>> In your case, why would users name a temporary function the same > as a > > >>>>> built-in function then? Since using that name in ambiguous function > > >>>>> reference will always be resolved to built-in functions, creating a > > >>>>> same-named temp function would be meaningless in the end. > > >>>>> > > >>>>> > > >>>>> On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <bowenl...@gmail.com> > wrote: > > >>>>> > > >>>>>> Hi Jingsong, > > >>>>>> > > >>>>>> Re> 1.Hive built-in functions is an intermediate solution. So we > > >>> should > > >>>>>>> not introduce interfaces to influence the framework. To make > > >>>>>>> Flink itself more powerful, we should implement the functions > > >>>>>>> we need to add. > > >>>>>> Yes, please see the doc. > > >>>>>> > > >>>>>> Re> 2.Non-flink built-in functions are easy for users to change > > >> their > > >>>>>>> behavior. If we support some flink built-in functions in the > > >>>>>>> future but act differently from non-flink built-in, this will > lead > > >>> to > > >>>>>>> changes in user behavior. > > >>>>>> There's no such concept as "external built-in functions" any more. > > >>>>>> Built-in functions of external systems will be treated as special > > >>>> catalog > > >>>>>> functions. > > >>>>>> > > >>>>>> Re> Another question is, does this fallback include all > > >>>>>>> hive built-in functions? As far as I know, some hive functions > > >>>>>>> have some hacky. If possible, can we start with a white list? > > >>>>>>> Once we implement some functions to flink built-in, we can > > >>>>>>> also update the whitelist. > > >>>>>> Yes, that's something we thought of too. I don't think it's super > > >>>>>> critical to the scope of this FLIP, thus I'd like to leave it to > > >>> future > > >>>>>> efforts as a nice-to-have feature. > > >>>>>> > > >>>>>> > > >>>>>> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <bowenl...@gmail.com> > > >> wrote: > > >>>>>>> Hi Kurt, > > >>>>>>> > > >>>>>>> Re: > What I want to propose is we can merge #3 and #4, make them > > >>> both > > >>>>>>> under > > >>>>>>>> "catalog" concept, by extending catalog function to make it have > > >>>>>>> ability to > > >>>>>>>> have built-in catalog functions. Some benefits I can see from > this > > >>>>>>> approach: > > >>>>>>>> 1. We don't have to introduce new concept like external built-in > > >>>>>>> functions. > > >>>>>>>> Actually I don't see a full story about how to treat a built-in > > >>>>>>> functions, and it > > >>>>>>>> seems a little bit disrupt with catalog. As a result, you have > to > > >>> make > > >>>>>>> some restriction > > >>>>>>>> like "hive built-in functions can only be used when current > > >> catalog > > >>> is > > >>>>>>> hive catalog". > > >>>>>>> > > >>>>>>> Yes, I've unified #3 and #4 but it seems I didn't update some > part > > >> of > > >>>>>>> the doc. I've modified those sections, and they are up to date > now. > > >>>>>>> > > >>>>>>> In short, now built-in function of external systems are defined > as > > >> a > > >>>>>>> special kind of catalog function in Flink, and handled by Flink > as > > >>>>>>> following: > > >>>>>>> - An external built-in function must be associated with a catalog > > >> for > > >>>>>>> the purpose of decoupling flink-table and external systems. > > >>>>>>> - It always resides in front of catalog functions in ambiguous > > >>> function > > >>>>>>> reference order, just like in its own external system > > >>>>>>> - It is a special catalog function that doesn’t have a > > >>> schema/database > > >>>>>>> namespace > > >>>>>>> - It goes thru the same instantiation logic as other user defined > > >>>>>>> catalog functions in the external system > > >>>>>>> > > >>>>>>> Please take another look at the doc, and let me know if you have > > >> more > > >>>>>>> questions. > > >>>>>>> > > >>>>>>> > > >>>>>>> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <twal...@apache.org> > > >>>> wrote: > > >>>>>>>> Hi Kurt, > > >>>>>>>> > > >>>>>>>> it should not affect the functions and operations we currently > > >> have > > >>> in > > >>>>>>>> SQL. It just categorizes the available built-in functions. It is > > >>> kind > > >>>>>>>> of > > >>>>>>>> an orthogonal concept to the catalog API but built-in functions > > >>>> deserve > > >>>>>>>> this special kind of treatment. CatalogFunction still fits > > >> perfectly > > >>>> in > > >>>>>>>> there because the regular catalog object resolution logic is not > > >>>>>>>> affected. So tables and functions are resolved in the same way > but > > >>>> with > > >>>>>>>> built-in functions that have priority as in the original design. > > >>>>>>>> > > >>>>>>>> Regards, > > >>>>>>>> Timo > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> On 03.09.19 15:26, Kurt Young wrote: > > >>>>>>>>> Does this only affect the functions and operations we currently > > >>> have > > >>>>>>>> in SQL > > >>>>>>>>> and > > >>>>>>>>> have no effect on tables, right? Looks like this is an > > >> orthogonal > > >>>>>>>> concept > > >>>>>>>>> with Catalog? > > >>>>>>>>> If the answer are both yes, then the catalog function will be a > > >>>> weird > > >>>>>>>>> concept? > > >>>>>>>>> > > >>>>>>>>> Best, > > >>>>>>>>> Kurt > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> On Tue, Sep 3, 2019 at 8:10 PM Danny Chan < > yuzhao....@gmail.com > > >>>>>>>> wrote: > > >>>>>>>>>> The way you proposed are basically the same as what Calcite > > >>> does, I > > >>>>>>>> think > > >>>>>>>>>> we are in the same line. > > >>>>>>>>>> > > >>>>>>>>>> Best, > > >>>>>>>>>> Danny Chan > > >>>>>>>>>> 在 2019年9月3日 +0800 PM7:57,Timo Walther <twal...@apache.org > >,写道: > > >>>>>>>>>>> This sounds exactly as the module approach I mentioned, no? > > >>>>>>>>>>> > > >>>>>>>>>>> Regards, > > >>>>>>>>>>> Timo > > >>>>>>>>>>> > > >>>>>>>>>>> On 03.09.19 13:42, Danny Chan wrote: > > >>>>>>>>>>>> Thanks Bowen for bring up this topic, I think it’s a useful > > >>>>>>>>>> refactoring to make our function usage more user friendly. > > >>>>>>>>>>>> For the topic of how to organize the builtin operators and > > >>>>>>>> operators > > >>>>>>>>>> of Hive, here is a solution from Apache Calcite, the Calcite > > >> way > > >>> is > > >>>>>>>> to make > > >>>>>>>>>> every dialect operators a “Library”, user can specify which > > >>>>>>>> libraries they > > >>>>>>>>>> want to use for a sql query. The builtin operators always > comes > > >>> as > > >>>>>>>> the > > >>>>>>>>>> first class objects and the others are used from the order > they > > >>>>>>>> appears. > > >>>>>>>>>> Maybe you can take a reference. > > >>>>>>>>>>>> [1] > > >> > > > https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 > > >>>>>>>>>>>> Best, > > >>>>>>>>>>>> Danny Chan > > >>>>>>>>>>>> 在 2019年8月28日 +0800 AM2:50,Bowen Li <bowenl...@gmail.com > >,写道: > > >>>>>>>>>>>>> Hi folks, > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> I'd like to kick off a discussion on reworking Flink's > > >>>>>>>>>> FunctionCatalog. > > >>>>>>>>>>>>> It's critically helpful to improve function usability in > > >> SQL. > > >>>>>>>>>>>>> > > >> > > > https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing > > >>>>>>>>>>>>> In short, it: > > >>>>>>>>>>>>> - adds support for precise function reference with > > >>>> fully/partially > > >>>>>>>>>>>>> qualified name > > >>>>>>>>>>>>> - redefines function resolution order for ambiguous > function > > >>>>>>>>>> reference > > >>>>>>>>>>>>> - adds support for Hive's rich built-in functions (support > > >> for > > >>>>>>>> Hive > > >>>>>>>>>> user > > >>>>>>>>>>>>> defined functions was already added in 1.9.0) > > >>>>>>>>>>>>> - clarifies the concept of temporary functions > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Would love to hear your thoughts. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Bowen > > >>>>>>>> > > >> > > >> -- > > >> Xuefu Zhang > > >> > > >> "In Honey We Trust!" > > >> > > > > > > -- > Xuefu Zhang > > "In Honey We Trust!" >