Hi all, Thanks for the feedback. Just a kindly reminder that the [Proposal] section in the google doc was updated, please take a look first and let me know if you have more questions.
On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <bowenl...@gmail.com> wrote: > Hi Timo, > > Re> 1) We should not have the restriction "hive built-in functions can > only > > be used when current catalog is hive catalog". Switching a catalog > > should only have implications on the cat.db.object resolution but not > > functions. It would be quite convinient for users to use Hive built-ins > > even if they use a Confluent schema registry or just the in-memory > catalog. > > There might be a misunderstanding here. > > First of all, Hive built-in functions are not part of Flink built-in > functions, they are catalog functions, thus if the current catalog is not a > HiveCatalog but, say, a schema registry catalog, ambiguous functions > reference just shouldn't be resolved to a different catalog. > > Second, Hive built-in functions can potentially be referenced across > catalog, but it doesn't have db namespace and we currently just don't have > a SQL syntax for it. It can be enabled when such a SQL syntax is defined, > e.g. "catalog::function", but it's out of scope of this FLIP. > > 2) I would propose to have separate concepts for catalog and built-in > functions. In particular it would be nice to modularize built-in > functions. Some built-in functions are very crucial (like AS, CAST, > MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe > we add more experimental functions in the future or function for some > special application area (Geo functions, ML functions). A data platform > team might not want to make every built-in function available. Or a > function module like ML functions is in a different Maven module. > > I think this is orthogonal to this FLIP, especially we don't have the > "external built-in functions" anymore and currently the built-in function > category remains untouched. > > But just to share some thoughts on the proposal, I'm not sure about it: > - I don't know if any other databases handle built-in functions like that. > Maybe you can give some examples? IMHO, built-in functions are system info > and should be deterministic, not depending on loaded libraries. Geo > functions should be either built-in already or just libraries functions, > and library functions can be adapted to catalog APIs or of some other > syntax to use > - I don't know if all use cases stand, and many can be achieved by other > approaches too. E.g. experimental functions can be taken good care of by > documentations, annotations, etc > - the proposal basically introduces some concept like a pluggable built-in > function catalog, despite the already existing catalog APIs > - it brings in even more complicated scenarios to the design. E.g. how do > you handle built-in functions in different modules but different names? > > In short, I'm not sure if it really stands and it looks like an overkill > to me. I'd rather not go to that route. Related discussion can be on its > own thread. > > 3) Following the suggestion above, we can have a separate discovery > mechanism for built-in functions. Instead of just going through a static > list like in BuiltInFunctionDefinitions, a platform team should be able > to select function modules like > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > HiveFunctions) or via service discovery; > > Same as above. I'll leave it to its own thread. > > re > 3) Dawid and I discussed the resulution order again. I agree with > Kurt > > that we should unify built-in function (external or internal) under a > > common layer. However, the resolution order should be: > > 1. built-in functions > > 2. temporary functions > > 3. regular catalog resolution logic > > Otherwise a temporary function could cause clashes with Flink's built-in > > functions. If you take a look at other vendors, like SQL Server they > > also do not allow to overwrite built-in functions. > > ”I agree with Kurt that we should unify built-in function (external or > internal) under a common layer.“ <- I don't think this is what Kurt means. > Kurt and I are in favor of unifying built-in functions of external systems > and catalog functions. Did you type a mistake? > > Besides, I'm not sure about the resolution order you proposed. Temporary > functions have a lifespan over a session and are only visible to the > session owner, they are unique to each user, and users create them on > purpose to be the highest priority in order to overwrite system info > (built-in functions in this case). > > In your case, why would users name a temporary function the same as a > built-in function then? Since using that name in ambiguous function > reference will always be resolved to built-in functions, creating a > same-named temp function would be meaningless in the end. > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <bowenl...@gmail.com> wrote: > >> Hi Jingsong, >> >> Re> 1.Hive built-in functions is an intermediate solution. So we should >> > not introduce interfaces to influence the framework. To make >> > Flink itself more powerful, we should implement the functions >> > we need to add. >> >> Yes, please see the doc. >> >> Re> 2.Non-flink built-in functions are easy for users to change their >> > behavior. If we support some flink built-in functions in the >> > future but act differently from non-flink built-in, this will lead to >> > changes in user behavior. >> >> There's no such concept as "external built-in functions" any more. >> Built-in functions of external systems will be treated as special catalog >> functions. >> >> Re> Another question is, does this fallback include all >> > hive built-in functions? As far as I know, some hive functions >> > have some hacky. If possible, can we start with a white list? >> > Once we implement some functions to flink built-in, we can >> > also update the whitelist. >> >> Yes, that's something we thought of too. I don't think it's super >> critical to the scope of this FLIP, thus I'd like to leave it to future >> efforts as a nice-to-have feature. >> >> >> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <bowenl...@gmail.com> wrote: >> >>> Hi Kurt, >>> >>> Re: > What I want to propose is we can merge #3 and #4, make them both >>> under >>> >"catalog" concept, by extending catalog function to make it have >>> ability to >>> >have built-in catalog functions. Some benefits I can see from this >>> approach: >>> >1. We don't have to introduce new concept like external built-in >>> functions. >>> >Actually I don't see a full story about how to treat a built-in >>> functions, and it >>> >seems a little bit disrupt with catalog. As a result, you have to make >>> some restriction >>> >like "hive built-in functions can only be used when current catalog is >>> hive catalog". >>> >>> Yes, I've unified #3 and #4 but it seems I didn't update some part of >>> the doc. I've modified those sections, and they are up to date now. >>> >>> In short, now built-in function of external systems are defined as a >>> special kind of catalog function in Flink, and handled by Flink as >>> following: >>> - An external built-in function must be associated with a catalog for >>> the purpose of decoupling flink-table and external systems. >>> - It always resides in front of catalog functions in ambiguous function >>> reference order, just like in its own external system >>> - It is a special catalog function that doesn’t have a schema/database >>> namespace >>> - It goes thru the same instantiation logic as other user defined >>> catalog functions in the external system >>> >>> Please take another look at the doc, and let me know if you have more >>> questions. >>> >>> >>> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <twal...@apache.org> wrote: >>> >>>> Hi Kurt, >>>> >>>> it should not affect the functions and operations we currently have in >>>> SQL. It just categorizes the available built-in functions. It is kind >>>> of >>>> an orthogonal concept to the catalog API but built-in functions deserve >>>> this special kind of treatment. CatalogFunction still fits perfectly in >>>> there because the regular catalog object resolution logic is not >>>> affected. So tables and functions are resolved in the same way but with >>>> built-in functions that have priority as in the original design. >>>> >>>> Regards, >>>> Timo >>>> >>>> >>>> On 03.09.19 15:26, Kurt Young wrote: >>>> > Does this only affect the functions and operations we currently have >>>> in SQL >>>> > and >>>> > have no effect on tables, right? Looks like this is an orthogonal >>>> concept >>>> > with Catalog? >>>> > If the answer are both yes, then the catalog function will be a weird >>>> > concept? >>>> > >>>> > Best, >>>> > Kurt >>>> > >>>> > >>>> > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <yuzhao....@gmail.com> >>>> wrote: >>>> > >>>> >> The way you proposed are basically the same as what Calcite does, I >>>> think >>>> >> we are in the same line. >>>> >> >>>> >> Best, >>>> >> Danny Chan >>>> >> 在 2019年9月3日 +0800 PM7:57,Timo Walther <twal...@apache.org>,写道: >>>> >>> This sounds exactly as the module approach I mentioned, no? >>>> >>> >>>> >>> Regards, >>>> >>> Timo >>>> >>> >>>> >>> On 03.09.19 13:42, Danny Chan wrote: >>>> >>>> Thanks Bowen for bring up this topic, I think it’s a useful >>>> >> refactoring to make our function usage more user friendly. >>>> >>>> For the topic of how to organize the builtin operators and >>>> operators >>>> >> of Hive, here is a solution from Apache Calcite, the Calcite way is >>>> to make >>>> >> every dialect operators a “Library”, user can specify which >>>> libraries they >>>> >> want to use for a sql query. The builtin operators always comes as >>>> the >>>> >> first class objects and the others are used from the order they >>>> appears. >>>> >> Maybe you can take a reference. >>>> >>>> [1] >>>> >> >>>> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 >>>> >>>> Best, >>>> >>>> Danny Chan >>>> >>>> 在 2019年8月28日 +0800 AM2:50,Bowen Li <bowenl...@gmail.com>,写道: >>>> >>>>> Hi folks, >>>> >>>>> >>>> >>>>> I'd like to kick off a discussion on reworking Flink's >>>> >> FunctionCatalog. >>>> >>>>> It's critically helpful to improve function usability in SQL. >>>> >>>>> >>>> >>>>> >>>> >> >>>> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing >>>> >>>>> In short, it: >>>> >>>>> - adds support for precise function reference with fully/partially >>>> >>>>> qualified name >>>> >>>>> - redefines function resolution order for ambiguous function >>>> >> reference >>>> >>>>> - adds support for Hive's rich built-in functions (support for >>>> Hive >>>> >> user >>>> >>>>> defined functions was already added in 1.9.0) >>>> >>>>> - clarifies the concept of temporary functions >>>> >>>>> >>>> >>>>> Would love to hear your thoughts. >>>> >>>>> >>>> >>>>> Bowen >>>> >>> >>>> >>>>