Hi,

I agree with Xuefu that the main controversial points are mainly the two
places. My thoughts on them:

1) Determinism of referencing Hive built-in functions. We can either remove
Hive built-in functions from ambiguous function resolution and require
users to use special syntax for their qualified names, or add a config flag
to catalog constructor/yaml for turning on and off Hive built-in functions
with the flag set to 'false' by default and proper doc added to help users
make their decisions.

2) Flink temp functions v.s. Flink built-in functions in ambiguous function
resolution order. We believe Flink temp functions should precede Flink
built-in functions, and I have presented my reasons. Just in case if we
cannot reach an agreement, I propose forbid users registering temp
functions in the same name as a built-in function, like MySQL's approach,
for the moment. It won't have any performance concern, since built-in
functions are all in memory and thus cost of a name check will be really
trivial.


On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <usxu...@gmail.com> wrote:

> From what I have seen, there are a couple of focal disagreements:
>
> 1. Resolution order: temp function --> flink built-in function --> catalog
> function vs flink built-in function --> temp function -> catalog function.
> 2. "External" built-in functions: how to treat built-in functions in
> external system and how users reference them
>
> For #1, I agree with Bowen that temp function needs to be at the highest
> priority because that's how a user might overwrite a built-in function
> without referencing a persistent, overwriting catalog function with a fully
> qualified name. Putting built-in functions at the highest priority
> eliminates that usage.
>
> For #2, I saw a general agreement on referencing "external" built-in
> functions such as those in Hive needs to be explicit and deterministic even
> though different approaches are proposed. To limit the scope and simply the
> usage, it seems making sense to me to introduce special syntax for user  to
> explicitly reference an external built-in function such as hive1::sqrt or
> hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API call
> hive1.getFunction(ObjectPath functionName) where the database name is
> absent for bulit-in functions available in that catalog hive1. I understand
> that Bowen's original proposal was trying to avoid this, but this could
> turn out to be a clean and simple solution.
>
> (Timo's modular approach is great way to "expand" Flink's built-in function
> set, which seems orthogonal and complementary to this, which could be
> tackled in further future work.)
>
> I'd be happy to hear further thoughts on the two points.
>
> Thanks,
> Xuefu
>
> On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <ykt...@gmail.com> wrote:
>
> > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is the
> > same
> > as Bowen's. But after thinking about it, I'm currently lean to Timo's
> > suggestion.
> >
> > The reason is backward compatibility. If we follow Bowen's approach,
> let's
> > say we
> > first find function in Flink's built-in functions, and then hive's
> > built-in. For example, `foo`
> > is not supported by Flink, but hive has such built-in function. So user
> > will have hive's
> > behavior for function `foo`. And in next release, Flink realize this is a
> > very popular function
> > and add it into Flink's built-in functions, but with different behavior
> as
> > hive's. So in next
> > release, the behavior changes.
> >
> > With Timo's approach, IIUC user have to tell the framework explicitly
> what
> > kind of
> > built-in functions he would like to use. He can just tell framework to
> > abandon Flink's built-in
> > functions, and use hive's instead. User can only choose between them, but
> > not use
> > them at the same time. I think this approach is more predictable.
> >
> > Best,
> > Kurt
> >
> >
> > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <bowenl...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > Thanks for the feedback. Just a kindly reminder that the [Proposal]
> > section
> > > in the google doc was updated, please take a look first and let me know
> > if
> > > you have more questions.
> > >
> > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <bowenl...@gmail.com> wrote:
> > >
> > > > Hi Timo,
> > > >
> > > > Re> 1) We should not have the restriction "hive built-in functions
> can
> > > > only
> > > > > be used when current catalog is hive catalog". Switching a catalog
> > > > > should only have implications on the cat.db.object resolution but
> not
> > > > > functions. It would be quite convinient for users to use Hive
> > built-ins
> > > > > even if they use a Confluent schema registry or just the in-memory
> > > > catalog.
> > > >
> > > > There might be a misunderstanding here.
> > > >
> > > > First of all, Hive built-in functions are not part of Flink built-in
> > > > functions, they are catalog functions, thus if the current catalog is
> > > not a
> > > > HiveCatalog but, say, a schema registry catalog, ambiguous functions
> > > > reference just shouldn't be resolved to a different catalog.
> > > >
> > > > Second, Hive built-in functions can potentially be referenced across
> > > > catalog, but it doesn't have db namespace and we currently just don't
> > > have
> > > > a SQL syntax for it. It can be enabled when such a SQL syntax is
> > defined,
> > > > e.g. "catalog::function", but it's out of scope of this FLIP.
> > > >
> > > > 2) I would propose to have separate concepts for catalog and built-in
> > > > functions. In particular it would be nice to modularize built-in
> > > > functions. Some built-in functions are very crucial (like AS, CAST,
> > > > MINUS), others are more optional but stable (MD5, CONCAT_WS), and
> maybe
> > > > we add more experimental functions in the future or function for some
> > > > special application area (Geo functions, ML functions). A data
> platform
> > > > team might not want to make every built-in function available. Or a
> > > > function module like ML functions is in a different Maven module.
> > > >
> > > > I think this is orthogonal to this FLIP, especially we don't have the
> > > > "external built-in functions" anymore and currently the built-in
> > function
> > > > category remains untouched.
> > > >
> > > > But just to share some thoughts on the proposal, I'm not sure about
> it:
> > > > - I don't know if any other databases handle built-in functions like
> > > that.
> > > > Maybe you can give some examples? IMHO, built-in functions are system
> > > info
> > > > and should be deterministic, not depending on loaded libraries. Geo
> > > > functions should be either built-in already or just libraries
> > functions,
> > > > and library functions can be adapted to catalog APIs or of some other
> > > > syntax to use
> > > > - I don't know if all use cases stand, and many can be achieved by
> > other
> > > > approaches too. E.g. experimental functions can be taken good care of
> > by
> > > > documentations, annotations, etc
> > > > - the proposal basically introduces some concept like a pluggable
> > > built-in
> > > > function catalog, despite the already existing catalog APIs
> > > > - it brings in even more complicated scenarios to the design. E.g.
> how
> > do
> > > > you handle built-in functions in different modules but different
> names?
> > > >
> > > > In short, I'm not sure if it really stands and it looks like an
> > overkill
> > > > to me. I'd rather not go to that route. Related discussion can be on
> > its
> > > > own thread.
> > > >
> > > > 3) Following the suggestion above, we can have a separate discovery
> > > > mechanism for built-in functions. Instead of just going through a
> > static
> > > > list like in BuiltInFunctionDefinitions, a platform team should be
> able
> > > > to select function modules like
> > > > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
> > > > HiveFunctions) or via service discovery;
> > > >
> > > > Same as above. I'll leave it to its own thread.
> > > >
> > > > re > 3) Dawid and I discussed the resulution order again. I agree
> with
> > > > Kurt
> > > > > that we should unify built-in function (external or internal)
> under a
> > > > > common layer. However, the resolution order should be:
> > > > >   1. built-in functions
> > > > >   2. temporary functions
> > > > >   3. regular catalog resolution logic
> > > > > Otherwise a temporary function could cause clashes with Flink's
> > > built-in
> > > > > functions. If you take a look at other vendors, like SQL Server
> they
> > > > > also do not allow to overwrite built-in functions.
> > > >
> > > > ”I agree with Kurt that we should unify built-in function (external
> or
> > > > internal) under a common layer.“ <- I don't think this is what Kurt
> > > means.
> > > > Kurt and I are in favor of unifying built-in functions of external
> > > systems
> > > > and catalog functions. Did you type a mistake?
> > > >
> > > > Besides, I'm not sure about the resolution order you proposed.
> > Temporary
> > > > functions have a lifespan over a session and are only visible to the
> > > > session owner, they are unique to each user, and users create them on
> > > > purpose to be the highest priority in order to overwrite system info
> > > > (built-in functions in this case).
> > > >
> > > > In your case, why would users name a temporary function the same as a
> > > > built-in function then? Since using that name in ambiguous function
> > > > reference will always be resolved to built-in functions, creating a
> > > > same-named temp function would be meaningless in the end.
> > > >
> > > >
> > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <bowenl...@gmail.com> wrote:
> > > >
> > > >> Hi Jingsong,
> > > >>
> > > >> Re> 1.Hive built-in functions is an intermediate solution. So we
> > should
> > > >> > not introduce interfaces to influence the framework. To make
> > > >> > Flink itself more powerful, we should implement the functions
> > > >> > we need to add.
> > > >>
> > > >> Yes, please see the doc.
> > > >>
> > > >> Re> 2.Non-flink built-in functions are easy for users to change
> their
> > > >> > behavior. If we support some flink built-in functions in the
> > > >> > future but act differently from non-flink built-in, this will lead
> > to
> > > >> > changes in user behavior.
> > > >>
> > > >> There's no such concept as "external built-in functions" any more.
> > > >> Built-in functions of external systems will be treated as special
> > > catalog
> > > >> functions.
> > > >>
> > > >> Re> Another question is, does this fallback include all
> > > >> > hive built-in functions? As far as I know, some hive functions
> > > >> > have some hacky. If possible, can we start with a white list?
> > > >> > Once we implement some functions to flink built-in, we can
> > > >> > also update the whitelist.
> > > >>
> > > >> Yes, that's something we thought of too. I don't think it's super
> > > >> critical to the scope of this FLIP, thus I'd like to leave it to
> > future
> > > >> efforts as a nice-to-have feature.
> > > >>
> > > >>
> > > >> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <bowenl...@gmail.com>
> wrote:
> > > >>
> > > >>> Hi Kurt,
> > > >>>
> > > >>> Re: > What I want to propose is we can merge #3 and #4, make them
> > both
> > > >>> under
> > > >>> >"catalog" concept, by extending catalog function to make it have
> > > >>> ability to
> > > >>> >have built-in catalog functions. Some benefits I can see from this
> > > >>> approach:
> > > >>> >1. We don't have to introduce new concept like external built-in
> > > >>> functions.
> > > >>> >Actually I don't see a full story about how to treat a built-in
> > > >>> functions, and it
> > > >>> >seems a little bit disrupt with catalog. As a result, you have to
> > make
> > > >>> some restriction
> > > >>> >like "hive built-in functions can only be used when current
> catalog
> > is
> > > >>> hive catalog".
> > > >>>
> > > >>> Yes, I've unified #3 and #4 but it seems I didn't update some part
> of
> > > >>> the doc. I've modified those sections, and they are up to date now.
> > > >>>
> > > >>> In short, now built-in function of external systems are defined as
> a
> > > >>> special kind of catalog function in Flink, and handled by Flink as
> > > >>> following:
> > > >>> - An external built-in function must be associated with a catalog
> for
> > > >>> the purpose of decoupling flink-table and external systems.
> > > >>> - It always resides in front of catalog functions in ambiguous
> > function
> > > >>> reference order, just like in its own external system
> > > >>> - It is a special catalog function that doesn’t have a
> > schema/database
> > > >>> namespace
> > > >>> - It goes thru the same instantiation logic as other user defined
> > > >>> catalog functions in the external system
> > > >>>
> > > >>> Please take another look at the doc, and let me know if you have
> more
> > > >>> questions.
> > > >>>
> > > >>>
> > > >>> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <twal...@apache.org>
> > > wrote:
> > > >>>
> > > >>>> Hi Kurt,
> > > >>>>
> > > >>>> it should not affect the functions and operations we currently
> have
> > in
> > > >>>> SQL. It just categorizes the available built-in functions. It is
> > kind
> > > >>>> of
> > > >>>> an orthogonal concept to the catalog API but built-in functions
> > > deserve
> > > >>>> this special kind of treatment. CatalogFunction still fits
> perfectly
> > > in
> > > >>>> there because the regular catalog object resolution logic is not
> > > >>>> affected. So tables and functions are resolved in the same way but
> > > with
> > > >>>> built-in functions that have priority as in the original design.
> > > >>>>
> > > >>>> Regards,
> > > >>>> Timo
> > > >>>>
> > > >>>>
> > > >>>> On 03.09.19 15:26, Kurt Young wrote:
> > > >>>> > Does this only affect the functions and operations we currently
> > have
> > > >>>> in SQL
> > > >>>> > and
> > > >>>> > have no effect on tables, right? Looks like this is an
> orthogonal
> > > >>>> concept
> > > >>>> > with Catalog?
> > > >>>> > If the answer are both yes, then the catalog function will be a
> > > weird
> > > >>>> > concept?
> > > >>>> >
> > > >>>> > Best,
> > > >>>> > Kurt
> > > >>>> >
> > > >>>> >
> > > >>>> > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <yuzhao....@gmail.com
> >
> > > >>>> wrote:
> > > >>>> >
> > > >>>> >> The way you proposed are basically the same as what Calcite
> > does, I
> > > >>>> think
> > > >>>> >> we are in the same line.
> > > >>>> >>
> > > >>>> >> Best,
> > > >>>> >> Danny Chan
> > > >>>> >> 在 2019年9月3日 +0800 PM7:57,Timo Walther <twal...@apache.org>,写道:
> > > >>>> >>> This sounds exactly as the module approach I mentioned, no?
> > > >>>> >>>
> > > >>>> >>> Regards,
> > > >>>> >>> Timo
> > > >>>> >>>
> > > >>>> >>> On 03.09.19 13:42, Danny Chan wrote:
> > > >>>> >>>> Thanks Bowen for bring up this topic, I think it’s a useful
> > > >>>> >> refactoring to make our function usage more user friendly.
> > > >>>> >>>> For the topic of how to organize the builtin operators and
> > > >>>> operators
> > > >>>> >> of Hive, here is a solution from Apache Calcite, the Calcite
> way
> > is
> > > >>>> to make
> > > >>>> >> every dialect operators a “Library”, user can specify which
> > > >>>> libraries they
> > > >>>> >> want to use for a sql query. The builtin operators always comes
> > as
> > > >>>> the
> > > >>>> >> first class objects and the others are used from the order they
> > > >>>> appears.
> > > >>>> >> Maybe you can take a reference.
> > > >>>> >>>> [1]
> > > >>>> >>
> > > >>>>
> > >
> >
> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
> > > >>>> >>>> Best,
> > > >>>> >>>> Danny Chan
> > > >>>> >>>> 在 2019年8月28日 +0800 AM2:50,Bowen Li <bowenl...@gmail.com>,写道:
> > > >>>> >>>>> Hi folks,
> > > >>>> >>>>>
> > > >>>> >>>>> I'd like to kick off a discussion on reworking Flink's
> > > >>>> >> FunctionCatalog.
> > > >>>> >>>>> It's critically helpful to improve function usability in
> SQL.
> > > >>>> >>>>>
> > > >>>> >>>>>
> > > >>>> >>
> > > >>>>
> > >
> >
> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
> > > >>>> >>>>> In short, it:
> > > >>>> >>>>> - adds support for precise function reference with
> > > fully/partially
> > > >>>> >>>>> qualified name
> > > >>>> >>>>> - redefines function resolution order for ambiguous function
> > > >>>> >> reference
> > > >>>> >>>>> - adds support for Hive's rich built-in functions (support
> for
> > > >>>> Hive
> > > >>>> >> user
> > > >>>> >>>>> defined functions was already added in 1.9.0)
> > > >>>> >>>>> - clarifies the concept of temporary functions
> > > >>>> >>>>>
> > > >>>> >>>>> Would love to hear your thoughts.
> > > >>>> >>>>>
> > > >>>> >>>>> Bowen
> > > >>>> >>>
> > > >>>>
> > > >>>>
> > >
> >
>
>
> --
> Xuefu Zhang
>
> "In Honey We Trust!"
>

Reply via email to