Re: [VOTE] FLIP-57: Rework FunctionCatalog

Timo Walther Fri, 04 Oct 2019 04:52:05 -0700

Hi,

I agree with Aljoscha. It is not transparent to me which votes arebinding to the current status of the FLIP.


Some other minor comments from my side:

- We don't need to deprecate methods in FunctionCatalog. This class isinternal. We can simply change the method signatures.- `String name` is missing in the FunctionIdentifier code example; canwe call FunctionIdentifier.getSimpleName() justFunctionIdentifier.getName()?- Add the methods that we discussed to the example: `of(String)`,`of(ObjectIdentifier)`


Other than that, I'm happy to give my +1 to this proposal.

Thanks for the productive discussion,
Timo


On 04.10.19 13:29, Aljoscha Krettek wrote:

Hi,

I see there was quite some discussion and changes on the FLIP after this VOTE 
was started. I would suggest to start a new voting thread on the current state 
of the FLIP (keeping in mind that a FLIP vote needs at least three 
committer/PMC votes).

For the future, we should probably keep discussion to the [DISCUSS] thread and 
use the vote thread only for voting.

Best,
Aljoscha

On 3. Oct 2019, at 21:17, Bowen Li <bowenl...@gmail.com> wrote:

I'm glad to announce that the community has accepted the design of FLIP-57,
and we are moving forward to implementing it.

Thanks everyone!

On Wed, Oct 2, 2019 at 11:01 AM Bowen Li <bowenl...@gmail.com> wrote:

Introducing a new term "path" to APIs like
"getShortPath(Identifier)/getLongPath(Identifier)" would be confusing to
users, thus I feel "getSimpleName/getIdentifier" is fine.

To summarize the discussion result.

   - categorize functions into 2 dimensions - system v.s. catalog,
   non-temp v.s. temp - and that give us 4 combinations
   - definition of FunctionIdentifier

                     @PublicEvolving

Class FunctionIdentifier {

    String name;

    ObjectIdentifier oi;

        // for temporary/non-temporary system function
    public FunctionIdentifier of(String name) {  }
    // for temporary/non-temporary catalog function
    public FunctionIdentifier of(ObjectIdentifier identifier) {  }


    Optional<ObjectIdentifier> getIdentifier() {}

    Optional<String> getSimpleName() {}

}


I've updated them to FLIP wiki. Please take a final look. I'll close the
voting if there's no other concern raised within 24 hours.

Cheers

On Wed, Oct 2, 2019 at 4:54 AM Dawid Wysakowicz <dwysakow...@apache.org>
wrote:

Hi,

I very much agree with Xuefu's summary of the two points, especially on
the "functionIdentifier doesn't need to reflect the categories".

For the factory methods I think methods of should be enough:

  // for temporary/non-temporary system function
    public FunctionIdentifier of(String name) {  }
  // for temporary/non-temporary catalog function
    public FunctionIdentifier of(ObjectIdentifier identifier){  }

In case of the getters I did not like the method name `getName` in the
original proposal, as in my opinion it could imply that it can return
also just the name part of an ObjectIdentifier, which should not be the
case.

I'm fine with getSimpleName/getIdentifier, but want to throw in few
other suggestions:

* getShortPath(Identifier)/getLongPath(Identifier),

* getSystemPath(Identifier)/getCatalogPath(Identifier)

+1 to any of the 3 options.

One additional thing the FunctionIdentifier should be a PublicEvolving
class, as it is part of a PublicEvolving APIs e.g. CallExpression, which
user might need to access e.g. in a filter pushdown.

I also support the Xuefu's suggestion not to support the "ALL" keyword
in the "SHOW [TEMPORARY] FUNCTIONS" statement, but as the exact design
of it  is not part of the FLIP-57, we do not need to agree on that in
this thread.

Overall I think after updating the FLIP with the outcome of the
discussion I vote +1 for it.

Best,

Dawid


On 02/10/2019 00:21, Xuefu Z wrote:

Here are some of my thoughts on the minor debates above:

1. +1 for 4 categories of functions. They are categorized along two
dimensions of binary values: X: *temporary* vs non-temporary

(persistent);

Y: *system* vs non-system (so said catalog).
2. In my opinion, class functionIdentifier doesn't really need to

reflect

the categories of the functions. Instead, we should decouple them to

make

the API more stable. Thus, my suggestion is:

@Internal
class FunctionIdentifier {
  // for temporary/non-temporary system function
    public FunctionIdentifier ofSimpleName(String name) {  }
  // for temporary/non-temporary catalog function
    public FunctionIdentifier ofIdentifier(ObjectIdentifier
identifier){  }
    public Optional<String> getSimpleName() {  }
    public Optional<ObjectIdentifier> getIdentifier() {  }
}
3. DDLs -- I don't think we need "ALL" keyword. The grammar can just be:

SHOW [TEMPORARY] [SYSTEM] FUNCTIONS.

When either keyword is missing, "ALL" is implied along that dimension.

We

should always limit the search to the system function catalog and the
current catalog/DB. I don't see a need of listing functions across
different catalogs and databases. (It can be added later if that

arises.)

Thanks,
Xuefu

On Tue, Oct 1, 2019 at 11:12 AM Bowen Li <bowenl...@gmail.com> wrote:

Hi Dawid,

Thanks for bringing the suggestions up. I was prototyping yesterday and
found out those places exactly as what you suggested.

For CallExpression and UnresolvedCallExpression, I've added them to
FLIP-57. We will replace ObjectIdentifier with FunctionIdentifier and

mark

that as a breaking change

For FunctionIdentifier, the suggested changes LGTM. Just want to bring

up

an issue on naming. It seems to me how we now name functions

categories is

a bit unclear and confusing, which is reflected on the suggested APIs

- in

FunctionIdentifier you lay out, "builtin function" would include

builtin

functions and temporary system functions as we are kind of using

"system"

and "built-in" interchangeably, and "catalog function" would include
catalog functions and temporary functions. I currently can see two
approaches to make it clearer to users.

1) Simplify FunctionIdentifier to be the following. As it's internal,

we

add comments and explanation to devs on which case the APIs support.
However, I feel this approach would be somehow a bit conflicting to

what

you want to achieve for the clarity of APIs

@Internal
class FunctionIdentifier {
  // for built-in function and temporary system function
    public FunctionIdentifier of(String name) {  }
  // for temporary function and catalog function
    public FunctionIdentifier of(ObjectIdentifier identifier){  }
    public Optional<String> getFunctionName() {  }
    public Optional<ObjectIdentifier> getObjectIdentifier() {  }
}

2) We can rename our function categories as following so there'll be

mainly

just two categories of functions, "system functions" and "catalog
functions", either of which can have temporary ones

  - built-in functions -> officially rename to "system functions" and

note

to users that "system" and "built-in" can be used interchangeably. We
prefer "system" because that's the keyword we decided to use in DDL

that

creates its temporary peers ("CREATE TEMPORARY SYSTEM FUNCTION")
  - temporary system functions
  - catalog functions
  - temporary functions  -> rename to "temporary catalog function"

@Internal
class FunctionIdentifier {
  // for temporary/non-temporary system function
    public FunctionIdentifier ofSystemFunction(String name) {  }
  // for temporary/non-temporary catalog function
    public FunctionIdentifier ofCatalogFunction(ObjectIdentifier
identifier){  }
    public Optional<String> getSystemFunctionName() {  }
    public Optional<ObjectIdentifier> getCatalogFunctionIdentifier()

{  }

}

WDYT?


On Tue, Oct 1, 2019 at 5:48 AM Fabian Hueske <fhue...@gmail.com>

wrote:

Thanks for the summary Bowen!

Looks good to me.

Cheers,
Fabian

Am Mo., 30. Sept. 2019 um 23:24 Uhr schrieb Bowen Li <

bowenl...@gmail.com

:
Hi all,

I've updated the FLIP wiki with the following changes:

- Lifespan of temp functions are not tied to those of catalogs and
databases. Users can create temp functions even though catalogs/dbs

in

their fully qualified names don't even exist.
- some new SQL commands
    - "SHOW FUNCTIONS" - list names of temp and non-temp

system/built-in

functions, and names of temp and catalog functions in the current

catalog

and db
    - "SHOW ALL FUNCTIONS" - list names of temp and non-temp

system/built

functions, and fully qualified names of temp and catalog functions in

all

catalogs and dbs
    - "SHOW ALL TEMPORARY FUNCTIONS" - list fully qualified names of

temp

functions in all catalog and db
    - "SHOW ALL TEMPORARY SYSTEM FUNCTIONS" - list names of all temp

system

functions

Let me know if you have any questions.

Seems we have resolved all concerns. If there's no more ones, I'd

like

to

close the vote by this time tomorrow.

Cheers,
Bowen

On Mon, Sep 30, 2019 at 11:59 AM Bowen Li <bowenl...@gmail.com>

wrote:

Hi,

I think above are some valid points, and we can adopt the

suggestions.

To elaborate a bit on the new SQL syntax, it would imply that,

unlike

"SHOW FUNCTION" which only return function names, "SHOW ALL

[TEMPORARY]

FUNCTIONS" would return functions' fully qualified names with

catalog

and

db names.



On Mon, Sep 30, 2019 at 6:38 AM Timo Walther <twal...@apache.org>

wrote:

Hi all,

I support Fabian's arguments. In my opinion, temporary objects

should

just be an additional layer on top of the regular catalog/database
lookup logic. Thus, a temporary table or function has always

highest

precedence and should be stable within the local session. Otherwise

it

could magically disappear while someone else is performing

modifications

in the catalog.

Furthermore, this feature is very useful for prototyping as users

can

simply express that a catalog/database is present even through they
might not have access to it currently.

Regards,
Timo


On 30.09.19 14:57, Fabian Hueske wrote:

Hi all,

Sorry for the late reply.

I think it might lead to confusing situations if temporary

functions

(or

any temporary db objects for that matter) are bound to the life

cycle

of an

(external) db/catalog.
Imaging a situation where you create a temp function in a db in an

external

catalog and use it but at some point it does not work anymore

because

some

other dropped the database from the external catalog.
Shouldn't temporary objects be only controlled by the owner of a

session?

I agree that creating temp objects in non-existing db/catalogs

sounds

bit

strange, but IMO the opposite (the db/catalog must exist for a

temp

function to be created/exist) can have significant implications

like

the

one I described.
I think it would be quite easy for users to understand that

temporary

objects are solely owned by them (and their session).
The problem of listing temporary objects could be solved by

adding a

ALL

[TEMPORARY] clause:

SHOW ALL FUNCTIONS; could show all functions regardless of the
catalog/database including temporary functions.
SHOW ALL TEMPORARY FUNCTIONS; could show all temporary functions

regardless

of the catalog/database.

Best,
Fabian

Am Sa., 28. Sept. 2019 um 02:21 Uhr schrieb Bowen Li <

bowenl...@gmail.com>:

@Dawid, do you have any other concerns? If not, I hope we can

close

the

voting.


On Thu, Sep 26, 2019 at 8:14 PM Rui Li <lirui.fu...@gmail.com>

wrote:

I'm not sure how much benefit #1 can bring us. If users just

want

to

try

out temporary functions, they can create temporary system

functions

which

don't require a catalog/DB. IIUC, the main reason why we allow

temporary

catalog function is to let users override permanent catalog

functions.

Therefore a temporary function in a non-existing catalog won't

serve

that

purpose. Besides, each session is provided with a default

catalog

and

DB.

So even if users simply want to create some catalog functions

they

can

forget about after the session, wouldn't the default catalog/DB

be

enough

for such experiments?

On Thu, Sep 26, 2019 at 4:38 AM Bowen Li <bowenl...@gmail.com>

wrote:

Re 1) As described in the FLIP, a temp function lookup will

first

make

sure

the db exists. If the db doesn't exist, a lazy drop is

triggered

to

remove

that temp function.

I agree Hive doesn't handle it consistently, and we are not

copying

Hive.

IMHO, allowing registering temp functions in nonexistent

catalog/db

is

hacky and problematic. For instance, "SHOW FUNCTIONS" would

list

system

functions and functions in the current catalog/db, since users

cannot

designate a nonexistent catalog/db as current ones, how can

they

list

functions in nonexistent catalog/db? They may end up never

knowing

what

temp functions they've created unless trying out with queries

or

we

introducing some more nonstandard SQL statements. The same

applies

to

other

temp objects like temp tables.

Re 2) A standalone FunctionIdentifier sounds good to me

On Wed, Sep 25, 2019 at 4:46 AM Dawid Wysakowicz <
wysakowicz.da...@gmail.com>
wrote:

Ad. 1
I wouldn't say it is hacky.
Moreover, how do you want ensure that the dB always exists

when

temporary

object is used?( in this particular case function). Do you

want

to

query

for the database existence whenever e.g a temporary function

is

used? I

think important aspect here is that the database can be

dropped

from

external system, not just flink or a different flink session.

E.g in case of hive, you cannot create a temporary table in a

database

that

does not exist, that's true. But if you create a temporary

table

in

database and drop that database from a different session, you

can

still

query the previously created temporary table from the original

session.

It

does not sound like a consistent behaviour to me. Why don't we

make

this

behaviour of not binding a temporary objects to the lifetime

of

database

explicit as part of the temporary objects contract? In the end

they

exist

in different layers. Permanent objects & databases in a

catalog

(in

case

of

hive megastore) whereas temporary objects in flink sessions.

That's

also

true for the original hive client. The temporary objects live

in

the

hive

client whereas databases are created in the metastore.

Ad.2
I'm open for suggestions here. The one thing I wanted to

achieve

here

is

so

that we do not change the contract of ObjectIdentifier. One

important

thing

to remember here is that we need the function identifier to be

part

of

the

FunctionDefinition object and not only as the result of the

function

lookup. At some point we want to be able to store

QueryOperations

in

the

catalogs. They can contain function calls within which we need

to

have

the

identifier.

I agree my initial suggestion is over complicated. How about

we

have

just

the FunctionIdentifier as top level class without making the
ObjectIdentifier extend from it? I think it's pretty much the

same

what

you

suggested. The only difference is that it would be a top level

class

with a

more descriptive name.


On Wed, 25 Sep 2019, 13:57 Bowen Li, <bowenl...@gmail.com>

wrote:

Sorry, I missed some parts of the solution. The complete

alternative

is

the

following, basically having separate APIs in FunctionLookup

for

ambiguous

and precise function lookup since planner is able to tell

which

API

to

call

with parsed queries, and have a unified result:

```
class FunctionLookup {

Optional<Result> lookupAmbiguousFunction(String name);


Optional<Result> lookupPreciseFunction(ObjectIdentifier oi);


class Result {
     Optional<ObjectIdentifier>  getObjectIdentifier() { ...

     String getName() { ... }
     // ...
}

}
```

Thanks,
Bowen


On Tue, Sep 24, 2019 at 9:42 PM Bowen Li <

bowenl...@gmail.com>

wrote:

Hi Dawid,

Re 1): I agree making it easy for users to run experiments

is

important.

However, I'm not sure allowing users to register temp

functions

in

nonexistent catalog/db is the optimal way. It seems a bit

hacky,

and

breaks

the current contract between Flink and users that catalog/db

must

be

valid

in order to operate on.

How about we instead focus on making it convenient to create

catalogs?

Users actually can already do it with ease via program or

SQL

CLI

yaml

file

for an in-memory catalog which has neither extra dependency

nor

external

connections. What we can further improve is DDL for

catalogs,

and I

raised

it in discussion of [FLIP 69 - Flink SQL DDL Enhancement]

driven

by

Terry

now.

In that case, if users would like to experiment via SQL,

they

can

easily

create an in memory catalog/database using DDL, then play

with

temp

functions.

Re 2): For the assumption, IIUIC, function ObjectIdentifier

has

not

been

resolved when stack call reaches

FunctionCatalog#lookupFunction(),

but

only

been parsed?

I agree keeping ObjectIdentifier as-is would be good. I'm ok

with

the

suggested classes, though making ObjectIdentifier a subclass

of

FunctionIdentifier seem a bit counter intuitive.

Another potentially simpler way is:

```
// in class FunctionLookup
class Result {
     Optional<ObjectIdentifier>  getObjectIdentifier() {

... }

     String getName() { ... }
     ...
}
```

WDYT?



On Tue, Sep 24, 2019 at 3:41 PM Dawid Wysakowicz <
wysakowicz.da...@gmail.com> wrote:

Hi,
I really like the flip and think it clarifies important

aspects

of

the

system.

I have two, I hope small suggestions, which will not take

much

time

to

agree on.

1. Could we follow the MySQL approach in regards to the

existence

of

cat/db
for temporary functions? That means not to check it, so

e.g.

it's

possible

to create a temporary function in a database that does not

exist.

think

it's really useful e.g in cases when user wants to perform

experiments

but

does not have access to the db yet or temporarily does not

have

connection

to a catalog.
2. Could we not change the ObjectIdentifier? Could we not

loosen

the

requirements for all catalog objects such as tables, views,

types

just

for

the functions? It's really important later on from e.g the

serializability

perspective. The important aspect of the ObjectIdentifier

is

that

we

know

that it has been resolved. The suggested changes break that

assumption.

What do you think about adding an interface

FunctionIdentifier

String getName();

/**
   Return 3-part identifier. Empty in case of a built-in

function.

*/
Optional<ObjectIdentifier> getObjectIdentifier()
}

class ObjectIdentifier implements FunctionIdentifier {
Optional<ObjectIdentifier> getObjectIdentifier() {
  return Optional.of(this);
}
}

class SystemFunctionIdentifier implements

FunctionIdentifier

{...}

WDYT?

On Wed, 25 Sep 2019, 04:50 Xuefu Z, <usxu...@gmail.com>

wrote:

+1. LGTM

On Tue, Sep 24, 2019 at 6:09 AM Terry Wang <

zjuwa...@gmail.com>

wrote:

+1

Best,
Terry Wang

在 2019年9月24日，上午10:42，Kurt Young <ykt...@gmail.com> 写道：

+1

Best,
Kurt


On Tue, Sep 24, 2019 at 2:30 AM Bowen Li <

bowenl...@gmail.com

wrote:

Hi all,

I'd like to start a voting thread for FLIP-57 [1],

which

we've

reached

consensus in [2].

This voting will be open for minimum 3 days till 6:30pm

UTC,

Sep

26.

Thanks,
Bowen

[1]

https://cwiki.apache.org/confluence/display/FLINK/FLIP-57%3A+Rework+FunctionCatalog

[2]

http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-57-Rework-FunctionCatalog-td32291.html#a32613

--
Xuefu Zhang

"In Honey We Trust!"

--
Best regards!
Rui Li

Re: [VOTE] FLIP-57: Rework FunctionCatalog

Reply via email to