Hi all,

During the discussion of how to support Hive built-in functions in Flink in
FLIP-57 [1], an idea of "modular built-in functions" was brought up with
examples of "Extension" in Postgres [2] and "Plugin" in Presto [3]. Thus
I'd like to kick off a discussion to see if we should adopt such an
approach.

Let me try to summarize the basics of the idea:
    - functions from modules (e.g. Geo, ML) can be loaded into Flink as
built-in functions
    - modules can be configured in order, and discovered using SPI or set via
code like "catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
HiveFunctions)" (a sketch of what such an API could look like follows this
list)
    - built-in functions from external systems, like Hive, can be packaged
into such a module
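
To make the discussion concrete, here is a minimal sketch of what such a
module API could look like. All the names (FunctionModule,
getFunctionDefinitions, GeoModule, HiveModule, etc.) are hypothetical and
just for illustration:

    import java.util.Map;
    import org.apache.flink.table.functions.FunctionDefinition;

    // Hypothetical SPI interface - a module contributes a set of functions
    // that Flink would treat as built-in functions.
    public interface FunctionModule {
        // Function name -> definition of the functions this module provides.
        Map<String, FunctionDefinition> getFunctionDefinitions();
    }

    // Modules are specified in order; that order would matter for name
    // resolution (see the concern below).
    catalogManager.setFunctionModules(
        new CoreModule(),    // Flink's own built-in functions
        new GeoModule(),     // hypothetical Geo functions
        new HiveModule());   // Hive built-in functions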

I took some time to research Presto's Plugin and Postgres's Extension, and
here are some of my findings.

Presto:
    - "Presto's Catalog associated with a connector, and a catalog only
contains schemas and references a data source via a connector." [4] A
Presto catalog doesn't have the concept of catalog functions, thus all
Presto functions don't have namespaces. Neither does Presto have function
DDL [5].
    - Plugin are not specific to functions - "Plugins can provide
additional Connectors, Types, Functions, and System Access Control" [6]
    - Thus, I feel a Plugin in Presto acts more as a "catalog" which is
similar to catalogs in Flink. Since all Presto functions don't have
namespaces, it probably can be seen as a built-in function module.
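
For reference, this is roughly how a Presto Plugin exposes functions,
simplified from the example in [3] (class names are taken from that doc
page; note the functions are registered globally, with no namespace):

    import com.google.common.collect.ImmutableSet;
    import java.util.Set;

    // A Plugin can also contribute connectors, types, access control, etc.;
    // functions are just one of the things it can provide.
    public class ExampleFunctionsPlugin implements Plugin {
        @Override
        public Set<Class<?>> getFunctions() {
            return ImmutableSet.<Class<?>>builder()
                .add(ExampleNullFunction.class)
                .add(ExampleAverageFunction.class)
                .build();
        }
    }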

Postgres:
    - A Postgres extension is always installed into a schema, not the entire
cluster. There's a "schema_name" param in the extension creation DDL - "The
name of the schema in which to install the extension's objects, given that
the extension allows its contents to be relocated. The named schema must
already exist. If not specified, and the extension's control file does not
specify a schema either, the current default object creation schema is
used." [7] Thus an extension is also scoped like a "catalog" (or rather a
schema), and functions in an extension are not built-in functions in
Postgres.
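
As a concrete example of that DDL (a sketch via JDBC just to illustrate;
"hstore" is an arbitrary relocatable extension and "util" an arbitrary
schema, and the connection setup is assumed):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    try (Connection conn = DriverManager.getConnection(url, user, pass);
         Statement stmt = conn.createStatement()) {
        stmt.execute("CREATE SCHEMA IF NOT EXISTS util");
        // The extension's objects (including its functions) land in the
        // "util" schema, not cluster-wide.
        stmt.execute("CREATE EXTENSION IF NOT EXISTS hstore SCHEMA util");
    }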

Therefore, I feel the examples are not exactly the "built-in function
modules" that were brought up, but feel free to correct me if I'm wrong.

Going back to the idea itself: while it seems to be a simpler concept and
design in some ways, I have two concerns.
1. The major one is still name resolution - how do we deal with name
collisions?
    - Not allowing duplicated names won't work for Hive built-in functions,
as many of them share names with Flink's, so we must allow modules
containing same-named functions to be registered
    - One assumption of this approach seems to be that, given modules are
specified in order, functions from modules can be overridden according to
that order?
    - If so, how can users reference a function that is overridden in the
above case (e.g. I may want to switch KMEANS between modules ML1 and ML2
with different implementations)? See the resolution sketch after these two
concerns.
         - If it's supported, it seems we still need some new syntax?
         - If it's not supported, that seems to be a major limitation for
users
2. The minor one is that allowing built-in functions from external systems
to be accessed within Flink so widely can bring performance issues to
users' jobs
    - Unlike potential native Flink Geo or ML functions, built-in
functions from external systems come with a pretty big performance penalty
in Flink due to data conversions and different invocation mechanisms.
Supporting Hive built-in functions is mainly for simplifying migration from
Hive. I'm not sure it makes sense when a user's job has nothing to do with
Hive data but unintentionally ends up using Hive built-in functions without
knowing they are penalized on performance. Though docs can help to some
extent, not all users really read docs in detail.
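
To illustrate concern #1, a naive order-based resolution (building on the
hypothetical FunctionModule sketch above) would make the overridden
function unreachable:

    // Walk modules in registration order; first match wins.
    public FunctionDefinition resolveFunction(String name) {
        for (FunctionModule module : modules) {       // e.g. [ML1, ML2]
            FunctionDefinition def =
                module.getFunctionDefinitions().get(name);
            if (def != null) {
                return def;  // "KMEANS" always resolves to ML1's version;
            }                // ML2's KMEANS can never be referenced
        }
        throw new ValidationException("Undefined function: " + name);
    }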

An alternative is to treat "function modules" as catalogs.
- For Flink-native function modules like Geo or ML, they can be discovered
and registered automatically at runtime under a predefined catalog name,
like "ml" or "ml1", which should be unique. Their functions are considered
built-in functions of that catalog, and can be referenced in some new
syntax like "catalog::func", e.g. "ml::kmeans" and "ml1::kmeans".
- For built-in functions from external systems (e.g. Hive), they have to be
referenced either as "catalog::func", to make sure users are explicitly
expecting those external functions, or as complementary built-in functions
to Flink if a config like "enable_hive_built_in_functions" in HiveCatalog
is turned on.
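
In user code, the two referencing styles could look like this (a sketch
only; the "::" separator, the config key, and the table/column names are
all made up):

    // Native modules: explicit catalog prefix disambiguates implementations.
    tableEnv.sqlQuery("SELECT ml::kmeans(features) FROM points");
    tableEnv.sqlQuery("SELECT ml1::kmeans(features) FROM points");

    // Hive built-in functions: explicit by default ...
    tableEnv.sqlQuery("SELECT hive::str_to_map(kv) FROM t");
    // ... or, if "enable_hive_built_in_functions" is turned on in
    // HiveCatalog, plain "str_to_map(kv)" would also resolve to Hive's
    // version as a complementary built-in function.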

Either approach seems to have its own benefits. I'm open to discussion and
would like to hear others' opinions, as well as use cases where a specific
solution is required.

Thanks,
Bowen


[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-57-Rework-FunctionCatalog-td32291.html
[2] https://www.postgresql.org/docs/10/extend-extensions.html
[3] https://prestodb.github.io/docs/current/develop/functions.html
[4]
https://prestodb.github.io/docs/current/overview/concepts.html#data-sources
[5] https://prestodb.github.io/docs/current/sql
[6] https://prestodb.github.io/docs/current/develop/spi-overview.html
[7] https://www.postgresql.org/docs/9.1/sql-createextension.html
