Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

Hao Li Thu, 28 Mar 2024 10:16:10 -0700

Thanks Timo. I'll start a vote tomorrow if no further discussion.

Thanks,
Hao


On Thu, Mar 28, 2024 at 9:33 AM Timo Walther <twal...@apache.org> wrote:

> Hi everyone,
>
> I updated the FLIP according to this discussion.
>
> @Hao Li: Let me know if I made a mistake somewhere. I added some
> additional explanatory comments about the new PTF syntax.
>
> There are no further objections from my side. If nobody objects, Hao,
> feel free to start the voting tomorrow.
>
> Regards,
> Timo
>
>
> On 28.03.24 16:30, Jark Wu wrote:
> > Thanks, Hao,
> >
> > Sounds good to me.
> >
> > Best,
> > Jark
> >
> > On Thu, 28 Mar 2024 at 01:02, Hao Li <h...@confluent.io.invalid> wrote:
> >
> >> Hi Jark,
> >>
> >> I think we can start with supporting popular model providers such as
> >> openai, azureml, sagemaker for remote models.
> >>
> >> Thanks,
> >> Hao
> >>
> >> On Tue, Mar 26, 2024 at 8:15 PM Jark Wu <imj...@gmail.com> wrote:
> >>
> >>> Thanks for the PoC and updating,
> >>>
> >>> The final syntax looks good to me, at least it is a nice and concise
> >>> first step.
> >>>
> >>> SELECT f1, f2, label FROM
> >>>     ML_PREDICT(
> >>>       input => `my_data`,
> >>>       model => `my_cat`.`my_db`.`classifier_model`,
> >>>       args => DESCRIPTOR(f1, f2));
> >>>
> >>> Besides, what built-in models will we support in the FLIP? This might be
> >>> important because it relates to what use cases can run with the new Flink
> >>> version out of the box.
> >>>
> >>> Best,
> >>> Jark
> >>>
> >>> On Wed, 27 Mar 2024 at 01:10, Hao Li <h...@confluent.io.invalid> wrote:
> >>>
> >>>> Hi Timo,
> >>>>
> >>>> Yeah. For `primary key` and `from table(...)` those are explicitly
> >>>> matched in parser: [1].
> >>>>
> >>>>> SELECT f1, f2, label FROM
> >>>>     ML_PREDICT(
> >>>>       input => `my_data`,
> >>>>       model => `my_cat`.`my_db`.`classifier_model`,
> >>>>       args => DESCRIPTOR(f1, f2));
> >>>>
> >>>> This named argument syntax looks good to me. It can be supported
> >>>> together with
> >>>>
> >>>> SELECT f1, f2, label FROM ML_PREDICT(`my_data`,
> >>>> `my_cat`.`my_db`.`classifier_model`,DESCRIPTOR(f1, f2));
> >>>>
> >>>> Sure. I will let you know once I have updated the FLIP.
> >>>>
> >>>> [1]
> >>>> https://github.com/confluentinc/flink/blob/release-1.18-confluent/flink-table/flink-sql-parser/src/main/codegen/includes/parserImpls.ftl#L814
> >>>>
> >>>> Thanks,
> >>>> Hao
> >>>>
> >>>> On Tue, Mar 26, 2024 at 4:15 AM Timo Walther <twal...@apache.org> wrote:
> >>>>
> >>>>> Hi Hao,
> >>>>>
> >>>>>   > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)`
> >>>>>   > doesn't work since `TABLE` and `MODEL` are already key words
> >>>>>
> >>>>> This argument doesn't count. The parser supports introducing keywords
> >>>>> that are still non-reserved. For example, this enables using "key" for
> >>>>> both primary key and a column name:
> >>>>>
> >>>>> CREATE TABLE t (i INT PRIMARY KEY NOT ENFORCED)
> >>>>> WITH ('connector' = 'datagen');
> >>>>>
> >>>>> SELECT i AS key FROM t;
> >>>>>
> >>>>> I'm sure we will introduce `TABLE(my_data)` eventually as this is what
> >>>>> the standard dictates. But for now, let's use the most compact syntax
> >>>>> possible which is also in sync with Oracle.
> >>>>>
> >>>>> TLDR: We allow identifiers as arguments for PTFs, which are expanded
> >>>>> with catalog and database if necessary. Those identifier arguments
> >>>>> translate to catalog lookups for tables and models. The ML_ functions
> >>>>> will make sure that the arguments are of the correct type (model or
> >>>>> table).
> >>>>>
> >>>>> SELECT f1, f2, label FROM
> >>>>>     ML_PREDICT(
> >>>>>       input => `my_data`,
> >>>>>       model => `my_cat`.`my_db`.`classifier_model`,
> >>>>>       args => DESCRIPTOR(f1, f2));
> >>>>>
> >>>>> So this will allow us to also use in the future:
> >>>>>
> >>>>> SELECT * FROM poly_func(table1);
> >>>>>
> >>>>> Same support as Oracle [1]. Very concise.
> >>>>>
> >>>>> Let me know when you have updated the FLIP for a final review before
> >>>>> voting.
> >>>>>
> >>>>> Do others have additional objections?
> >>>>>
> >>>>> Regards,
> >>>>> Timo
> >>>>>
> >>>>> [1]
> >>>>> https://livesql.oracle.com/apex/livesql/file/content_HQK7TYEO0NHSJCDY3LN2ERDV6.html
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 25.03.24 23:40, Hao Li wrote:
> >>>>>> Hi Timo,
> >>>>>>
> >>>>>>> Please double check if this is implementable with the current stack.
> >>>>>> I fear the parser or validator might not like the "identifier" argument?
> >>>>>>
> >>>>>> I checked this. Currently the validator throws an exception when
> >>>>>> trying to get the fully qualified name for `classifier_model`. But
> >>>>>> since `SqlValidatorImpl` is implemented in Flink, we should be able to
> >>>>>> fix this. The only caveat is that if the full model path is not
> >>>>>> provided, the identifier is interpreted as a column. We should be able
> >>>>>> to special-case this by rewriting the `ml_predict` function to add the
> >>>>>> catalog and database name in `FlinkCalciteSqlValidator`, though.
> >>>>>>
> >>>>>>> SELECT f1, f2, label FROM
> >>>>>>      ML_PREDICT(
> >>>>>>        TABLE `my_data`,
> >>>>>>        my_cat.my_db.classifier_model,
> >>>>>>        DESCRIPTOR(f1, f2))
> >>>>>>
> >>>>>> SELECT f1, f2, label FROM
> >>>>>>      ML_PREDICT(
> >>>>>>        input => TABLE `my_data`,
> >>>>>>        model => my_cat.my_db.classifier_model,
> >>>>>>        args => DESCRIPTOR(f1, f2))
> >>>>>>
> >>>>>> I verified these can be parsed. The problem is in the validator's
> >>>>>> handling of the qualifier, as mentioned above.
> >>>>>>
> >>>>>>> So the safest option would be the long-term solution:
> >>>>>>
> >>>>>> SELECT f1, f2, label FROM
> >>>>>>      ML_PREDICT(
> >>>>>>        input => TABLE(my_data),
> >>>>>>        model => MODEL(my_cat.my_db.classifier_model),
> >>>>>>        args => DESCRIPTOR(f1, f2))
> >>>>>>
> >>>>>> `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` don't
> >>>>>> work since `TABLE` and `MODEL` are already keywords in Calcite, used
> >>>>>> by `CREATE TABLE` and `CREATE MODEL`. Changing to `model_name(...)`
> >>>>>> works and will be treated as a function.
> >>>>>>
> >>>>>> So I think
> >>>>>>
> >>>>>> SELECT f1, f2, label FROM
> >>>>>>      ML_PREDICT(
> >>>>>>        input => TABLE `my_data`,
> >>>>>>        model => my_cat.my_db.classifier_model,
> >>>>>>        args => DESCRIPTOR(f1, f2))
> >>>>>> should be fine for now.
> >>>>>>
> >>>>>> For the syntax part:
> >>>>>> 1). Sounds good. We can drop model task and model kind from the
> >>>>>> definition. They can be deduced from the options.
> >>>>>>
> >>>>>> 2). Sure. We can add temporary models.
> >>>>>>
> >>>>>> 3). Makes sense. We can use `show create model <name>` to display all
> >>>>>> information and `describe model <name>` to show the input/output schema.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Hao
> >>>>>>
> >>>>>> On Mon, Mar 25, 2024 at 3:21 PM Hao Li <h...@confluent.io> wrote:
> >>>>>>
> >>>>>>> Hi Ahmed,
> >>>>>>>
> >>>>>>> Looks like the feature freeze time for the 1.20 release is June 15th.
> >>>>>>> We can definitely get the model DDL into 1.20. For the predict and
> >>>>>>> evaluate functions, if we can't get them into the 1.20 release, we can
> >>>>>>> get them into the 1.21 release for sure.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Hao
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Mar 25, 2024 at 1:25 AM Timo Walther <twal...@apache.org> wrote:
> >>>>>>>
> >>>>>>>> Hi Jark and Hao,
> >>>>>>>>
> >>>>>>>> thanks for the information, Jark! Great that the Calcite community
> >>>>>>>> already fixed the problem for us. +1 to adopt the simplified syntax
> >>>>>>>> asap. Maybe even before we upgrade Calcite (i.e. copy over classes),
> >>>>>>>> if upgrading Calcite is too much work right now?
> >>>>>>>>
> >>>>>>>>    > Is `DESCRIPTOR` a must in the syntax?
> >>>>>>>>
> >>>>>>>> Yes, we should still stick to the standard as much as possible, and
> >>>>>>>> all vendors use DESCRIPTOR/COLUMNS for distinguishing columns vs.
> >>>>>>>> literal arguments. So the final syntax of this discussion would be:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> SELECT f1, f2, label FROM
> >>>>>>>>      ML_PREDICT(TABLE `my_data`, `classifier_model`,
> >>>>>>>>        DESCRIPTOR(f1, f2))
> >>>>>>>>
> >>>>>>>> SELECT * FROM
> >>>>>>>>      ML_EVALUATE(TABLE `eval_data`, `classifier_model`,
> >>>>>>>>        DESCRIPTOR(f1, f2))
> >>>>>>>>
> >>>>>>>> Please double check if this is implementable with the current stack.
> >>>>>>>> I fear the parser or validator might not like the "identifier"
> >>>>>>>> argument?
> >>>>>>>>
> >>>>>>>> Make sure that also these variations are supported:
> >>>>>>>>
> >>>>>>>> SELECT f1, f2, label FROM
> >>>>>>>>      ML_PREDICT(
> >>>>>>>>        TABLE `my_data`,
> >>>>>>>>        my_cat.my_db.classifier_model,
> >>>>>>>>        DESCRIPTOR(f1, f2))
> >>>>>>>>
> >>>>>>>> SELECT f1, f2, label FROM
> >>>>>>>>      ML_PREDICT(
> >>>>>>>>        input => TABLE `my_data`,
> >>>>>>>>        model => my_cat.my_db.classifier_model,
> >>>>>>>>        args => DESCRIPTOR(f1, f2))
> >>>>>>>>
> >>>>>>>> It might be safer and more future-proof to wrap a MODEL() function
> >>>>>>>> around it. This would be more in sync with the standard, which
> >>>>>>>> actually still requires putting a TABLE() around the input argument:
> >>>>>>>>
> >>>>>>>> ML_PREDICT(TABLE(`my_data`) PARTITION BY c1 ORDER BY c1, ....)
> >>>>>>>>
> >>>>>>>> So the safest option would be the long-term solution:
> >>>>>>>>
> >>>>>>>> SELECT f1, f2, label FROM
> >>>>>>>>      ML_PREDICT(
> >>>>>>>>        input => TABLE(my_data),
> >>>>>>>>        model => MODEL(my_cat.my_db.classifier_model),
> >>>>>>>>        args => DESCRIPTOR(f1, f2))
> >>>>>>>>
> >>>>>>>> But I'm fine with this if others have a strong opinion:
> >>>>>>>>
> >>>>>>>> SELECT f1, f2, label FROM
> >>>>>>>>      ML_PREDICT(
> >>>>>>>>        input => TABLE `my_data`,
> >>>>>>>>        model => my_cat.my_db.classifier_model,
> >>>>>>>>        args => DESCRIPTOR(f1, f2))
> >>>>>>>>
> >>>>>>>> Some feedback for the remainder of the FLIP:
> >>>>>>>>
> >>>>>>>> 1) Simplify catalog objects
> >>>>>>>>
> >>>>>>>> I would suggest to drop:
> >>>>>>>> CatalogModel.getModelKind()
> >>>>>>>> CatalogModel.getModelTask()
> >>>>>>>>
> >>>>>>>> A catalog object should fully resemble the DDL. And since the DDL
> >>>>>>>> puts those properties in the WITH clause, the catalog object should
> >>>>>>>> do the same (i.e. put them into `getModelOptions()`). Btw, renaming
> >>>>>>>> this method to just `getOptions()` for consistency should be good as
> >>>>>>>> well. Internally, we can still provide enums for these frequently
> >>>>>>>> used classes, similar to what we do in `FactoryUtil` for other
> >>>>>>>> frequently used options.
> >>>>>>>>
> >>>>>>>> Remove `getDescription()` and `getDetailedDescription()`. They were
> >>>>>>>> a mistake for CatalogTable and should actually be deprecated. They
> >>>>>>>> got replaced by `getComment()` which is sufficient.
> >>>>>>>>
> >>>>>>>> 2) CREATE TEMPORARY MODEL is not supported.
> >>>>>>>>
> >>>>>>>> This is an unnecessary restriction. We should support temporary
> >>>>>>>> versions of these catalog objects as well for consistency. Adding
> >>>>>>>> support for this should be straightforward.
> >>>>>>>>
> >>>>>>>> 3) { DESCRIBE | DESC } MODEL [catalog_name.][database_name.]model_name
> >>>>>>>>
> >>>>>>>> I would suggest we support `SHOW CREATE MODEL` instead. Similar to
> >>>>>>>> `SHOW CREATE TABLE`, this should show all properties. If we support
> >>>>>>>> `DESCRIBE MODEL`, it should only list the input parameters, similar
> >>>>>>>> to how `DESCRIBE TABLE` only shows the columns (not the WITH clause).
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Timo
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 23.03.24 13:17, Ahmed Hamdy wrote:
> >>>>>>>>> Hi everyone,
> >>>>>>>>> +1 for this proposal. I believe it is very useful at the minimum. It
> >>>>>>>>> would be great to even have "ML_PREDICT" and "ML_EVALUATE" as built-in
> >>>>>>>>> PTFs in this FLIP, as discussed.
> >>>>>>>>> IIUC this will be included in the 1.20 roadmap?
> >>>>>>>>> Best Regards
> >>>>>>>>> Ahmed Hamdy
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Fri, 22 Mar 2024 at 23:54, Hao Li <h...@confluent.io.invalid> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Timo and Jark,
> >>>>>>>>>>
> >>>>>>>>>> I agree Oracle's syntax seems concise and more descriptive. For the
> >>>>>>>>>> built-in `ML_PREDICT` and `ML_EVALUATE` functions, I agree with Jark
> >>>>>>>>>> that we can support them as built-in PTFs using `SqlTableFunction`
> >>>>>>>>>> for this FLIP. We can have a different FLIP discussing user-defined
> >>>>>>>>>> PTFs and adopt that for model functions later. To summarize, the
> >>>>>>>>>> current proposed syntax is
> >>>>>>>>>>
> >>>>>>>>>> SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
> >>>>>>>>>> `classifier_model`, f1, f2))
> >>>>>>>>>>
> >>>>>>>>>> SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`,
> >>>>>>>>>> `classifier_model`, f1, f2))
> >>>>>>>>>>
> >>>>>>>>>> Is `DESCRIPTOR` a must in the syntax? If so, it becomes
> >>>>>>>>>>
> >>>>>>>>>> SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
> >>>>>>>>>> `classifier_model`, DESCRIPTOR(f1), DESCRIPTOR(f2)))
> >>>>>>>>>>
> >>>>>>>>>> SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`,
> >>>>>>>>>> `classifier_model`, DESCRIPTOR(f1), DESCRIPTOR(f2)))
> >>>>>>>>>>
> >>>>>>>>>> If Calcite supports dropping the outer TABLE keyword, it becomes
> >>>>>>>>>>
> >>>>>>>>>> SELECT f1, f2, label FROM ML_PREDICT(TABLE `my_data`,
> >>>>>>>>>> `classifier_model`, DESCRIPTOR(f1), DESCRIPTOR(f2))
> >>>>>>>>>>
> >>>>>>>>>> SELECT * FROM ML_EVALUATE(TABLE `eval_data`,
> >>>>>>>>>> `classifier_model`, DESCRIPTOR(f1), DESCRIPTOR(f2))
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Hao
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Mar 22, 2024 at 9:16 AM Jark Wu <imj...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Sorry, I meant that we can bump the Calcite version if needed in
> >>>>>>>>>>> Flink 1.20.
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, 22 Mar 2024 at 22:19, Jark Wu <imj...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Timo,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Introducing user-defined PTF is very useful in Flink, I'm +1 for
> >>>>>>>>>>>> this. But I think the ML model FLIP is not blocked by this, because
> >>>>>>>>>>>> we can introduce ML_PREDICT and ML_EVALUATE as built-in PTFs
> >>>>>>>>>>>> just like TUMBLE/HOP. And support user-defined ML functions as
> >>>>>>>>>>>> a future FLIP.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regarding the simplified PTF syntax which removes the outer TABLE()
> >>>>>>>>>>>> keyword, it seems it was just supported[1] by the Calcite community
> >>>>>>>>>>>> last month and will be released in the next version (v1.37). The
> >>>>>>>>>>>> Calcite community is preparing the 1.37 release, so we can bump the
> >>>>>>>>>>>> version if needed in Flink 1.19.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Jark
> >>>>>>>>>>>>
> >>>>>>>>>>>> [1]: https://issues.apache.org/jira/browse/CALCITE-6254
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, 22 Mar 2024 at 21:46, Timo Walther <twal...@apache.org> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> this is a very important change to the Flink SQL syntax but we can't
> >>>>>>>>>>>>> wait until the SQL standard is ready for this. So I'm +1 on
> >>>>>>>>>>>>> introducing the MODEL concept as a first class citizen in Flink.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For your information: Over the past months I have already spent a
> >>>>>>>>>>>>> significant amount of time thinking about how we can introduce PTFs
> >>>>>>>>>>>>> in Flink. I reserved FLIP-440[1] for this purpose and I will share a
> >>>>>>>>>>>>> version of this in the next 1-2 weeks.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For a good implementation of FLIP-440 and also FLIP-437, we should
> >>>>>>>>>>>>> evolve the PTF syntax in collaboration with Apache Calcite.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> There are different syntax versions out there:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1) Flink
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> SELECT * FROM
> >>>>>>>>>>>>>       TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2) SQL standard
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> SELECT * FROM
> >>>>>>>>>>>>>       TABLE(TUMBLE(TABLE(Bid), DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 3) Oracle
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> SELECT * FROM
> >>>>>>>>>>>>>        TUMBLE(Bid, COLUMNS(bidtime), INTERVAL '10' MINUTES);
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As you can see above, Flink does not follow the standard correctly as
> >>>>>>>>>>>>> it would need to use `TABLE()` but this is not provided by Calcite
> >>>>>>>>>>>>> yet.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I really like the Oracle syntax[2][3] a lot. It reduces necessary
> >>>>>>>>>>>>> keywords to a minimum. Personally, I would like to discuss this
> >>>>>>>>>>>>> syntax in a separate FLIP and hope I will find supporters for:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> SELECT * FROM
> >>>>>>>>>>>>>       TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES);
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If we go entirely with the Oracle syntax, as you can see in the
> >>>>>>>>>>>>> example, Oracle allows for passing identifiers directly. This would
> >>>>>>>>>>>>> solve our problems for the MODEL as well:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> SELECT f1, f2, label FROM ML_PREDICT(
> >>>>>>>>>>>>>       data => `my_data`,
> >>>>>>>>>>>>>       model => `classifier_model`,
> >>>>>>>>>>>>>       input => DESCRIPTOR(f1, f2));
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Or we completely adopt the Oracle syntax:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> SELECT f1, f2, label FROM ML_PREDICT(
> >>>>>>>>>>>>>       data => `my_data`,
> >>>>>>>>>>>>>       model => `classifier_model`,
> >>>>>>>>>>>>>       input => COLUMNS(f1, f2));
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> What do you think?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Happy to create a FLIP for just this syntax question and collaborate
> >>>>>>>>>>>>> with the Calcite community on this. It shouldn't be too hard to
> >>>>>>>>>>>>> convince them to support the Oracle syntax, at least as a parser
> >>>>>>>>>>>>> parameter.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Timo
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [1]
> >>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-440%3A+User-defined+Polymorphic+Table+Functions
> >>>>>>>>>>>>> [2]
> >>>>>>>>>>>>> https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_TF.html#GUID-0F66E239-DE77-4C0E-AC76-D5B632AB8072
> >>>>>>>>>>>>> [3]
> >>>>>>>>>>>>> https://oracle-base.com/articles/18c/polymorphic-table-functions-18c
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 20.03.24 17:22, Mingge Deng wrote:
> >>>>>>>>>>>>>> Thanks Jark for all the insightful comments.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> We have updated the proposal per our offline discussions:
> >>>>>>>>>>>>>> 1. Model will be treated as a new relation in FlinkSQL.
> >>>>>>>>>>>>>> 2. Include the common ML predict and evaluate functions into the
> >>>>>>>>>>>>>> open source Flink to complete the user journey.
> >>>>>>>>>>>>>>         And we should be able to extend the Calcite SqlTableFunction
> >>>>>>>>>>>>>> to support these two ML functions.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Mingge
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Mon, Mar 18, 2024 at 7:05 PM Jark Wu <imj...@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Hao,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I meant how the table name
> >>>>>>>>>>>>>>> in window TVF gets translated to `SqlCallingBinding`. Probably we
> >>>>>>>>>>>>>>> need to fetch the table definition from the catalog somewhere. Do we
> >>>>>>>>>>>>>>> treat those window TVF specially in parser/planner so that catalog
> >>>>>>>>>>>>>>> is looked up when they are seen?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The table names are resolved and validated by the Calcite
> >>>>>>>>>>>>>>> SqlValidator. We don't need to fetch them from the catalog manually.
> >>>>>>>>>>>>>>> The specific checking logic of the cumulate window happens in
> >>>>>>>>>>>>>>> SqlCumulateTableFunction.OperandMetadataImpl#checkOperandTypes.
> >>>>>>>>>>>>>>> The return type of SqlCumulateTableFunction is defined in the
> >>>>>>>>>>>>>>> #getRowTypeInference() method.
> >>>>>>>>>>>>>>> Both are public interfaces provided by Calcite, and it seems they
> >>>>>>>>>>>>>>> are not specially handled in the parser/planner.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I didn't try that, but my gut feeling is that the framework is ready
> >>>>>>>>>>>>>>> to be extended with a customized TVF.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> For what model is, I'm wondering if it has to be datatype or
> >>>>>>>>>>>>>>> relation. Can it be another kind of citizen parallel to
> >>>>>>>>>>>>>>> datatype/relation/function/db? Redshift also supports `show models`
> >>>>>>>>>>>>>>> operation, so it seems it's treated specially as well?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If it is an entity only used in catalog scope (e.g., show xxx,
> >>>>>>>>>>>>>>> create xxx, drop xxx), it is fine to introduce it.
> >>>>>>>>>>>>>>> We have introduced such an entity before, called Module: "load
> >>>>>>>>>>>>>>> module", "show modules" [1].
> >>>>>>>>>>>>>>> But if we want to use Model in TVF parameters, it means it has to be
> >>>>>>>>>>>>>>> a relation or datatype, because those are the only things TVF
> >>>>>>>>>>>>>>> parameters accept now.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for sharing the reason for preferring TVF over the Redshift
> >>>>>>>>>>>>>>> way. It sounds reasonable to me.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [1]:
> >>>>>>>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/modules/
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, 15 Mar 2024 at 13:41, Hao Li <h...@confluent.io.invalid> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Jark,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for the pointer. Sorry for the confusion: I meant how the
> >>>>>>>>>>>>>>>> table name in window TVF gets translated to `SqlCallingBinding`.
> >>>>>>>>>>>>>>>> Probably we need to fetch the table definition from the catalog
> >>>>>>>>>>>>>>>> somewhere. Do we treat those window TVF specially in parser/planner
> >>>>>>>>>>>>>>>> so that catalog is looked up when they are seen?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> For what model is, I'm wondering if it has to be datatype or
> >>>>>>>>>>>>>>>> relation. Can it be another kind of citizen parallel to
> >>>>>>>>>>>>>>>> datatype/relation/function/db? Redshift also supports `show models`
> >>>>>>>>>>>>>>>> operation, so it seems it's treated specially as well? The reasons
> >>>>>>>>>>>>>>>> I don't like Redshift's syntax are:
> >>>>>>>>>>>>>>>> 1. It's a bit verbose, you need to think of a model name as well as
> >>>>>>>>>>>>>>>> a function name and the function name also needs to be unique.
> >>>>>>>>>>>>>>>> 2. More importantly, prediction function isn't the only function
> >>>>>>>>>>>>>>>> that can operate on models. There could be a set of inference
> >>>>>>>>>>>>>>>> functions [1] and evaluation functions [2] which can operate on
> >>>>>>>>>>>>>>>> models. It's hard to specify all of them in model creation.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> [1]:
> >>>>>>>>>>>>>>>> https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-predict
> >>>>>>>>>>>>>>>> [2]:
> >>>>>>>>>>>>>>>> https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>> Hao
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Mar 14, 2024 at 8:18 PM Jark Wu <imj...@gmail.com> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Hao,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Can you send me some pointers
> >>>>>>>>>>>>>>>>> where the function gets the table information?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Here is the code of cumulate window type checking [1].
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Also is it possible to support <query_stmt> in
> >>>>>>>>>>>>>>>>> window functions in addition to table?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Yes. It is not allowed in TVF.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for the syntax links of other systems. The reason I prefer
> >>>>>>>>>>>>>>>>> the Redshift way is that it avoids introducing Model as a relation
> >>>>>>>>>>>>>>>>> or datatype (referenced as a parameter in TVF).
> >>>>>>>>>>>>>>>>> Model is not a relation because it can't be queried directly
> >>>>>>>>>>>>>>>>> (e.g., SELECT * FROM model).
> >>>>>>>>>>>>>>>>> I'm also confused about making Model a datatype, because I don't
> >>>>>>>>>>>>>>>>> know what class the model parameter of the eval method of
> >>>>>>>>>>>>>>>>> TableFunction/ScalarFunction should be. By defining the function
> >>>>>>>>>>>>>>>>> with the model, users can directly invoke the function without
> >>>>>>>>>>>>>>>>> reference to the model name.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> [1]:
> >>>>>>>>>>>>>>>>> https://github.com/apache/flink/blob/d6c7eee8243b4fe3e593698f250643534dc79cb5/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/sql/SqlCumulateTableFunction.java#L53
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Fri, 15 Mar 2024 at 02:48, Hao Li <h...@confluent.io.invalid> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi Jark,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks for the pointers. It's very helpful.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 1. Looks like `tumble`, `hopping` are keywords in the Calcite
> >>>>>>>>>>>>>>>>>> parser. And the syntax `cumulate(Table my_table, ...)` needs to get
> >>>>>>>>>>>>>>>>>> table information from the catalog somewhere for type validation
> >>>>>>>>>>>>>>>>>> etc. Can you send me some pointers where the function gets the
> >>>>>>>>>>>>>>>>>> table information?
> >>>>>>>>>>>>>>>>>> 2. The ideal syntax for the model function I think would be
> >>>>>>>>>>>>>>>>>> `ML_PREDICT(MODEL <model_name>, {table <table_name> | (query_stmt) })`.
> >>>>>>>>>>>>>>>>>> I think with special handling of the `ML_PREDICT` function in
> >>>>>>>>>>>>>>>>>> parser/planner, maybe we can do this like window functions. But to
> >>>>>>>>>>>>>>>>>> support the `MODEL` keyword, we need a Calcite parser change I
> >>>>>>>>>>>>>>>>>> guess. Also is it possible to support <query_stmt> in window
> >>>>>>>>>>>>>>>>>> functions in addition to table?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> For the Redshift syntax, I'm not sure about the purpose of defining
> >>>>>>>>>>>>>>>>>> the function name with the model. Is it to define the function
> >>>>>>>>>>>>>>>>>> input/output schema? We have the schema in our create model syntax
> >>>>>>>>>>>>>>>>>> and `ML_PREDICT` can handle it by getting the model definition. I
> >>>>>>>>>>>>>>>>>> think our syntax is more concise with a generic prediction
> >>>>>>>>>>>>>>>>>> function. I also did some research and it's the syntax used by
> >>>>>>>>>>>>>>>>>> Databricks `ai_query` [1], Snowflake `predict` [2], Azureml
> >>>>>>>>>>>>>>>>>> `predict` [3].
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> [1]:
> >>>>>>>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/functions/ai_query.html
> >>>>>>>>>>>>>>>>>> [2]:
> >>>>>>>>>>>>>>>>>> https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_inference.ipynb?_fsi=sksXUwQ0
> >>>>>>>>>>>>>>>>>> [3]:
> >>>>>>>>>>>>>>>>>> https://learn.microsoft.com/en-us/sql/machine-learning/tutorials/quickstart-python-train-score-model?view=azuresqldb-mi-current
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>> Hao
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Wed, Mar 13, 2024 at 8:57 PM Jark Wu <imj...@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Mingge, Hao,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks for your replies.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> PTF is actually the ideal approach for model functions, and we do
> >>>>>>>>>>>>>>>>>>> have the plans to use PTF for all model functions (including
> >>>>>>>>>>>>>>>>>>> prediction, evaluation etc..) once the PTF is supported in FlinkSQL
> >>>>>>>>>>>>>>>>>>> confluent extension.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> It sounds like PTF is the ideal way and the table function is a
> >>>>>>>>>>>>>>>>>>> temporary solution which will be dropped in the future.
> >>>>>>>>>>>>>>>>>>> I'm not sure whether we can implement it using PTF in Flink SQL,
> >>>>>>>>>>>>>>>>>>> but we have implemented window functions using PTF [1] and
> >>>>>>>>>>>>>>>>>>> introduced a new window function (called CUMULATE [2]) in Flink
> >>>>>>>>>>>>>>>>>>> SQL based on this. I think it might work to use PTF and implement
> >>>>>>>>>>>>>>>>>>> the model function syntax like this:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> SELECT * FROM TABLE(ML_PREDICT(
> >>>>>>>>>>>>>>>>>>>       TABLE my_table,
> >>>>>>>>>>>>>>>>>>>       my_model,
> >>>>>>>>>>>>>>>>>>>       col1,
> >>>>>>>>>>>>>>>>>>>       col2
> >>>>>>>>>>>>>>>>>>> ));
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Besides, did you consider following the approach of AWS Redshift
> >>>>>>>>>>>>>>>>>>> [3], which defines the model function together with the model
> >>>>>>>>>>>>>>>>>>> itself? IIUC, a model is a black box that defines input
> >>>>>>>>>>>>>>>>>>> parameters and output parameters, which can be modeled as
> >>>>>>>>>>>>>>>>>>> functions.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> [1]: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/window-tvf/#session
> >>>>>>>>>>>>>>>>>>> [2]: https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function#FLIP145:SupportSQLwindowingtablevaluedfunction-CumulatingWindows
> >>>>>>>>>>>>>>>>>>> [3]: https://github.com/aws-samples/amazon-redshift-ml-getting-started/blob/main/use-cases/bring-your-own-model-remote-inference/README.md#create-model
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Wed, 13 Mar 2024 at 15:00, Hao Li <h...@confluent.io.invalid> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Jark,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for your questions. These are good questions!
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 1. The polymorphic table function I was referring to takes a table
> >>>>>>>>>>>>>>>>>>>> as input and outputs a table. So the syntax would be like
> >>>>>>>>>>>>>>>>>>>> ```
> >>>>>>>>>>>>>>>>>>>> SELECT * FROM ML_PREDICT('model', (SELECT * FROM my_table))
> >>>>>>>>>>>>>>>>>>>> ```
> >>>>>>>>>>>>>>>>>>>> As far as I know, this is not supported yet in Flink. So before
> >>>>>>>>>>>>>>>>>>>> it's supported, one option for the predict function is using a
> >>>>>>>>>>>>>>>>>>>> table function, which can output multiple columns
> >>>>>>>>>>>>>>>>>>>> ```
> >>>>>>>>>>>>>>>>>>>> SELECT * FROM my_table, LATERAL VIEW (ML_PREDICT('model', col1, col2))
> >>>>>>>>>>>>>>>>>>>> ```
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 2. Good question. Type inference is hard for the `ML_PREDICT`
> >>>>>>>>>>>>>>>>>>>> function because it takes a model name string as input. I can
> >>>>>>>>>>>>>>>>>>>> think of three ways of doing type inference for it.
> >>>>>>>>>>>>>>>>>>>>        1). Treat the `ML_PREDICT` function as something special:
> >>>>>>>>>>>>>>>>>>>> if it's encountered during SQL parsing or planning, we look up
> >>>>>>>>>>>>>>>>>>>> the model in the catalog using the first argument, which is a
> >>>>>>>>>>>>>>>>>>>> model name. Then we can infer the input/output for the function.
> >>>>>>>>>>>>>>>>>>>>        2). We can define a `model` keyword and use that in the
> >>>>>>>>>>>>>>>>>>>> predict function to indicate the argument refers to a model. So
> >>>>>>>>>>>>>>>>>>>> it's like `ML_PREDICT(model 'my_model', col1, col2)`
> >>>>>>>>>>>>>>>>>>>>        3). We can create a special type of table function, maybe
> >>>>>>>>>>>>>>>>>>>> called `ModelFunction`, which can resolve the model type
> >>>>>>>>>>>>>>>>>>>> inference by handling it specially during parsing or planning.
> >>>>>>>>>>>>>>>>>>>> 1) is hacky, 2) isn't supported in Flink for functions, 3) might
> >>>>>>>>>>>>>>>>>>>> be a good option.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 3. I sketched the `ML_PREDICT` function for inference. But there
> >>>>>>>>>>>>>>>>>>>> are limitations of the function mentioned in 1 and 2. So maybe we
> >>>>>>>>>>>>>>>>>>>> don't need to introduce them as built-in functions until
> >>>>>>>>>>>>>>>>>>>> polymorphic table functions are supported and we can properly
> >>>>>>>>>>>>>>>>>>>> deal with type inference.
> >>>>>>>>>>>>>>>>>>>> After that, defining a user-defined model function should also be
> >>>>>>>>>>>>>>>>>>>> straightforward.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 4. For model types, do you mean 'remote', 'import', 'native'
> >>>>>>>>>>>>>>>>>>>> models or other things?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> 5. We could support popular providers such as 'azureml',
> >>>>>>>>>>>>>>>>>>>> 'vertexai', 'googleai' as long as we support the `ML_PREDICT`
> >>>>>>>>>>>>>>>>>>>> function. Users should be able to implement 3rd-party providers
> >>>>>>>>>>>>>>>>>>>> if they can implement a function handling the input/output for
> >>>>>>>>>>>>>>>>>>>> the provider.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I think for the model functions, there are still dependencies or
> >>>>>>>>>>>>>>>>>>>> hacks we need to sort out as a built-in function. Maybe we can
> >>>>>>>>>>>>>>>>>>>> separate that as a follow up if we want to have it built-in and
> >>>>>>>>>>>>>>>>>>>> focus on the model syntax for this FLIP?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>> Hao
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Tue, Mar 12, 2024 at 10:33 PM Jark Wu <imj...@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Mingge, Chris, Hao,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks for proposing this interesting idea. I think this is a
> >>>>>>>>>>>>>>>>>>>>> nice step towards the AI world for Apache Flink. I don't know
> >>>>>>>>>>>>>>>>>>>>> much about AI/ML, so I may have some stupid questions.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 1. Could you tell more about why the polymorphic table function
> >>>>>>>>>>>>>>>>>>>>> (PTF) doesn't work, and do we have a plan to use PTF for model
> >>>>>>>>>>>>>>>>>>>>> functions?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 2. What kind of object does the model map to in SQL? A relation
> >>>>>>>>>>>>>>>>>>>>> or a data type?
> >>>>>>>>>>>>>>>>>>>>> It looks like a data type because we use it as a parameter of
> >>>>>>>>>>>>>>>>>>>>> the table function.
> >>>>>>>>>>>>>>>>>>>>> If it is a data type, how does it cooperate with type
> >>>>>>>>>>>>>>>>>>>>> inference [1]?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 3. What built-in model functions will we support? How to define
> >>>>>>>>>>>>>>>>>>>>> a user-defined model function?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 4. What built-in model types will we support? How to define a
> >>>>>>>>>>>>>>>>>>>>> user-defined model type?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 5. Regarding the remote model, what providers will we support?
> >>>>>>>>>>>>>>>>>>>>> Can users implement 3rd-party providers other than OpenAI?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> [1]: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#type-inference
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Wed, 13 Mar 2024 at 05:55, Hao Li <h...@confluent.io.invalid> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hi, Dev
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Mingge, Chris and I would like to start a discussion about
> >>>>>>>>>>>>>>>>>>>>>> FLIP-437: Support ML Models in Flink SQL.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> This FLIP is proposing to support machine learning models in
> >>>>>>>>>>>>>>>>>>>>>> Flink SQL syntax so that users can CRUD models with Flink SQL
> >>>>>>>>>>>>>>>>>>>>>> and use models on Flink to do prediction with Flink data. The
> >>>>>>>>>>>>>>>>>>>>>> FLIP also proposes new model entities and changes to catalog
> >>>>>>>>>>>>>>>>>>>>>> interface to support model CRUD operations in catalog.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> For more details, see FLIP-437 [1]. Looking forward to your
> >>>>>>>>>>>>>>>>>>>>>> feedback.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-437%3A+Support+ML+Models+in+Flink+SQL
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>> Mingge, Chris & Hao
> >>>>>>>>>>>>>>>>>>>>>>

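Option 1) from the type-inference discussion quoted above (look up the
model named by the first argument in the catalog, then infer the
function's input/output) can be sketched roughly as below. This is only
an illustration in Python, not Flink planner code: `ModelSchema`,
`CATALOG`, and `infer_ml_predict_output` are hypothetical names, the
catalog is a plain dict, and the assumption that the output row is the
input table's columns followed by the model's output columns is mine,
not something the thread states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSchema:
    """Hypothetical catalog entry: declared input and output columns."""
    inputs: dict   # column name -> SQL type name
    outputs: dict  # column name -> SQL type name


# Hypothetical in-memory catalog keyed by model name.
CATALOG = {
    "classifier_model": ModelSchema(
        inputs={"f1": "INT", "f2": "STRING"},
        outputs={"label": "STRING"},
    ),
}


def infer_ml_predict_output(model_name, table_schema, arg_columns):
    """Infer the output row type for a predict call at planning time."""
    # Step 1: resolve the model name against the catalog.
    model = CATALOG.get(model_name)
    if model is None:
        raise LookupError(f"model not found in catalog: {model_name}")

    # Step 2: check the passed columns against the model's declared inputs.
    missing = [c for c in arg_columns if c not in table_schema]
    if missing:
        raise LookupError(f"columns not found in input table: {missing}")
    arg_types = [table_schema[c] for c in arg_columns]
    expected = list(model.inputs.values())
    if arg_types != expected:
        raise TypeError(
            f"argument types {arg_types} do not match model inputs {expected}"
        )

    # Step 3 (assumption): output = input table columns + model outputs.
    return {**table_schema, **model.outputs}
```

With a table schema of `{"id": "BIGINT", "f1": "INT", "f2": "STRING"}`
and arguments `["f1", "f2"]`, the sketch yields the table's columns plus
`label`. The point is only that the output type depends on a catalog
lookup, which ordinary per-function type inference does not perform.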