Hi Timo,

Yeah. For `primary key` and `from table(...)` those are explicitly matched
in parser: [1].

> SELECT f1, f2, label FROM
   ML_PREDICT(
     input => `my_data`,
     model => `my_cat`.`my_db`.`classifier_model`,
     args => DESCRIPTOR(f1, f2));

This named argument syntax looks good to me. It can be supported together
with

SELECT f1, f2, label FROM ML_PREDICT(`my_data`,
`my_cat`.`my_db`.`classifier_model`,DESCRIPTOR(f1, f2));

Sure. Will let you know once updated the FLIP.

[1]
https://github.com/confluentinc/flink/blob/release-1.18-confluent/flink-table/flink-sql-parser/src/main/codegen/includes/parserImpls.ftl#L814

Thanks,
Hao

On Tue, Mar 26, 2024 at 4:15 AM Timo Walther <twal...@apache.org> wrote:

> Hi Hao,
>
>  > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` doesn't
>  > work since `TABLE` and `MODEL` are already key words
>
> This argument doesn't count. The parser supports introducing keywords
> that are still non-reserved. For example, this enables using "key" for
> both primary key and a column name:
>
> CREATE TABLE t (i INT PRIMARY KEY NOT ENFORCED)
> WITH ('connector' = 'datagen');
>
> SELECT i AS key FROM t;
>
> I'm sure we will introduce `TABLE(my_data)` eventually as this is what
> the standard dictates. But for now, let's use the most compact syntax
> possible which is also in sync with Oracle.
>
> TLDR: We allow identifiers as arguments for PTFs which are expanded with
> catalog and database if necessary. Those identifier arguments translate
> to catalog lookups for table and models. The ML_ functions will make
> sure that the arguments are of correct type model or table.
>
> SELECT f1, f2, label FROM
>    ML_PREDICT(
>      input => `my_data`,
>      model => `my_cat`.`my_db`.`classifier_model`,
>      args => DESCRIPTOR(f1, f2));
>
> So this will allow us to also use in the future:
>
> SELECT * FROM poly_func(table1);
>
> Same support as Oracle [1]. Very concise.
>
> Let me know when you updated the FLIP for a final review before voting.
>
> Do others have additional objections?
>
> Regards,
> Timo
>
> [1]
>
> https://livesql.oracle.com/apex/livesql/file/content_HQK7TYEO0NHSJCDY3LN2ERDV6.html
>
>
>
> On 25.03.24 23:40, Hao Li wrote:
> > Hi Timo,
> >
> >> Please double check if this is implementable with the current stack. I
> > fear the parser or validator might not like the "identifier" argument?
> >
> > I checked this, currently the validator throws an exception trying to get
> > the full qualifier name for `classifier_model`. But since
> > `SqlValidatorImpl` is implemented in Flink, we should be able to fix
> this.
> > The only caveator is if not full model path is provided,
> > the qualifier is interpreted as a column. We should be able to special
> > handle this by rewriting the `ml_predict` function to add the catalog and
> > database name in `FlinkCalciteSqlValidator` though.
> >
> >> SELECT f1, f2, label FROM
> >     ML_PREDICT(
> >       TABLE `my_data`,
> >       my_cat.my_db.classifier_model,
> >       DESCRIPTOR(f1, f2))
> >
> > SELECT f1, f2, label FROM
> >     ML_PREDICT(
> >       input => TABLE `my_data`,
> >       model => my_cat.my_db.classifier_model,
> >       args => DESCRIPTOR(f1, f2))
> >
> > I verified these can be parsed. The problem is in validator for qualifier
> > as mentioned above.
> >
> >> So the safest option would be the long-term solution:
> >
> > SELECT f1, f2, label FROM
> >     ML_PREDICT(
> >       input => TABLE(my_data),
> >       model => MODEL(my_cat.my_db.classifier_model),
> >       args => DESCRIPTOR(f1, f2))
> >
> > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` doesn't work
> > since `TABLE` and `MODEL` are already key words in calcite used by
> `CREATE
> > TABLE`, `CREATE MODEL`. Changing to `model_name(...)` works and will be
> > treated as a function.
> >
> > So I think
> >
> > SELECT f1, f2, label FROM
> >     ML_PREDICT(
> >       input => TABLE `my_data`,
> >       model => my_cat.my_db.classifier_model,
> >       args => DESCRIPTOR(f1, f2))
> > should be fine for now.
> >
> > For the syntax part:
> > 1). Sounds good. We can drop model task and model kind from the
> definition.
> > They can be deduced from the options.
> >
> > 2). Sure. We can add temporary model
> >
> > 3). Make sense. We can use `show create model <name>` to display all
> > information and `describe model <name>` to show input/output schema
> >
> > Thanks,
> > Hao
> >
> > On Mon, Mar 25, 2024 at 3:21 PM Hao Li <h...@confluent.io> wrote:
> >
> >> Hi Ahmed,
> >>
> >> Looks like the feature freeze time for 1.20 release is June 15th. We can
> >> definitely get the model DDL into 1.20. For predict and evaluate
> functions,
> >> if we can't get into the 1.20 release, we can get them into the 1.21
> >> release for sure.
> >>
> >> Thanks,
> >> Hao
> >>
> >>
> >>
> >> On Mon, Mar 25, 2024 at 1:25 AM Timo Walther <twal...@apache.org>
> wrote:
> >>
> >>> Hi Jark and Hao,
> >>>
> >>> thanks for the information, Jark! Great that the Calcite community
> >>> already fixed the problem for us. +1 to adopt the simplified syntax
> >>> asap. Maybe even before we upgrade Calcite (i.e. copy over classes), if
> >>> upgrading Calcite is too much work right now?
> >>>
> >>>   > Is `DESCRIPTOR` a must in the syntax?
> >>>
> >>> Yes, we should still stick to the standard as much as possible and all
> >>> vendors use DESCRIPTOR/COLUMNS for distinuishing columns vs. literal
> >>> arguments. So the final syntax of this discussion would be:
> >>>
> >>>
> >>> SELECT f1, f2, label FROM
> >>>     ML_PREDICT(TABLE `my_data`, `classifier_model`, DESCRIPTOR(f1, f2))
> >>>
> >>> SELECT * FROM
> >>>     ML_EVALUATE(TABLE `eval_data`, `classifier_model`, DESCRIPTOR(f1,
> f2))
> >>>
> >>> Please double check if this is implementable with the current stack. I
> >>> fear the parser or validator might not like the "identifier" argument?
> >>>
> >>> Make sure that also these variations are supported:
> >>>
> >>> SELECT f1, f2, label FROM
> >>>     ML_PREDICT(
> >>>       TABLE `my_data`,
> >>>       my_cat.my_db.classifier_model,
> >>>       DESCRIPTOR(f1, f2))
> >>>
> >>> SELECT f1, f2, label FROM
> >>>     ML_PREDICT(
> >>>       input => TABLE `my_data`,
> >>>       model => my_cat.my_db.classifier_model,
> >>>       args => DESCRIPTOR(f1, f2))
> >>>
> >>> It might be safer and more future proof to wrap a MODEL() function
> >>> around it. This would be more in sync with the standard that actually
> >>> still requires to put a TABLE() around the input argument:
> >>>
> >>> ML_PREDICT(TABLE(`my_data`) PARTITIONED BY c1 ORDERED BY c1, ....)
> >>>
> >>> So the safest option would be the long-term solution:
> >>>
> >>> SELECT f1, f2, label FROM
> >>>     ML_PREDICT(
> >>>       input => TABLE(my_data),
> >>>       model => MODEL(my_cat.my_db.classifier_model),
> >>>       args => DESCRIPTOR(f1, f2))
> >>>
> >>> But I'm fine with this if others have a strong opinion:
> >>>
> >>> SELECT f1, f2, label FROM
> >>>     ML_PREDICT(
> >>>       input => TABLE `my_data`,
> >>>       model => my_cat.my_db.classifier_model,
> >>>       args => DESCRIPTOR(f1, f2))
> >>>
> >>> Some feedback for the remainder of the FLIP:
> >>>
> >>> 1) Simplify catalog objects
> >>>
> >>> I would suggest to drop:
> >>> CatalogModel.getModelKind()
> >>> CatalogModel.getModelTask()
> >>>
> >>> A catalog object should fully resemble the DDL. And since the DDL puts
> >>> those properties in the WITH clause, the catalog object should the same
> >>> (i.e. put them into the `getModelOptions()`). Btw renaming this method
> >>> to just `getOptions()` for consistency should be good as well.
> >>> Internally, we can still provide enums for these frequently used
> >>> classes. Similar to what we do in `FactoryUtil` for other frequently
> >>> used options.
> >>>
> >>> Remove `getDescription()` and `getDetailedDescription()`. They were a
> >>> mistake for CatalogTable and should actually be deprecated. They got
> >>> replaced by `getComment()` which is sufficient.
> >>>
> >>> 2) CREATE TEMPORARY MODEL is not supported.
> >>>
> >>> This is an unnecessary restriction. We should support temporary
> versions
> >>> of these catalog objects as well for consistency. Adding support for
> >>> this should be straightforward.
> >>>
> >>> 3) DESCRIBE | DESC } MODEL [catalog_name.][database_name.]model_name
> >>>
> >>> I would suggest we support `SHOW CREATE MODEL` instead. Similar to
> `SHOW
> >>> CREATE TABLE`, this should show all properties. If we support `DESCRIBE
> >>> MODEL` it should only list the input parameters similar to `DESCRIBE
> >>> TABLE` only shows the columns (not the WITH clause).
> >>>
> >>> Regards,
> >>> Timo
> >>>
> >>>
> >>> On 23.03.24 13:17, Ahmed Hamdy wrote:
> >>>> Hi everyone,
> >>>> +1 for this proposal, I believe it is very useful to the minimum, It
> >>> would
> >>>> be great even having  "ML_PREDICT" and "ML_EVALUATE" as built-in PTFs
> in
> >>>> this FLIP as discussed.
> >>>> IIUC this will be included in the 1.20 roadmap?
> >>>> Best Regards
> >>>> Ahmed Hamdy
> >>>>
> >>>>
> >>>> On Fri, 22 Mar 2024 at 23:54, Hao Li <h...@confluent.io.invalid>
> wrote:
> >>>>
> >>>>> Hi Timo and Jark,
> >>>>>
> >>>>> I agree Oracle's syntax seems concise and more descriptive. For the
> >>>>> built-in `ML_PREDICT` and `ML_EVALUATE` functions I agree with Jark
> we
> >>> can
> >>>>> support them as built-in PTF using `SqlTableFunction` for this FLIP.
> >>> We can
> >>>>> have a different FLIP discussing user defined PTF and adopt that
> later
> >>> for
> >>>>> model functions later. To summarize, the current proposed syntax is
> >>>>>
> >>>>> SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
> >>>>> `classifier_model`, f1, f2))
> >>>>>
> >>>>> SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`,
> `classifier_model`,
> >>> f1,
> >>>>> f2))
> >>>>>
> >>>>> Is `DESCRIPTOR` a must in the syntax? If so, it becomes
> >>>>>
> >>>>> SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
> >>>>> `classifier_model`, DESCRIPTOR(f1), DESCRIPTOR(f2)))
> >>>>>
> >>>>> SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`,
> `classifier_model`,
> >>>>> DESCRIPTOR(f1), DESCRIPTOR(f2)))
> >>>>>
> >>>>> If Calcite supports dropping outer table keyword, it becomes
> >>>>>
> >>>>> SELECT f1, f2, label FROM ML_PREDICT(TABLE `my_data`,
> >>> `classifier_model`,
> >>>>> DESCRIPTOR(f1), DESCRIPTOR(f2))
> >>>>>
> >>>>> SELECT * FROM ML_EVALUATE(TABLE `eval_data`, `classifier_model`,
> >>>>> DESCRIPTOR(
> >>>>> f1), DESCRIPTOR(f2))
> >>>>>
> >>>>> Thanks,
> >>>>> Hao
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Mar 22, 2024 at 9:16 AM Jark Wu <imj...@gmail.com> wrote:
> >>>>>
> >>>>>> Sorry, I mean we can bump the Calcite version if needed in Flink
> 1.20.
> >>>>>>
> >>>>>> On Fri, 22 Mar 2024 at 22:19, Jark Wu <imj...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi Timo,
> >>>>>>>
> >>>>>>> Introducing user-defined PTF is very useful in Flink, I'm +1 for
> >>> this.
> >>>>>>> But I think the ML model FLIP is not blocked by this, because we
> >>>>>>> can introduce ML_PREDICT and ML_EVALUATE as built-in PTFs
> >>>>>>> just like TUMBLE/HOP. And support user-defined ML functions as
> >>>>>>> a future FLIP.
> >>>>>>>
> >>>>>>> Regarding the simplified PTF syntax which reduces the outer TABLE()
> >>>>>>> keyword,
> >>>>>>> it seems it was just supported[1] by the Calcite community last
> month
> >>>>> and
> >>>>>>> will be
> >>>>>>> released in the next version (v1.37). The Calcite community is
> >>>>> preparing
> >>>>>>> the
> >>>>>>> 1.37 release, so we can bump the version if needed in Flink 1.19.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Jark
> >>>>>>>
> >>>>>>> [1]: https://issues.apache.org/jira/browse/CALCITE-6254
> >>>>>>>
> >>>>>>> On Fri, 22 Mar 2024 at 21:46, Timo Walther <twal...@apache.org>
> >>> wrote:
> >>>>>>>
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> this is a very important change to the Flink SQL syntax but we
> can't
> >>>>>>>> wait until the SQL standard is ready for this. So I'm +1 on
> >>>>> introducing
> >>>>>>>> the MODEL concept as a first class citizen in Flink.
> >>>>>>>>
> >>>>>>>> For your information: Over the past months I have already spent a
> >>>>>>>> significant amount of time thinking about how we can introduce
> PTFs
> >>> in
> >>>>>>>> Flink. I reserved FLIP-440[1] for this purpose and I will share a
> >>>>>>>> version of this in the next 1-2 weeks.
> >>>>>>>>
> >>>>>>>> For a good implementation of FLIP-440 and also FLIP-437, we should
> >>>>>>>> evolve the PTF syntax in collaboration with Apache Calcite.
> >>>>>>>>
> >>>>>>>> There are different syntax versions out there:
> >>>>>>>>
> >>>>>>>> 1) Flink
> >>>>>>>>
> >>>>>>>> SELECT * FROM
> >>>>>>>>      TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10'
> >>>>> MINUTES));
> >>>>>>>>
> >>>>>>>> 2) SQL standard
> >>>>>>>>
> >>>>>>>> SELECT * FROM
> >>>>>>>>      TABLE(TUMBLE(TABLE(Bid), DESCRIPTOR(bidtime), INTERVAL '10'
> >>>>>> MINUTES));
> >>>>>>>>
> >>>>>>>> 3) Oracle
> >>>>>>>>
> >>>>>>>> SELECT * FROM
> >>>>>>>>       TUMBLE(Bid, COLUMNS(bidtime), INTERVAL '10' MINUTES));
> >>>>>>>>
> >>>>>>>> As you can see above, Flink does not follow the standard correctly
> >>> as
> >>>>> it
> >>>>>>>> would need to use `TABLE()` but this is not provided by Calcite
> yet.
> >>>>>>>>
> >>>>>>>> I really like the Oracle syntax[2][3] a lot. It reduces necessary
> >>>>>>>> keywords to a minimum. Personally, I would like to discuss this
> >>> syntax
> >>>>>>>> in a separate FLIP and hope I will find supporters for:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> SELECT * FROM
> >>>>>>>>      TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10'
> MINUTES);
> >>>>>>>>
> >>>>>>>> If we go entirely with the Oracle syntax, as you can see in the
> >>>>> example,
> >>>>>>>> Oracle allows for passing identifiers directly. This would solve
> our
> >>>>>>>> problems for the MODEL as well:
> >>>>>>>>
> >>>>>>>> SELECT f1, f2, label FROM ML_PREDICT(
> >>>>>>>>      data => `my_data`,
> >>>>>>>>      model => `classifier_model`,
> >>>>>>>>      input => DESCRIPTOR(f1, f2));
> >>>>>>>>
> >>>>>>>> Or we completely adopt the Oracle syntax:
> >>>>>>>>
> >>>>>>>> SELECT f1, f2, label FROM ML_PREDICT(
> >>>>>>>>      data => `my_data`,
> >>>>>>>>      model => `classifier_model`,
> >>>>>>>>      input => COLUMNS(f1, f2));
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> What do you think?
> >>>>>>>>
> >>>>>>>> Happy to create a FLIP for just this syntax question and
> collaborate
> >>>>>>>> with the Calcite community on this. Supporting the syntax of
> Oracle
> >>>>>>>> shouldn't be too hard to convince at least as parser parameter.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Timo
> >>>>>>>>
> >>>>>>>> [1]
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-440%3A+User-defined+Polymorphic+Table+Functions
> >>>>>>>> [2]
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_TF.html#GUID-0F66E239-DE77-4C0E-AC76-D5B632AB8072
> >>>>>>>> [3]
> >>>>>>
> https://oracle-base.com/articles/18c/polymorphic-table-functions-18c
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 20.03.24 17:22, Mingge Deng wrote:
> >>>>>>>>> Thanks Jark for all the insightful comments.
> >>>>>>>>>
> >>>>>>>>> We have updated the proposal per our offline discussions:
> >>>>>>>>> 1. Model will be treated as a new relation in FlinkSQL.
> >>>>>>>>> 2. Include the common ML predict and evaluate functions into the
> >>>>> open
> >>>>>>>>> source flink to complete the user journey.
> >>>>>>>>>        And we should be able to extend the calcite
> SqlTableFunction
> >>> to
> >>>>>>>> support
> >>>>>>>>> these two ML functions.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Mingge
> >>>>>>>>>
> >>>>>>>>> On Mon, Mar 18, 2024 at 7:05 PM Jark Wu <imj...@gmail.com>
> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Hao,
> >>>>>>>>>>
> >>>>>>>>>>> I meant how the table name
> >>>>>>>>>> in window TVF gets translated to `SqlCallingBinding`. Probably
> we
> >>>>>> need
> >>>>>>>> to
> >>>>>>>>>> fetch the table definition from the catalog somewhere. Do we
> treat
> >>>>>>>> those
> >>>>>>>>>> window TVF specially in parser/planner so that catalog is looked
> >>> up
> >>>>>>>> when
> >>>>>>>>>> they are seen?
> >>>>>>>>>>
> >>>>>>>>>> The table names are resolved and validated by Calcite
> >>> SqlValidator.
> >>>>>> We
> >>>>>>>>>> don' need to fetch from catalog manually.
> >>>>>>>>>> The specific checking logic of cumulate window happens in
> >>>>>>>>>> SqlCumulateTableFunction.OperandMetadataImpl#checkOperandTypes.
> >>>>>>>>>> The return type of SqlCumulateTableFunction is defined in
> >>>>>>>>>> #getRowTypeInference() method.
> >>>>>>>>>> Both are public interfaces provided by Calcite and it seems it's
> >>>>> not
> >>>>>>>>>> specially handled in parser/planner.
> >>>>>>>>>>
> >>>>>>>>>> I didn't try that, but my gut feeling is that the framework is
> >>>>> ready
> >>>>>> to
> >>>>>>>>>> extend a customized TVF.
> >>>>>>>>>>
> >>>>>>>>>>> For what model is, I'm wondering if it has to be datatype or
> >>>>>> relation.
> >>>>>>>>>> Can
> >>>>>>>>>> it be another kind of citizen parallel to
> >>>>>>>> datatype/relation/function/db?
> >>>>>>>>>> Redshift also supports `show models` operation, so it seems it's
> >>>>>>>> treated
> >>>>>>>>>> specially as well?
> >>>>>>>>>>
> >>>>>>>>>> If it is an entity only used in catalog scope (e.g., show xxx,
> >>>>> create
> >>>>>>>> xxx,
> >>>>>>>>>> drop xxx), it is fine to introduce it.
> >>>>>>>>>> We have introduced such one before, called Module: "load
> module",
> >>>>>> "show
> >>>>>>>>>> modules" [1].
> >>>>>>>>>> But if we want to use Model in TVF parameters, it means it has
> to
> >>>>> be
> >>>>>> a
> >>>>>>>>>> relation or datatype, because
> >>>>>>>>>> that is what it only accepts now.
> >>>>>>>>>>
> >>>>>>>>>> Thanks for sharing the reason of preferring TVF instead of
> >>> Redshift
> >>>>>>>> way. It
> >>>>>>>>>> sounds reasonable to me.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Jark
> >>>>>>>>>>
> >>>>>>>>>>     [1]:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/modules/
> >>>>>>>>>>
> >>>>>>>>>> On Fri, 15 Mar 2024 at 13:41, Hao Li <h...@confluent.io.invalid>
> >>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Jark,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the pointer. Sorry for the confusion: I meant how
> the
> >>>>>> table
> >>>>>>>>>> name
> >>>>>>>>>>> in window TVF gets translated to `SqlCallingBinding`. Probably
> we
> >>>>>>>> need to
> >>>>>>>>>>> fetch the table definition from the catalog somewhere. Do we
> >>> treat
> >>>>>>>> those
> >>>>>>>>>>> window TVF specially in parser/planner so that catalog is
> looked
> >>>>> up
> >>>>>>>> when
> >>>>>>>>>>> they are seen?
> >>>>>>>>>>>
> >>>>>>>>>>> For what model is, I'm wondering if it has to be datatype or
> >>>>>> relation.
> >>>>>>>>>> Can
> >>>>>>>>>>> it be another kind of citizen parallel to
> >>>>>>>> datatype/relation/function/db?
> >>>>>>>>>>> Redshift also supports `show models` operation, so it seems
> it's
> >>>>>>>> treated
> >>>>>>>>>>> specially as well? The reasons I don't like Redshift's syntax
> >>> are:
> >>>>>>>>>>> 1. It's a bit verbose, you need to think of a model name as
> well
> >>>>> as
> >>>>>> a
> >>>>>>>>>>> function name and the function name also needs to be unique.
> >>>>>>>>>>> 2. More importantly, prediction function isn't the only
> function
> >>>>>> that
> >>>>>>>> can
> >>>>>>>>>>> operate on models. There could be a set of inference functions
> >>> [1]
> >>>>>> and
> >>>>>>>>>>> evaluation functions [2] which can operate on models. It's hard
> >>> to
> >>>>>>>>>> specify
> >>>>>>>>>>> all of them in model creation.
> >>>>>>>>>>>
> >>>>>>>>>>> [1]:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-predict
> >>>>>>>>>>> [2]:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Hao
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Mar 14, 2024 at 8:18 PM Jark Wu <imj...@gmail.com>
> >>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Hao,
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Can you send me some pointers
> >>>>>>>>>>>> where the function gets the table information?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Here is the code of cumulate window type checking [1].
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Also is it possible to support <query_stmt> in
> >>>>>>>>>>>> window functions in addiction to table?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes. It is not allowed in TVF.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the syntax links of other systems. The reason I
> >>> prefer
> >>>>>> the
> >>>>>>>>>>>> Redshift way is
> >>>>>>>>>>>> that it avoids introducing Model as a relation or datatype
> >>>>>>>> (referenced
> >>>>>>>>>>> as a
> >>>>>>>>>>>> parameter in TVF).
> >>>>>>>>>>>> Model is not a relation because it can be queried directly
> >>> (e.g.,
> >>>>>>>>>> SELECT
> >>>>>>>>>>> *
> >>>>>>>>>>>> FROM model).
> >>>>>>>>>>>> I'm also confused about making Model as a datatype, because I
> >>>>> don't
> >>>>>>>>>> know
> >>>>>>>>>>>> what class the
> >>>>>>>>>>>> model parameter of the eval method of
> >>>>> TableFunction/ScalarFunction
> >>>>>>>>>> should
> >>>>>>>>>>>> be. By defining
> >>>>>>>>>>>> the function with the model, users can directly invoke the
> >>>>> function
> >>>>>>>>>>> without
> >>>>>>>>>>>> reference to the model name.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Jark
> >>>>>>>>>>>>
> >>>>>>>>>>>> [1]:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://github.com/apache/flink/blob/d6c7eee8243b4fe3e593698f250643534dc79cb5/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/sql/SqlCumulateTableFunction.java#L53
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, 15 Mar 2024 at 02:48, Hao Li <h...@confluent.io.invalid
> >
> >>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Jark,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the pointers. It's very helpful.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1. Looks like `tumble`, `hopping` are keywords in calcite
> >>>>> parser.
> >>>>>>>> And
> >>>>>>>>>>> the
> >>>>>>>>>>>>> syntax `cumulate(Table my_table, ...)` needs to get table
> >>>>>>>> information
> >>>>>>>>>>>> from
> >>>>>>>>>>>>> catalog somewhere for type validation etc. Can you send me
> some
> >>>>>>>>>>> pointers
> >>>>>>>>>>>>> where the function gets the table information?
> >>>>>>>>>>>>> 2. The ideal syntax for model function I think would be
> >>>>>>>>>>> `ML_PREDICT(MODEL
> >>>>>>>>>>>>> <model_name>, {table <table_name> | (query_stmt) })`. I think
> >>>>> with
> >>>>>>>>>>>> special
> >>>>>>>>>>>>> handling of the `ML_PREDICT` function in parser/planner,
> maybe
> >>>>> we
> >>>>>>>> can
> >>>>>>>>>>> do
> >>>>>>>>>>>>> this like window functions. But to support `MODEL` keyword,
> we
> >>>>>> need
> >>>>>>>>>>>> calcite
> >>>>>>>>>>>>> parser change I guess. Also is it possible to support
> >>>>> <query_stmt>
> >>>>>>>> in
> >>>>>>>>>>>>> window functions in addiction to table?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For the redshift syntax, I'm not sure the purpose of defining
> >>>>> the
> >>>>>>>>>>>> function
> >>>>>>>>>>>>> name with the model. Is it to define the function
> input/output
> >>>>>>>>>> schema?
> >>>>>>>>>>> We
> >>>>>>>>>>>>> have the schema in our create model syntax and the
> `ML_PREDICT`
> >>>>>> can
> >>>>>>>>>>>> handle
> >>>>>>>>>>>>> it by getting model definition. I think our syntax is more
> >>>>> concise
> >>>>>>>> to
> >>>>>>>>>>>> have
> >>>>>>>>>>>>> a generic prediction function. I also did some research and
> >>> it's
> >>>>>> the
> >>>>>>>>>>>> syntax
> >>>>>>>>>>>>> used by Databricks `ai_query` [1], Snowflake `predict` [2],
> >>>>>> Azureml
> >>>>>>>>>>>>> `predict` [3].
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [1]:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://docs.databricks.com/en/sql/language-manual/functions/ai_query.html
> >>>>>>>>>>>>> [2]:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_inference.ipynb?_fsi=sksXUwQ0
> >>>>>>>>>>>>> [3]:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://learn.microsoft.com/en-us/sql/machine-learning/tutorials/quickstart-python-train-score-model?view=azuresqldb-mi-current
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Hao
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Mar 13, 2024 at 8:57 PM Jark Wu <imj...@gmail.com>
> >>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Mingge, Hao,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for your replies.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> PTF is actually the ideal approach for model functions, and
> >>> we
> >>>>>> do
> >>>>>>>>>>>> have
> >>>>>>>>>>>>>> the plans to use PTF for
> >>>>>>>>>>>>>> all model functions (including prediction, evaluation etc..)
> >>>>> once
> >>>>>>>>>> the
> >>>>>>>>>>>> PTF
> >>>>>>>>>>>>>> is supported in FlinkSQL
> >>>>>>>>>>>>>> confluent extension.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It sounds that PTF is the ideal way and table function is a
> >>>>>>>>>> temporary
> >>>>>>>>>>>>>> solution which will be dropped in the future.
> >>>>>>>>>>>>>> I'm not sure whether we can implement it using PTF in Flink
> >>>>> SQL.
> >>>>>>>>>> But
> >>>>>>>>>>> we
> >>>>>>>>>>>>>> have implemented window
> >>>>>>>>>>>>>> functions using PTF[1]. And introduced a new window function
> >>>>>>>>>> (called
> >>>>>>>>>>>>>> CUMULATE[2]) in Flink SQL based
> >>>>>>>>>>>>>> on this. I think it might work to use PTF and implement
> model
> >>>>>>>>>>> function
> >>>>>>>>>>>>>> syntax like this:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> SELECT * FROM TABLE(ML_PREDICT(
> >>>>>>>>>>>>>>      TABLE my_table,
> >>>>>>>>>>>>>>      my_model,
> >>>>>>>>>>>>>>      col1,
> >>>>>>>>>>>>>>      col2
> >>>>>>>>>>>>>> ));
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Besides, did you consider following the way of AWS Redshift
> >>>>> which
> >>>>>>>>>>>> defines
> >>>>>>>>>>>>>> model function with the model itself together?
> >>>>>>>>>>>>>> IIUC, a model is a black-box which defines input parameters
> >>> and
> >>>>>>>>>>> output
> >>>>>>>>>>>>>> parameters which can be modeled into functions.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [1]:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/window-tvf/#session
> >>>>>>>>>>>>>> [2]:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function#FLIP145:SupportSQLwindowingtablevaluedfunction-CumulatingWindows
> >>>>>>>>>>>>>> [3]:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://github.com/aws-samples/amazon-redshift-ml-getting-started/blob/main/use-cases/bring-your-own-model-remote-inference/README.md#create-model
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, 13 Mar 2024 at 15:00, Hao Li
> <h...@confluent.io.invalid
> >>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Jark,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for your questions. These are good questions!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1. The polymorphism table function I was referring to
> takes a
> >>>>>>>>>> table
> >>>>>>>>>>>> as
> >>>>>>>>>>>>>>> input and outputs a table. So the syntax would be like
> >>>>>>>>>>>>>>> ```
> >>>>>>>>>>>>>>> SELECT * FROM ML_PREDICT('model', (SELECT * FROM my_table))
> >>>>>>>>>>>>>>> ```
> >>>>>>>>>>>>>>> As far as I know, this is not supported yet on Flink. So
> >>>>> before
> >>>>>>>>>>> it's
> >>>>>>>>>>>>>>> supported, one option for the predict function is using
> table
> >>>>>>>>>>>> function
> >>>>>>>>>>>>>>> which can output multiple columns
> >>>>>>>>>>>>>>> ```
> >>>>>>>>>>>>>>> SELECT * FROM my_table, LATERAL VIEW (ML_PREDICT('model',
> >>>>> col1,
> >>>>>>>>>>>> col2))
> >>>>>>>>>>>>>>> ```
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2. Good question. Type inference is hard for the
> `ML_PREDICT`
> >>>>>>>>>>>> function
> >>>>>>>>>>>>>>> because it takes a model name string as input. I can think
> of
> >>>>>>>>>> three
> >>>>>>>>>>>>> ways
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>> doing type inference for it.
> >>>>>>>>>>>>>>>       1). Treat `ML_PREDICT` function as something special
> and
> >>>>>>>>>> during
> >>>>>>>>>>>> sql
> >>>>>>>>>>>>>>> parsing or planning time, if it's encountered, we need to
> >>> look
> >>>>>> up
> >>>>>>>>>>> the
> >>>>>>>>>>>>>> model
> >>>>>>>>>>>>>>> from the first argument which is a model name from catalog.
> >>>>> Then
> >>>>>>>>>> we
> >>>>>>>>>>>> can
> >>>>>>>>>>>>>>> infer the input/output for the function.
> >>>>>>>>>>>>>>>       2). We can define a `model` keyword and use that in
> the
> >>>>>>>>>> predict
> >>>>>>>>>>>>>> function
> >>>>>>>>>>>>>>> to indicate the argument refers to a model. So it's like
> >>>>>>>>>>>>>> `ML_PREDICT(model
> >>>>>>>>>>>>>>> 'my_model', col1, col2))`
> >>>>>>>>>>>>>>>       3). We can create a special type of table function
> maybe
> >>>>>>>>>> called
> >>>>>>>>>>>>>>> `ModelFunction` which can resolve the model type inference
> by
> >>>>>>>>>>> special
> >>>>>>>>>>>>>>> handling it during parsing or planning time.
> >>>>>>>>>>>>>>> 1) is hacky, 2) isn't supported in Flink for function, 3)
> >>>>> might
> >>>>>>>>>> be
> >>>>>>>>>>> a
> >>>>>>>>>>>>>>> good option.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 3. I sketched the `ML_PREDICT` function for inference. But
> >>>>> there
> >>>>>>>>>>> are
> >>>>>>>>>>>>>>> limitations of the function mentioned in 1 and 2. So maybe
> we
> >>>>>>>>>> don't
> >>>>>>>>>>>>> need
> >>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>> introduce them as built-in functions until polymorphism
> table
> >>>>>>>>>>>> function
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>> we can properly deal with type inference.
> >>>>>>>>>>>>>>> After that, defining a user-defined model function should
> >>> also
> >>>>>> be
> >>>>>>>>>>>>>>> straightforward.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 4. For model types, do you mean 'remote', 'import',
> 'native'
> >>>>>>>>>> models
> >>>>>>>>>>>> or
> >>>>>>>>>>>>>>> other things?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 5. We could support popular providers such as 'azureml',
> >>>>>>>>>>> 'vertexai',
> >>>>>>>>>>>>>>> 'googleai' as long as we support the `ML_PREDICT` function.
> >>>>>> Users
> >>>>>>>>>>>>> should
> >>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>> able to implement 3rd-party providers if they can
> implement a
> >>>>>>>>>>>> function
> >>>>>>>>>>>>>>> handling the input/output for the provider.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think for the model functions, there are still
> dependencies
> >>>>> or
> >>>>>>>>>>>> hacks
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>> need to sort out as a built-in function. Maybe we can
> >>> separate
> >>>>>>>>>> that
> >>>>>>>>>>>> as
> >>>>>>>>>>>>> a
> >>>>>>>>>>>>>>> follow up if we want to have it built-in and focus on the
> >>>>> model
> >>>>>>>>>>>> syntax
> >>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>> this FLIP?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>> Hao
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Mar 12, 2024 at 10:33 PM Jark Wu <imj...@gmail.com
> >
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Minge, Chris, Hao,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for proposing this interesting idea. I think this
> is
> >>> a
> >>>>>>>>>>> nice
> >>>>>>>>>>>>> step
> >>>>>>>>>>>>>>>> towards
> >>>>>>>>>>>>>>>> the AI world for Apache Flink. I don't know much about
> >>> AI/ML,
> >>>>>>>>>> so
> >>>>>>>>>>> I
> >>>>>>>>>>>>> may
> >>>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>>> some stupid questions.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 1. Could you tell more about why polymorphism table
> function
> >>>>>>>>>>> (PTF)
> >>>>>>>>>>>>>>> doesn't
> >>>>>>>>>>>>>>>> work and do we have plan to use PTF as model functions?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 2. What kind of object does the model map to in SQL? A
> >>>>> relation
> >>>>>>>>>>> or
> >>>>>>>>>>>> a
> >>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>> type?
> >>>>>>>>>>>>>>>> It looks like a data type because we use it as a parameter
> >>> of
> >>>>>>>>>> the
> >>>>>>>>>>>>> table
> >>>>>>>>>>>>>>>> function.
> >>>>>>>>>>>>>>>> If it is a data type, how does it cooperate with type
> >>>>>>>>>>> inference[1]?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 3. What built-in model functions will we support? How to
> >>>>>>>>>> define a
> >>>>>>>>>>>>>>>> user-defined model function?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 4. What built-in model types will we support? How to
> define
> >>> a
> >>>>>>>>>>>>>>> user-defined
> >>>>>>>>>>>>>>>> model type?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 5. Regarding the remote model, what providers will we
> >>>>> support?
> >>>>>>>>>>> Can
> >>>>>>>>>>>>>> users
> >>>>>>>>>>>>>>>> implement
> >>>>>>>>>>>>>>>> 3rd-party providers except OpenAI?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> [1]:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#type-inference
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, 13 Mar 2024 at 05:55, Hao Li
> >>>>> <h...@confluent.io.invalid
> >>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi, Dev
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Mingge, Chris and I would like to start a discussion
> about
> >>>>>>>>>>>>> FLIP-437:
> >>>>>>>>>>>>>>>>> Support ML Models in Flink SQL.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> This FLIP is proposing to support machine learning models
> >>> in
> >>>>>>>>>>>> Flink
> >>>>>>>>>>>>>> SQL
> >>>>>>>>>>>>>>>>> syntax so that users can CRUD models with Flink SQL and
> use
> >>>>>>>>>>>> models
> >>>>>>>>>>>>> on
> >>>>>>>>>>>>>>>> Flink
> >>>>>>>>>>>>>>>>> to do prediction with Flink data. The FLIP also proposes
> >>> new
> >>>>>>>>>>>> model
> >>>>>>>>>>>>>>>> entities
> >>>>>>>>>>>>>>>>> and changes to catalog interface to support model CRUD
> >>>>>>>>>>> operations
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>> catalog.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> For more details, see FLIP-437 [1]. Looking forward to
> your
> >>>>>>>>>>>>> feedback.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-437%3A+Support+ML+Models+in+Flink+SQL
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>> Minge, Chris & Hao
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >
>
>

Reply via email to