Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

Jark Wu Thu, 28 Mar 2024 08:31:30 -0700

Thanks, Hao,

Sounds good to me.


Best,
Jark
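
For anyone skimming the thread: the identifier handling agreed on in the
quoted discussion (identifier arguments are expanded with catalog and database
if necessary, then looked up and checked to be a model or a table) could look
roughly like the Python sketch below. This is only an illustration under
stated assumptions, not Flink code: `expand_identifier`, `resolve`, the
in-memory `CATALOG` dict and the default catalog/database names are all made
up for the example, and splitting on "." ignores backtick-quoted identifiers
that contain dots.

```python
from typing import Dict, Tuple

ObjectPath = Tuple[str, str, str]  # (catalog, database, object name)

# Hypothetical stand-in for a catalog: maps a full path to the object kind.
CATALOG: Dict[ObjectPath, str] = {
    ("my_cat", "my_db", "my_data"): "table",
    ("my_cat", "my_db", "classifier_model"): "model",
}


def expand_identifier(identifier: str, current_catalog: str,
                      current_database: str) -> ObjectPath:
    """Expand a 1-, 2-, or 3-part identifier into (catalog, database, name)."""
    # Simplification: assumes no dots inside backtick-quoted parts.
    parts = [part.strip().strip("`") for part in identifier.split(".")]
    if len(parts) == 1:
        return (current_catalog, current_database, parts[0])
    if len(parts) == 2:
        return (current_catalog, parts[0], parts[1])
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    raise ValueError(f"Too many parts in identifier: {identifier!r}")


def resolve(identifier: str, expected_kind: str,
            current_catalog: str = "my_cat",
            current_database: str = "my_db") -> ObjectPath:
    """Look up the expanded identifier and check it is a table or a model."""
    path = expand_identifier(identifier, current_catalog, current_database)
    kind = CATALOG.get(path)
    if kind is None:
        raise LookupError(f"Object not found: {'.'.join(path)}")
    if kind != expected_kind:
        raise TypeError(f"{'.'.join(path)} is a {kind}, expected a {expected_kind}")
    return path


if __name__ == "__main__":
    # Mirrors the arguments of the ML_PREDICT call quoted in the thread.
    print(resolve("`my_data`", "table"))
    print(resolve("`my_cat`.`my_db`.`classifier_model`", "model"))
```

The kind check is the part that stands in for "the ML_ functions will make
sure that the arguments are of correct type"; passing a table where a model is
expected fails with an error in this sketch instead of being silently treated
as a column.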

On Thu, 28 Mar 2024 at 01:02, Hao Li <[email protected]> wrote:

> Hi Jark,
>
> I think we can start with supporting popular model providers such as
> openai, azureml, sagemaker for remote models.
>
> Thanks,
> Hao
>
> On Tue, Mar 26, 2024 at 8:15 PM Jark Wu <[email protected]> wrote:
>
> > Thanks for the PoC and updating,
> >
> > The final syntax looks good to me; at least it is a nice and concise first
> > step.
> >
> > SELECT f1, f2, label FROM
> >    ML_PREDICT(
> >      input => `my_data`,
> >      model => `my_cat`.`my_db`.`classifier_model`,
> >      args => DESCRIPTOR(f1, f2));
> >
> > Besides, what built-in models will we support in the FLIP? This might be
> > important because it relates to what use cases can run with the new Flink
> > version out of the box.
> >
> > Best,
> > Jark
> >
> > On Wed, 27 Mar 2024 at 01:10, Hao Li <[email protected]> wrote:
> >
> > > Hi Timo,
> > >
> > > Yeah. `primary key` and `from table(...)` are explicitly matched in the
> > > parser [1].
> > >
> > > > SELECT f1, f2, label FROM
> > >    ML_PREDICT(
> > >      input => `my_data`,
> > >      model => `my_cat`.`my_db`.`classifier_model`,
> > >      args => DESCRIPTOR(f1, f2));
> > >
> > > This named argument syntax looks good to me. It can be supported together
> > > with
> > >
> > > SELECT f1, f2, label FROM ML_PREDICT(`my_data`,
> > > `my_cat`.`my_db`.`classifier_model`,DESCRIPTOR(f1, f2));
> > >
> > > Sure. I will let you know once I have updated the FLIP.
> > >
> > > [1]
> > > https://github.com/confluentinc/flink/blob/release-1.18-confluent/flink-table/flink-sql-parser/src/main/codegen/includes/parserImpls.ftl#L814
> > >
> > > Thanks,
> > > Hao
> > >
> > > On Tue, Mar 26, 2024 at 4:15 AM Timo Walther <[email protected]> wrote:
> > >
> > > > Hi Hao,
> > > >
> > > >  > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` don't
> > > >  > work since `TABLE` and `MODEL` are already keywords
> > > >
> > > > This argument doesn't hold. The parser supports introducing keywords
> > > > that remain non-reserved. For example, this enables using "key" both in
> > > > PRIMARY KEY and as a column name:
> > > >
> > > > CREATE TABLE t (i INT PRIMARY KEY NOT ENFORCED)
> > > > WITH ('connector' = 'datagen');
> > > >
> > > > SELECT i AS key FROM t;
> > > >
> > > > I'm sure we will introduce `TABLE(my_data)` eventually as this is what
> > > > the standard dictates. But for now, let's use the most compact syntax
> > > > possible which is also in sync with Oracle.
> > > >
> > > > TLDR: We allow identifiers as arguments for PTFs, which are expanded
> > > > with catalog and database if necessary. Those identifier arguments
> > > > translate to catalog lookups for tables and models. The ML_ functions
> > > > will make sure that the arguments are of the correct type (model or
> > > > table).
> > > >
> > > > SELECT f1, f2, label FROM
> > > >    ML_PREDICT(
> > > >      input => `my_data`,
> > > >      model => `my_cat`.`my_db`.`classifier_model`,
> > > >      args => DESCRIPTOR(f1, f2));
> > > >
> > > > So this will allow us to also use in the future:
> > > >
> > > > SELECT * FROM poly_func(table1);
> > > >
> > > > Same support as Oracle [1]. Very concise.
> > > >
> > > > Let me know when you have updated the FLIP for a final review before
> > > > voting.
> > > >
> > > > Do others have additional objections?
> > > >
> > > > Regards,
> > > > Timo
> > > >
> > > > [1]
> > > > https://livesql.oracle.com/apex/livesql/file/content_HQK7TYEO0NHSJCDY3LN2ERDV6.html
> > > >
> > > >
> > > >
> > > > On 25.03.24 23:40, Hao Li wrote:
> > > > > Hi Timo,
> > > > >
> > > > >> Please double check if this is implementable with the current stack.
> > > > >> I fear the parser or validator might not like the "identifier"
> > > > >> argument?
> > > > >
> > > > > I checked this. Currently the validator throws an exception when trying
> > > > > to get the fully qualified name for `classifier_model`. But since
> > > > > `SqlValidatorImpl` is implemented in Flink, we should be able to fix
> > > > > this. The only caveat is that if the full model path is not provided,
> > > > > the qualifier is interpreted as a column. We should be able to handle
> > > > > this specially by rewriting the `ml_predict` function to add the
> > > > > catalog and database name in `FlinkCalciteSqlValidator` though.
> > > > >
> > > > >> SELECT f1, f2, label FROM
> > > > >     ML_PREDICT(
> > > > >       TABLE `my_data`,
> > > > >       my_cat.my_db.classifier_model,
> > > > >       DESCRIPTOR(f1, f2))
> > > > >
> > > > > SELECT f1, f2, label FROM
> > > > >     ML_PREDICT(
> > > > >       input => TABLE `my_data`,
> > > > >       model => my_cat.my_db.classifier_model,
> > > > >       args => DESCRIPTOR(f1, f2))
> > > > >
> > > > > I verified these can be parsed. The problem is in the validator's
> > > > > handling of the qualifier, as mentioned above.
> > > > >
> > > > >> So the safest option would be the long-term solution:
> > > > >
> > > > > SELECT f1, f2, label FROM
> > > > >     ML_PREDICT(
> > > > >       input => TABLE(my_data),
> > > > >       model => MODEL(my_cat.my_db.classifier_model),
> > > > >       args => DESCRIPTOR(f1, f2))
> > > > >
> > > > > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` don't work
> > > > > since `TABLE` and `MODEL` are already keywords in Calcite, used by
> > > > > `CREATE TABLE` and `CREATE MODEL`. Changing to `model_name(...)` works
> > > > > and will be treated as a function.
> > > > >
> > > > > So I think
> > > > >
> > > > > SELECT f1, f2, label FROM
> > > > >     ML_PREDICT(
> > > > >       input => TABLE `my_data`,
> > > > >       model => my_cat.my_db.classifier_model,
> > > > >       args => DESCRIPTOR(f1, f2))
> > > > > should be fine for now.
> > > > >
> > > > > For the syntax part:
> > > > > 1). Sounds good. We can drop model task and model kind from the
> > > > > definition. They can be deduced from the options.
> > > > >
> > > > > 2). Sure. We can add temporary models.
> > > > >
> > > > > 3). Makes sense. We can use `show create model <name>` to display all
> > > > > information and `describe model <name>` to show the input/output schema.
> > > > > Thanks,
> > > > > Hao
> > > > >
> > > > > On Mon, Mar 25, 2024 at 3:21 PM Hao Li <[email protected]> wrote:
> > > > >
> > > > >> Hi Ahmed,
> > > > >>
> > > > >> Looks like the feature freeze for the 1.20 release is June 15th. We
> > > > >> can definitely get the model DDL into 1.20. For the predict and
> > > > >> evaluate functions, if we can't get them into the 1.20 release, we can
> > > > >> get them into the 1.21 release for sure.
> > > > >>
> > > > >> Thanks,
> > > > >> Hao
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Mon, Mar 25, 2024 at 1:25 AM Timo Walther <[email protected]> wrote:
> > > > >>
> > > > >>> Hi Jark and Hao,
> > > > >>>
> > > > >>> thanks for the information, Jark! Great that the Calcite community
> > > > >>> already fixed the problem for us. +1 to adopt the simplified syntax
> > > > >>> asap. Maybe even before we upgrade Calcite (i.e. copy over classes),
> > > > >>> if upgrading Calcite is too much work right now?
> > > > >>>
> > > > >>>   > Is `DESCRIPTOR` a must in the syntax?
> > > > >>>
> > > > >>> Yes, we should still stick to the standard as much as possible, and
> > > > >>> all vendors use DESCRIPTOR/COLUMNS for distinguishing columns vs.
> > > > >>> literal arguments. So the final syntax of this discussion would be:
> > > > >>>
> > > > >>>
> > > > >>> SELECT f1, f2, label FROM
> > > > >>>     ML_PREDICT(TABLE `my_data`, `classifier_model`, DESCRIPTOR(f1, f2))
> > > > >>>
> > > > >>> SELECT * FROM
> > > > >>>     ML_EVALUATE(TABLE `eval_data`, `classifier_model`, DESCRIPTOR(f1, f2))
> > > > >>>
> > > > >>> Please double check if this is implementable with the current stack.
> > > > >>> I fear the parser or validator might not like the "identifier"
> > > > >>> argument?
> > > > >>>
> > > > >>> Make sure that also these variations are supported:
> > > > >>>
> > > > >>> SELECT f1, f2, label FROM
> > > > >>>     ML_PREDICT(
> > > > >>>       TABLE `my_data`,
> > > > >>>       my_cat.my_db.classifier_model,
> > > > >>>       DESCRIPTOR(f1, f2))
> > > > >>>
> > > > >>> SELECT f1, f2, label FROM
> > > > >>>     ML_PREDICT(
> > > > >>>       input => TABLE `my_data`,
> > > > >>>       model => my_cat.my_db.classifier_model,
> > > > >>>       args => DESCRIPTOR(f1, f2))
> > > > >>>
> > > > >>> It might be safer and more future proof to wrap a MODEL() function
> > > > >>> around it. This would be more in sync with the standard, which
> > > > >>> actually still requires putting a TABLE() around the input argument:
> > > > >>>
> > > > >>> ML_PREDICT(TABLE(`my_data`) PARTITIONED BY c1 ORDERED BY c1, ....)
> > > > >>>
> > > > >>> So the safest option would be the long-term solution:
> > > > >>>
> > > > >>> SELECT f1, f2, label FROM
> > > > >>>     ML_PREDICT(
> > > > >>>       input => TABLE(my_data),
> > > > >>>       model => MODEL(my_cat.my_db.classifier_model),
> > > > >>>       args => DESCRIPTOR(f1, f2))
> > > > >>>
> > > > >>> But I'm fine with this if others have a strong opinion:
> > > > >>>
> > > > >>> SELECT f1, f2, label FROM
> > > > >>>     ML_PREDICT(
> > > > >>>       input => TABLE `my_data`,
> > > > >>>       model => my_cat.my_db.classifier_model,
> > > > >>>       args => DESCRIPTOR(f1, f2))
> > > > >>>
> > > > >>> Some feedback for the remainder of the FLIP:
> > > > >>>
> > > > >>> 1) Simplify catalog objects
> > > > >>>
> > > > >>> I would suggest to drop:
> > > > >>> CatalogModel.getModelKind()
> > > > >>> CatalogModel.getModelTask()
> > > > >>>
> > > > >>> A catalog object should fully resemble the DDL. And since the DDL puts
> > > > >>> those properties in the WITH clause, the catalog object should do the
> > > > >>> same (i.e. put them into `getModelOptions()`). Btw renaming this
> > > > >>> method to just `getOptions()` for consistency would be good as well.
> > > > >>> Internally, we can still provide enums for these frequently used
> > > > >>> classes, similar to what we do in `FactoryUtil` for other frequently
> > > > >>> used options.
> > > > >>>
> > > > >>> Remove `getDescription()` and `getDetailedDescription()`. They were a
> > > > >>> mistake for CatalogTable and should actually be deprecated. They got
> > > > >>> replaced by `getComment()` which is sufficient.
> > > > >>>
> > > > >>> 2) CREATE TEMPORARY MODEL is not supported.
> > > > >>>
> > > > >>> This is an unnecessary restriction. We should support temporary
> > > > >>> versions of these catalog objects as well for consistency. Adding
> > > > >>> support for this should be straightforward.
> > > > >>>
> > > > >>> 3) { DESCRIBE | DESC } MODEL [catalog_name.][database_name.]model_name
> > > > >>>
> > > > >>> I would suggest we support `SHOW CREATE MODEL` instead. Similar to
> > > > >>> `SHOW CREATE TABLE`, this should show all properties. If we support
> > > > >>> `DESCRIBE MODEL`, it should only list the input parameters, similar to
> > > > >>> how `DESCRIBE TABLE` only shows the columns (not the WITH clause).
> > > > >>>
> > > > >>> Regards,
> > > > >>> Timo
> > > > >>>
> > > > >>>
> > > > >>> On 23.03.24 13:17, Ahmed Hamdy wrote:
> > > > >>>> Hi everyone,
> > > > >>>> +1 for this proposal. I believe it is very useful at the minimum. It
> > > > >>>> would be great to even have "ML_PREDICT" and "ML_EVALUATE" as built-in
> > > > >>>> PTFs in this FLIP as discussed.
> > > > >>>> IIUC this will be included in the 1.20 roadmap?
> > > > >>>> Best Regards
> > > > >>>> Ahmed Hamdy
> > > > >>>>
> > > > >>>>
> > > > >>>> On Fri, 22 Mar 2024 at 23:54, Hao Li <[email protected]> wrote:
> > > > >>>>
> > > > >>>>> Hi Timo and Jark,
> > > > >>>>>
> > > > >>>>> I agree Oracle's syntax seems concise and more descriptive. For the
> > > > >>>>> built-in `ML_PREDICT` and `ML_EVALUATE` functions, I agree with Jark
> > > > >>>>> that we can support them as built-in PTFs using `SqlTableFunction`
> > > > >>>>> for this FLIP. We can have a different FLIP discussing user-defined
> > > > >>>>> PTFs and adopt that later for model functions. To summarize, the
> > > > >>>>> current proposed syntax is
> > > > >>>>>
> > > > >>>>> SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
> > > > >>>>> `classifier_model`, f1, f2))
> > > > >>>>>
> > > > >>>>> SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`, `classifier_model`,
> > > > >>>>> f1, f2))
> > > > >>>>>
> > > > >>>>> Is `DESCRIPTOR` a must in the syntax? If so, it becomes
> > > > >>>>>
> > > > >>>>> SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
> > > > >>>>> `classifier_model`, DESCRIPTOR(f1), DESCRIPTOR(f2)))
> > > > >>>>>
> > > > >>>>> SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`, `classifier_model`,
> > > > >>>>> DESCRIPTOR(f1), DESCRIPTOR(f2)))
> > > > >>>>>
> > > > >>>>> If Calcite supports dropping the outer TABLE keyword, it becomes
> > > > >>>>>
> > > > >>>>> SELECT f1, f2, label FROM ML_PREDICT(TABLE `my_data`, `classifier_model`,
> > > > >>>>> DESCRIPTOR(f1), DESCRIPTOR(f2))
> > > > >>>>>
> > > > >>>>> SELECT * FROM ML_EVALUATE(TABLE `eval_data`, `classifier_model`,
> > > > >>>>> DESCRIPTOR(f1), DESCRIPTOR(f2))
> > > > >>>>>
> > > > >>>>> Thanks,
> > > > >>>>> Hao
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Fri, Mar 22, 2024 at 9:16 AM Jark Wu <[email protected]> wrote:
> > > > >>>>>
> > > > >>>>>> Sorry, I mean we can bump the Calcite version if needed in Flink 1.20.
> > > > >>>>>>
> > > > >>>>>> On Fri, 22 Mar 2024 at 22:19, Jark Wu <[email protected]> wrote:
> > > > >>>>>>
> > > > >>>>>>> Hi Timo,
> > > > >>>>>>>
> > > > >>>>>>> Introducing user-defined PTFs is very useful in Flink; I'm +1 for
> > > > >>>>>>> this. But I think the ML model FLIP is not blocked by this, because
> > > > >>>>>>> we can introduce ML_PREDICT and ML_EVALUATE as built-in PTFs
> > > > >>>>>>> just like TUMBLE/HOP, and support user-defined ML functions as
> > > > >>>>>>> a future FLIP.
> > > > >>>>>>>
> > > > >>>>>>> Regarding the simplified PTF syntax which drops the outer TABLE()
> > > > >>>>>>> keyword, it seems it was just supported[1] by the Calcite community
> > > > >>>>>>> last month and will be released in the next version (v1.37). The
> > > > >>>>>>> Calcite community is preparing the 1.37 release, so we can bump the
> > > > >>>>>>> version if needed in Flink 1.19.
> > > > >>>>>>>
> > > > >>>>>>> Best,
> > > > >>>>>>> Jark
> > > > >>>>>>>
> > > > >>>>>>> [1]: https://issues.apache.org/jira/browse/CALCITE-6254
> > > > >>>>>>>
> > > > >>>>>>> On Fri, 22 Mar 2024 at 21:46, Timo Walther <[email protected]> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hi everyone,
> > > > >>>>>>>>
> > > > >>>>>>>> this is a very important change to the Flink SQL syntax but we
> > > > >>>>>>>> can't wait until the SQL standard is ready for this. So I'm +1 on
> > > > >>>>>>>> introducing the MODEL concept as a first class citizen in Flink.
> > > > >>>>>>>>
> > > > >>>>>>>> For your information: Over the past months I have already spent a
> > > > >>>>>>>> significant amount of time thinking about how we can introduce
> > > > >>>>>>>> PTFs in Flink. I reserved FLIP-440[1] for this purpose and I will
> > > > >>>>>>>> share a version of this in the next 1-2 weeks.
> > > > >>>>>>>>
> > > > >>>>>>>> For a good implementation of FLIP-440 and also FLIP-437, we should
> > > > >>>>>>>> evolve the PTF syntax in collaboration with Apache Calcite.
> > > > >>>>>>>>
> > > > >>>>>>>> There are different syntax versions out there:
> > > > >>>>>>>>
> > > > >>>>>>>> 1) Flink
> > > > >>>>>>>>
> > > > >>>>>>>> SELECT * FROM
> > > > >>>>>>>>      TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
> > > > >>>>>>>>
> > > > >>>>>>>> 2) SQL standard
> > > > >>>>>>>>
> > > > >>>>>>>> SELECT * FROM
> > > > >>>>>>>>      TABLE(TUMBLE(TABLE(Bid), DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
> > > > >>>>>>>>
> > > > >>>>>>>> 3) Oracle
> > > > >>>>>>>>
> > > > >>>>>>>> SELECT * FROM
> > > > >>>>>>>>      TUMBLE(Bid, COLUMNS(bidtime), INTERVAL '10' MINUTES);
> > > > >>>>>>>>
> > > > >>>>>>>> As you can see above, Flink does not follow the standard correctly
> > > > >>>>>>>> as it would need to use `TABLE()` but this is not provided by
> > > > >>>>>>>> Calcite yet.
> > > > >>>>>>>>
> > > > >>>>>>>> I really like the Oracle syntax[2][3] a lot. It reduces necessary
> > > > >>>>>>>> keywords to a minimum. Personally, I would like to discuss this
> > > > >>>>>>>> syntax in a separate FLIP and hope I will find supporters for:
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> SELECT * FROM
> > > > >>>>>>>>      TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES);
> > > > >>>>>>>>
> > > > >>>>>>>> If we go entirely with the Oracle syntax, as you can see in the
> > > > >>>>>>>> example, Oracle allows for passing identifiers directly. This would
> > > > >>>>>>>> solve our problems for the MODEL as well:
> > > > >>>>>>>>
> > > > >>>>>>>> SELECT f1, f2, label FROM ML_PREDICT(
> > > > >>>>>>>>      data => `my_data`,
> > > > >>>>>>>>      model => `classifier_model`,
> > > > >>>>>>>>      input => DESCRIPTOR(f1, f2));
> > > > >>>>>>>>
> > > > >>>>>>>> Or we completely adopt the Oracle syntax:
> > > > >>>>>>>>
> > > > >>>>>>>> SELECT f1, f2, label FROM ML_PREDICT(
> > > > >>>>>>>>      data => `my_data`,
> > > > >>>>>>>>      model => `classifier_model`,
> > > > >>>>>>>>      input => COLUMNS(f1, f2));
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> What do you think?
> > > > >>>>>>>>
> > > > >>>>>>>> Happy to create a FLIP for just this syntax question and
> > > > >>>>>>>> collaborate with the Calcite community on this. It shouldn't be too
> > > > >>>>>>>> hard to convince them to support the Oracle syntax, at least as a
> > > > >>>>>>>> parser parameter.
> > > > >>>>>>>>
> > > > >>>>>>>> Regards,
> > > > >>>>>>>> Timo
> > > > >>>>>>>>
> > > > >>>>>>>> [1]
> > > > >>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-440%3A+User-defined+Polymorphic+Table+Functions
> > > > >>>>>>>> [2]
> > > > >>>>>>>> https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_TF.html#GUID-0F66E239-DE77-4C0E-AC76-D5B632AB8072
> > > > >>>>>>>> [3]
> > > > >>>>>>>> https://oracle-base.com/articles/18c/polymorphic-table-functions-18c
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> On 20.03.24 17:22, Mingge Deng wrote:
> > > > >>>>>>>>> Thanks Jark for all the insightful comments.
> > > > >>>>>>>>>
> > > > >>>>>>>>> We have updated the proposal per our offline discussions:
> > > > >>>>>>>>> 1. Model will be treated as a new relation in FlinkSQL.
> > > > >>>>>>>>> 2. Include the common ML predict and evaluate functions in open
> > > > >>>>>>>>> source Flink to complete the user journey.
> > > > >>>>>>>>>        And we should be able to extend the Calcite SqlTableFunction
> > > > >>>>>>>>> to support these two ML functions.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Best,
> > > > >>>>>>>>> Mingge
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Mon, Mar 18, 2024 at 7:05 PM Jark Wu <[email protected]> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>>> Hi Hao,
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> I meant how the table name in window TVF gets translated to
> > > > >>>>>>>>>>> `SqlCallBinding`. Probably we need to fetch the table definition
> > > > >>>>>>>>>>> from the catalog somewhere. Do we treat those window TVFs
> > > > >>>>>>>>>>> specially in parser/planner so that the catalog is looked up when
> > > > >>>>>>>>>>> they are seen?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> The table names are resolved and validated by the Calcite
> > > > >>>>>>>>>> SqlValidator. We don't need to fetch from the catalog manually.
> > > > >>>>>>>>>> The specific checking logic of the cumulate window happens in
> > > > >>>>>>>>>> SqlCumulateTableFunction.OperandMetadataImpl#checkOperandTypes.
> > > > >>>>>>>>>> The return type of SqlCumulateTableFunction is defined in the
> > > > >>>>>>>>>> #getRowTypeInference() method.
> > > > >>>>>>>>>> Both are public interfaces provided by Calcite and it seems it's
> > > > >>>>>>>>>> not specially handled in parser/planner.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I didn't try that, but my gut feeling is that the framework is
> > > > >>>>>>>>>> ready to extend a customized TVF.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> For what model is, I'm wondering if it has to be datatype or
> > > > >>>>>>>>>>> relation. Can it be another kind of citizen parallel to
> > > > >>>>>>>>>>> datatype/relation/function/db? Redshift also supports `show models`
> > > > >>>>>>>>>>> operation, so it seems it's treated specially as well?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> If it is an entity only used in catalog scope (e.g., show xxx,
> > > > >>>>>>>>>> create xxx, drop xxx), it is fine to introduce it.
> > > > >>>>>>>>>> We have introduced such an entity before, called Module: "load
> > > > >>>>>>>>>> module", "show modules" [1].
> > > > >>>>>>>>>> But if we want to use Model in TVF parameters, it means it has to
> > > > >>>>>>>>>> be a relation or datatype, because those are the only things a TVF
> > > > >>>>>>>>>> accepts now.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Thanks for sharing the reasons for preferring TVF over the Redshift
> > > > >>>>>>>>>> way. It sounds reasonable to me.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Best,
> > > > >>>>>>>>>> Jark
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>     [1]:
> > > > >>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/modules/
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Fri, 15 Mar 2024 at 13:41, Hao Li <[email protected]> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> Hi Jark,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks for the pointer. Sorry for the confusion: I meant how the
> > > > >>>>>>>>>>> table name in window TVF gets translated to `SqlCallBinding`.
> > > > >>>>>>>>>>> Probably we need to fetch the table definition from the catalog
> > > > >>>>>>>>>>> somewhere. Do we treat those window TVFs specially in
> > > > >>>>>>>>>>> parser/planner so that the catalog is looked up when they are seen?
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> For what model is, I'm wondering if it has to be datatype or
> > > > >>>>>>>>>>> relation. Can it be another kind of citizen parallel to
> > > > >>>>>>>>>>> datatype/relation/function/db? Redshift also supports `show models`
> > > > >>>>>>>>>>> operation, so it seems it's treated specially as well? The reasons
> > > > >>>>>>>>>>> I don't like Redshift's syntax are:
> > > > >>>>>>>>>>> 1. It's a bit verbose: you need to think of a model name as well
> > > > >>>>>>>>>>> as a function name, and the function name also needs to be unique.
> > > > >>>>>>>>>>> 2. More importantly, the prediction function isn't the only
> > > > >>>>>>>>>>> function that can operate on models. There could be a set of
> > > > >>>>>>>>>>> inference functions [1] and evaluation functions [2] which can
> > > > >>>>>>>>>>> operate on models. It's hard to specify all of them in model
> > > > >>>>>>>>>>> creation.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> [1]:
> > > > >>>>>>>>>>> https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-predict
> > > > >>>>>>>>>>> [2]:
> > > > >>>>>>>>>>> https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>> Hao
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Thu, Mar 14, 2024 at 8:18 PM Jark Wu <[email protected]> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> Hi Hao,
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Can you send me some pointers
> > > > >>>>>>>>>>>>> where the function gets the table information?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Here is the code of cumulate window type checking [1].
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Also is it possible to support <query_stmt> in
> > > > >>>>>>>>>>>>> window functions in addition to table?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Yes. It is not allowed in TVF.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Thanks for the syntax links of other systems. The reason I prefer
> > > > >>>>>>>>>>>> the Redshift way is that it avoids introducing Model as a relation
> > > > >>>>>>>>>>>> or datatype (referenced as a parameter in TVF).
> > > > >>>>>>>>>>>> Model is not a relation because it can't be queried directly
> > > > >>>>>>>>>>>> (e.g., SELECT * FROM model).
> > > > >>>>>>>>>>>> I'm also confused about making Model a datatype, because I don't
> > > > >>>>>>>>>>>> know what class the model parameter of the eval method of
> > > > >>>>>>>>>>>> TableFunction/ScalarFunction should be. By defining the function
> > > > >>>>>>>>>>>> with the model, users can directly invoke the function without
> > > > >>>>>>>>>>>> reference to the model name.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Best,
> > > > >>>>>>>>>>>> Jark
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> [1]:
> > > > >>>>>>>>>>>> https://github.com/apache/flink/blob/d6c7eee8243b4fe3e593698f250643534dc79cb5/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/sql/SqlCumulateTableFunction.java#L53
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On Fri, 15 Mar 2024 at 02:48, Hao Li <[email protected]> wrote:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Hi Jark,
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Thanks for the pointers. It's very helpful.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> 1. Looks like `tumble`, `hopping` are keywords in the Calcite
> > > > >>>>>>>>>>>>> parser. And the syntax `cumulate(Table my_table, ...)` needs to get
> > > > >>>>>>>>>>>>> table information from the catalog somewhere for type validation
> > > > >>>>>>>>>>>>> etc. Can you send me some pointers where the function gets the
> > > > >>>>>>>>>>>>> table information?
> > > > >>>>>>>>>>>>> 2. The ideal syntax for model function I think would be
> > > > >>>>>>>>>>>>> `ML_PREDICT(MODEL <model_name>, {table <table_name> | (query_stmt) })`.
> > > > >>>>>>>>>>>>> I think with special handling of the `ML_PREDICT` function in
> > > > >>>>>>>>>>>>> parser/planner, maybe we can do this like window functions. But to
> > > > >>>>>>>>>>>>> support the `MODEL` keyword, we need a Calcite parser change I
> > > > >>>>>>>>>>>>> guess. Also is it possible to support <query_stmt> in window
> > > > >>>>>>>>>>>>> functions in addition to table?
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> For the Redshift syntax, I'm not sure about the purpose of defining
> > > > >>>>>>>>>>>>> the function name with the model. Is it to define the function
> > > > >>>>>>>>>>>>> input/output schema? We have the schema in our create model syntax
> > > > >>>>>>>>>>>>> and `ML_PREDICT` can handle it by getting the model definition. I
> > > > >>>>>>>>>>>>> think our syntax is more concise in having a generic prediction
> > > > >>>>>>>>>>>>> function. I also did some research and it's the syntax used by
> > > > >>>>>>>>>>>>> Databricks `ai_query` [1], Snowflake `predict` [2], Azureml
> > > > >>>>>>>>>>>>> `predict` [3].
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> [1]:
> > > > >>>>>>>>>>>>> https://docs.databricks.com/en/sql/language-manual/functions/ai_query.html
> > > > >>>>>>>>>>>>> [2]:
> > > > >>>>>>>>>>>>> https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_inference.ipynb?_fsi=sksXUwQ0
> > > > >>>>>>>>>>>>> [3]:
> > > > >>>>>>>>>>>>> https://learn.microsoft.com/en-us/sql/machine-learning/tutorials/quickstart-python-train-score-model?view=azuresqldb-mi-current
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>> Hao
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On Wed, Mar 13, 2024 at 8:57 PM Jark Wu <[email protected]> wrote:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Hi Mingge, Hao,
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Thanks for your replies.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> PTF is actually the ideal approach for model functions, and we
> > > > >>>>>>>>>>>>>>> do have the plans to use PTF for all model functions (including
> > > > >>>>>>>>>>>>>>> prediction, evaluation etc..) once the PTF is supported in
> > > > >>>>>>>>>>>>>>> FlinkSQL confluent extension.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> It sounds like PTF is the ideal way and table function is a
> > > > >>>>>>>>>>>>>> temporary solution which will be dropped in the future.
> > > > >>>>>>>>>>>>>> I'm not sure whether we can implement it using PTF in Flink SQL.
> > > > >>>>>>>>>>>>>> But we have implemented window functions using PTF[1], and
> > > > >>>>>>>>>>>>>> introduced a new window function (called CUMULATE[2]) in Flink SQL
> > > > >>>>>>>>>>>>>> based on this. I think it might work to use PTF and implement
> > > > >>>>>>>>>>>>>> model function syntax like this:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> SELECT * FROM TABLE(ML_PREDICT(
> > > > >>>>>>>>>>>>>>      TABLE my_table,
> > > > >>>>>>>>>>>>>>      my_model,
> > > > >>>>>>>>>>>>>>      col1,
> > > > >>>>>>>>>>>>>>      col2
> > > > >>>>>>>>>>>>>> ));
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Besides, did you consider following the way of AWS Redshift[3], which
> > > > >>>>>>>>>>>>>> defines the model function together with the model itself?
> > > > >>>>>>>>>>>>>> IIUC, a model is a black box which defines input parameters and output
> > > > >>>>>>>>>>>>>> parameters, which can be modeled as functions.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Best,
> > > > >>>>>>>>>>>>>> Jark
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> [1]: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/window-tvf/#session
> > > > >>>>>>>>>>>>>> [2]: https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function#FLIP145:SupportSQLwindowingtablevaluedfunction-CumulatingWindows
> > > > >>>>>>>>>>>>>> [3]: https://github.com/aws-samples/amazon-redshift-ml-getting-started/blob/main/use-cases/bring-your-own-model-remote-inference/README.md#create-model
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> On Wed, 13 Mar 2024 at 15:00, Hao Li <[email protected]> wrote:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Hi Jark,
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Thanks for your questions. These are good questions!
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> 1. The polymorphic table function I was referring to takes a table as
> > > > >>>>>>>>>>>>>>> input and outputs a table. So the syntax would be like
> > > > >>>>>>>>>>>>>>> ```
> > > > >>>>>>>>>>>>>>> SELECT * FROM ML_PREDICT('model', (SELECT * FROM my_table))
> > > > >>>>>>>>>>>>>>> ```
> > > > >>>>>>>>>>>>>>> As far as I know, this is not supported yet on Flink. So before it's
> > > > >>>>>>>>>>>>>>> supported, one option for the predict function is using a table function
> > > > >>>>>>>>>>>>>>> which can output multiple columns
> > > > >>>>>>>>>>>>>>> ```
> > > > >>>>>>>>>>>>>>> SELECT * FROM my_table, LATERAL VIEW (ML_PREDICT('model', col1, col2))
> > > > >>>>>>>>>>>>>>> ```
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> 2. Good question. Type inference is hard for the `ML_PREDICT` function
> > > > >>>>>>>>>>>>>>> because it takes a model name string as input. I can think of three ways
> > > > >>>>>>>>>>>>>>> of doing type inference for it.
> > > > >>>>>>>>>>>>>>>       1). Treat the `ML_PREDICT` function as something special: during SQL
> > > > >>>>>>>>>>>>>>> parsing or planning time, if it's encountered, we need to look up the
> > > > >>>>>>>>>>>>>>> model in the catalog using the first argument, which is a model name.
> > > > >>>>>>>>>>>>>>> Then we can infer the input/output for the function.
> > > > >>>>>>>>>>>>>>>       2). We can define a `model` keyword and use that in the predict
> > > > >>>>>>>>>>>>>>> function to indicate the argument refers to a model. So it's like
> > > > >>>>>>>>>>>>>>> `ML_PREDICT(model 'my_model', col1, col2)`
> > > > >>>>>>>>>>>>>>>       3). We can create a special type of table function, maybe called
> > > > >>>>>>>>>>>>>>> `ModelFunction`, which can resolve the model type inference by handling
> > > > >>>>>>>>>>>>>>> it specially during parsing or planning time.
> > > > >>>>>>>>>>>>>>> 1) is hacky, 2) isn't supported in Flink for functions, 3) might be a
> > > > >>>>>>>>>>>>>>> good option.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> 3. I sketched the `ML_PREDICT` function for inference. But there are
> > > > >>>>>>>>>>>>>>> limitations of the function mentioned in 1 and 2. So maybe we don't need
> > > > >>>>>>>>>>>>>>> to introduce them as built-in functions until polymorphic table functions
> > > > >>>>>>>>>>>>>>> are supported and we can properly deal with type inference.
> > > > >>>>>>>>>>>>>>> After that, defining a user-defined model function should also be
> > > > >>>>>>>>>>>>>>> straightforward.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> 4. For model types, do you mean 'remote', 'import', 'native' models or
> > > > >>>>>>>>>>>>>>> other things?
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> 5. We could support popular providers such as 'azureml', 'vertexai',
> > > > >>>>>>>>>>>>>>> 'googleai' as long as we support the `ML_PREDICT` function. Users should
> > > > >>>>>>>>>>>>>>> be able to implement 3rd-party providers if they can implement a function
> > > > >>>>>>>>>>>>>>> handling the input/output for the provider.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> I think for the model functions, there are still dependencies or hacks we
> > > > >>>>>>>>>>>>>>> need to sort out as a built-in function. Maybe we can separate that as a
> > > > >>>>>>>>>>>>>>> follow up if we want to have it built-in and focus on the model syntax
> > > > >>>>>>>>>>>>>>> for this FLIP?
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>> Hao
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> On Tue, Mar 12, 2024 at 10:33 PM Jark Wu <[email protected]> wrote:
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Hi Mingge, Chris, Hao,
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Thanks for proposing this interesting idea. I think this is a nice step
> > > > >>>>>>>>>>>>>>>> towards the AI world for Apache Flink. I don't know much about AI/ML, so
> > > > >>>>>>>>>>>>>>>> I may have some stupid questions.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> 1. Could you tell us more about why the polymorphic table function (PTF)
> > > > >>>>>>>>>>>>>>>> doesn't work, and do we have a plan to use PTF as model functions?
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> 2. What kind of object does the model map to in SQL? A relation or a
> > > > >>>>>>>>>>>>>>>> data type?
> > > > >>>>>>>>>>>>>>>> It looks like a data type because we use it as a parameter of the table
> > > > >>>>>>>>>>>>>>>> function.
> > > > >>>>>>>>>>>>>>>> If it is a data type, how does it cooperate with type inference[1]?
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> 3. What built-in model functions will we support? How to define a
> > > > >>>>>>>>>>>>>>>> user-defined model function?
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> 4. What built-in model types will we support? How to define a
> > > > >>>>>>>>>>>>>>>> user-defined model type?
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> 5. Regarding the remote model, what providers will we support? Can users
> > > > >>>>>>>>>>>>>>>> implement 3rd-party providers except OpenAI?
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Best,
> > > > >>>>>>>>>>>>>>>> Jark
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> [1]: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#type-inference
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> On Wed, 13 Mar 2024 at 05:55, Hao Li <[email protected]> wrote:
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Hi, Dev
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Mingge, Chris and I would like to start a discussion about FLIP-437:
> > > > >>>>>>>>>>>>>>>>> Support ML Models in Flink SQL.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> This FLIP is proposing to support machine learning models in Flink SQL
> > > > >>>>>>>>>>>>>>>>> syntax so that users can CRUD models with Flink SQL and use models on
> > > > >>>>>>>>>>>>>>>>> Flink to do prediction with Flink data. The FLIP also proposes new model
> > > > >>>>>>>>>>>>>>>>> entities and changes to catalog interface to support model CRUD
> > > > >>>>>>>>>>>>>>>>> operations in catalog.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> For more details, see FLIP-437 [1]. Looking forward to your feedback.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-437%3A+Support+ML+Models+in+Flink+SQL
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>> Mingge, Chris & Hao
> > > > >>>>>>>>>>>>>>>>>
>
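
The first type-inference option discussed in the thread above (look up the model named by the first argument in the catalog, then derive the function's input/output from it) can be illustrated with a minimal sketch. This is plain Python, not Flink code: `ModelSchema`, `CATALOG`, and `infer_predict_schema` are hypothetical names invented for illustration, and the assumption that the result is the input table's columns followed by the model's output columns is ours, not something the FLIP states.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

Column = Tuple[str, str]  # (column name, SQL type name)


@dataclass(frozen=True)
class ModelSchema:
    """Hypothetical catalog entry: a model's declared input and output columns."""
    inputs: Tuple[Column, ...]
    outputs: Tuple[Column, ...]


# Hypothetical in-memory catalog keyed by model name.
CATALOG: Dict[str, ModelSchema] = {
    "my_model": ModelSchema(
        inputs=(("f1", "INT"), ("f2", "STRING")),
        outputs=(("label", "STRING"),),
    ),
}


def infer_predict_schema(
    catalog: Dict[str, ModelSchema],
    model_name: str,
    table_schema: Sequence[Column],
    arg_columns: Sequence[str],
) -> Tuple[Column, ...]:
    """Resolve the model by name and return the assumed result schema.

    The argument columns are checked positionally against the model's
    declared inputs; the result is the table columns plus the model outputs.
    """
    if model_name not in catalog:
        raise KeyError(f"model not found in catalog: {model_name}")
    model = catalog[model_name]
    table_types = dict(table_schema)
    if len(arg_columns) != len(model.inputs):
        raise TypeError(
            f"expected {len(model.inputs)} argument columns, got {len(arg_columns)}"
        )
    for col, (in_name, in_type) in zip(arg_columns, model.inputs):
        if col not in table_types:
            raise KeyError(f"column not found in input table: {col}")
        if table_types[col] != in_type:
            raise TypeError(
                f"column {col} has type {table_types[col]}, "
                f"but model input {in_name} expects {in_type}"
            )
    return tuple(table_schema) + model.outputs


result = infer_predict_schema(
    CATALOG, "my_model", [("col1", "INT"), ("col2", "STRING")], ["col1", "col2"]
)
print(result)
```

The point of the sketch is only that the output type depends on a catalog lookup rather than on the argument types alone, which is why the thread describes this as needing special handling at parsing or planning time.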
