Thanks Timo. I'll start a vote tomorrow if there is no further discussion.

Thanks,
Hao
On Thu, Mar 28, 2024 at 9:33 AM Timo Walther <twal...@apache.org> wrote:

> Hi everyone,
>
> I updated the FLIP according to this discussion.
>
> @Hao Li: Let me know if I made a mistake somewhere. I added some
> additional explanatory comments about the new PTF syntax.
>
> There are no further objections from my side. If nobody objects, Hao,
> feel free to start the voting tomorrow.
>
> Regards,
> Timo
>
>
> On 28.03.24 16:30, Jark Wu wrote:
> > Thanks, Hao,
> >
> > Sounds good to me.
> >
> > Best,
> > Jark
> >
> > On Thu, 28 Mar 2024 at 01:02, Hao Li <h...@confluent.io.invalid> wrote:
> >
> >> Hi Jark,
> >>
> >> I think we can start with supporting popular model providers such as
> >> openai, azureml, sagemaker for remote models.
> >>
> >> Thanks,
> >> Hao
> >>
> >> On Tue, Mar 26, 2024 at 8:15 PM Jark Wu <imj...@gmail.com> wrote:
> >>
> >>> Thanks for the PoC and updating,
> >>>
> >>> The final syntax looks good to me, at least it is a nice and concise
> >>> first step.
> >>>
> >>> SELECT f1, f2, label FROM
> >>>   ML_PREDICT(
> >>>     input => `my_data`,
> >>>     model => `my_cat`.`my_db`.`classifier_model`,
> >>>     args => DESCRIPTOR(f1, f2));
> >>>
> >>> Besides, what built-in models will we support in the FLIP? This might be
> >>> important because it relates to what use cases can run with the new Flink
> >>> version out of the box.
> >>>
> >>> Best,
> >>> Jark
> >>>
> >>> On Wed, 27 Mar 2024 at 01:10, Hao Li <h...@confluent.io.invalid> wrote:
> >>>
> >>>> Hi Timo,
> >>>>
> >>>> Yeah. For `primary key` and `from table(...)` those are explicitly
> >>>> matched in parser: [1].
> >>>>
> >>>>> SELECT f1, f2, label FROM
> >>>>>   ML_PREDICT(
> >>>>>     input => `my_data`,
> >>>>>     model => `my_cat`.`my_db`.`classifier_model`,
> >>>>>     args => DESCRIPTOR(f1, f2));
> >>>>
> >>>> This named argument syntax looks good to me.
It can be supported > >> together > >>>> with > >>>> > >>>> SELECT f1, f2, label FROM ML_PREDICT(`my_data`, > >>>> `my_cat`.`my_db`.`classifier_model`,DESCRIPTOR(f1, f2)); > >>>> > >>>> Sure. Will let you know once updated the FLIP. > >>>> > >>>> [1] > >>>> > >>>> > >>> > >> > https://github.com/confluentinc/flink/blob/release-1.18-confluent/flink-table/flink-sql-parser/src/main/codegen/includes/parserImpls.ftl#L814 > >>>> > >>>> Thanks, > >>>> Hao > >>>> > >>>> On Tue, Mar 26, 2024 at 4:15 AM Timo Walther <twal...@apache.org> > >> wrote: > >>>> > >>>>> Hi Hao, > >>>>> > >>>>> > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` > >> doesn't > >>>>> > work since `TABLE` and `MODEL` are already key words > >>>>> > >>>>> This argument doesn't count. The parser supports introducing keywords > >>>>> that are still non-reserved. For example, this enables using "key" > >> for > >>>>> both primary key and a column name: > >>>>> > >>>>> CREATE TABLE t (i INT PRIMARY KEY NOT ENFORCED) > >>>>> WITH ('connector' = 'datagen'); > >>>>> > >>>>> SELECT i AS key FROM t; > >>>>> > >>>>> I'm sure we will introduce `TABLE(my_data)` eventually as this is > >> what > >>>>> the standard dictates. But for now, let's use the most compact syntax > >>>>> possible which is also in sync with Oracle. > >>>>> > >>>>> TLDR: We allow identifiers as arguments for PTFs which are expanded > >>> with > >>>>> catalog and database if necessary. Those identifier arguments > >> translate > >>>>> to catalog lookups for table and models. The ML_ functions will make > >>>>> sure that the arguments are of correct type model or table. > >>>>> > >>>>> SELECT f1, f2, label FROM > >>>>> ML_PREDICT( > >>>>> input => `my_data`, > >>>>> model => `my_cat`.`my_db`.`classifier_model`, > >>>>> args => DESCRIPTOR(f1, f2)); > >>>>> > >>>>> So this will allow us to also use in the future: > >>>>> > >>>>> SELECT * FROM poly_func(table1); > >>>>> > >>>>> Same support as Oracle [1]. Very concise. 
> >>>>>
> >>>>> Let me know when you updated the FLIP for a final review before voting.
> >>>>>
> >>>>> Do others have additional objections?
> >>>>>
> >>>>> Regards,
> >>>>> Timo
> >>>>>
> >>>>> [1] https://livesql.oracle.com/apex/livesql/file/content_HQK7TYEO0NHSJCDY3LN2ERDV6.html
> >>>>>
> >>>>>
> >>>>> On 25.03.24 23:40, Hao Li wrote:
> >>>>>> Hi Timo,
> >>>>>>
> >>>>>>> Please double check if this is implementable with the current stack. I
> >>>>>>> fear the parser or validator might not like the "identifier" argument?
> >>>>>>
> >>>>>> I checked this; currently the validator throws an exception trying to get
> >>>>>> the fully qualified name for `classifier_model`. But since
> >>>>>> `SqlValidatorImpl` is implemented in Flink, we should be able to fix this.
> >>>>>> The only caveat is that if the full model path is not provided,
> >>>>>> the qualifier is interpreted as a column. We should be able to special-case
> >>>>>> this by rewriting the `ml_predict` function to add the catalog and
> >>>>>> database name in `FlinkCalciteSqlValidator` though.
> >>>>>>
> >>>>>>> SELECT f1, f2, label FROM
> >>>>>>>   ML_PREDICT(
> >>>>>>>     TABLE `my_data`,
> >>>>>>>     my_cat.my_db.classifier_model,
> >>>>>>>     DESCRIPTOR(f1, f2))
> >>>>>>
> >>>>>> SELECT f1, f2, label FROM
> >>>>>>   ML_PREDICT(
> >>>>>>     input => TABLE `my_data`,
> >>>>>>     model => my_cat.my_db.classifier_model,
> >>>>>>     args => DESCRIPTOR(f1, f2))
> >>>>>>
> >>>>>> I verified these can be parsed. The problem is in the validator for the
> >>>>>> qualifier, as mentioned above.
> >>>>>> > >>>>>>> So the safest option would be the long-term solution: > >>>>>> > >>>>>> SELECT f1, f2, label FROM > >>>>>> ML_PREDICT( > >>>>>> input => TABLE(my_data), > >>>>>> model => MODEL(my_cat.my_db.classifier_model), > >>>>>> args => DESCRIPTOR(f1, f2)) > >>>>>> > >>>>>> `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)` doesn't > >>>> work > >>>>>> since `TABLE` and `MODEL` are already key words in calcite used by > >>>>> `CREATE > >>>>>> TABLE`, `CREATE MODEL`. Changing to `model_name(...)` works and > >> will > >>> be > >>>>>> treated as a function. > >>>>>> > >>>>>> So I think > >>>>>> > >>>>>> SELECT f1, f2, label FROM > >>>>>> ML_PREDICT( > >>>>>> input => TABLE `my_data`, > >>>>>> model => my_cat.my_db.classifier_model, > >>>>>> args => DESCRIPTOR(f1, f2)) > >>>>>> should be fine for now. > >>>>>> > >>>>>> For the syntax part: > >>>>>> 1). Sounds good. We can drop model task and model kind from the > >>>>> definition. > >>>>>> They can be deduced from the options. > >>>>>> > >>>>>> 2). Sure. We can add temporary model > >>>>>> > >>>>>> 3). Make sense. We can use `show create model <name>` to display > >> all > >>>>>> information and `describe model <name>` to show input/output schema > >>>>>> > >>>>>> Thanks, > >>>>>> Hao > >>>>>> > >>>>>> On Mon, Mar 25, 2024 at 3:21 PM Hao Li <h...@confluent.io> wrote: > >>>>>> > >>>>>>> Hi Ahmed, > >>>>>>> > >>>>>>> Looks like the feature freeze time for 1.20 release is June 15th. > >> We > >>>> can > >>>>>>> definitely get the model DDL into 1.20. For predict and evaluate > >>>>> functions, > >>>>>>> if we can't get into the 1.20 release, we can get them into the > >> 1.21 > >>>>>>> release for sure. > >>>>>>> > >>>>>>> Thanks, > >>>>>>> Hao > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Mon, Mar 25, 2024 at 1:25 AM Timo Walther <twal...@apache.org> > >>>>> wrote: > >>>>>>> > >>>>>>>> Hi Jark and Hao, > >>>>>>>> > >>>>>>>> thanks for the information, Jark! 
> >>>>>>>> Great that the Calcite community
> >>>>>>>> already fixed the problem for us. +1 to adopt the simplified syntax
> >>>>>>>> asap. Maybe even before we upgrade Calcite (i.e. copy over classes), if
> >>>>>>>> upgrading Calcite is too much work right now?
> >>>>>>>>
> >>>>>>>> > Is `DESCRIPTOR` a must in the syntax?
> >>>>>>>>
> >>>>>>>> Yes, we should still stick to the standard as much as possible and all
> >>>>>>>> vendors use DESCRIPTOR/COLUMNS for distinguishing columns vs. literal
> >>>>>>>> arguments. So the final syntax of this discussion would be:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> SELECT f1, f2, label FROM
> >>>>>>>>   ML_PREDICT(TABLE `my_data`, `classifier_model`, DESCRIPTOR(f1, f2))
> >>>>>>>>
> >>>>>>>> SELECT * FROM
> >>>>>>>>   ML_EVALUATE(TABLE `eval_data`, `classifier_model`, DESCRIPTOR(f1, f2))
> >>>>>>>>
> >>>>>>>> Please double check if this is implementable with the current stack. I
> >>>>>>>> fear the parser or validator might not like the "identifier" argument?
> >>>>>>>>
> >>>>>>>> Make sure that also these variations are supported:
> >>>>>>>>
> >>>>>>>> SELECT f1, f2, label FROM
> >>>>>>>>   ML_PREDICT(
> >>>>>>>>     TABLE `my_data`,
> >>>>>>>>     my_cat.my_db.classifier_model,
> >>>>>>>>     DESCRIPTOR(f1, f2))
> >>>>>>>>
> >>>>>>>> SELECT f1, f2, label FROM
> >>>>>>>>   ML_PREDICT(
> >>>>>>>>     input => TABLE `my_data`,
> >>>>>>>>     model => my_cat.my_db.classifier_model,
> >>>>>>>>     args => DESCRIPTOR(f1, f2))
> >>>>>>>>
> >>>>>>>> It might be safer and more future proof to wrap a MODEL() function
> >>>>>>>> around it. This would be more in sync with the standard that actually
> >>>>>>>> still requires to put a TABLE() around the input argument:
> >>>>>>>>
> >>>>>>>> ML_PREDICT(TABLE(`my_data`) PARTITIONED BY c1 ORDERED BY c1, ....)
> >>>>>>>>
> >>>>>>>> So the safest option would be the long-term solution:
> >>>>>>>>
> >>>>>>>> SELECT f1, f2, label FROM
> >>>>>>>>   ML_PREDICT(
> >>>>>>>>     input => TABLE(my_data),
> >>>>>>>>     model => MODEL(my_cat.my_db.classifier_model),
> >>>>>>>>     args => DESCRIPTOR(f1, f2))
> >>>>>>>>
> >>>>>>>> But I'm fine with this if others have a strong opinion:
> >>>>>>>>
> >>>>>>>> SELECT f1, f2, label FROM
> >>>>>>>>   ML_PREDICT(
> >>>>>>>>     input => TABLE `my_data`,
> >>>>>>>>     model => my_cat.my_db.classifier_model,
> >>>>>>>>     args => DESCRIPTOR(f1, f2))
> >>>>>>>>
> >>>>>>>> Some feedback for the remainder of the FLIP:
> >>>>>>>>
> >>>>>>>> 1) Simplify catalog objects
> >>>>>>>>
> >>>>>>>> I would suggest to drop:
> >>>>>>>> CatalogModel.getModelKind()
> >>>>>>>> CatalogModel.getModelTask()
> >>>>>>>>
> >>>>>>>> A catalog object should fully resemble the DDL. And since the DDL puts
> >>>>>>>> those properties in the WITH clause, the catalog object should do the same
> >>>>>>>> (i.e. put them into the `getModelOptions()`). Btw renaming this method
> >>>>>>>> to just `getOptions()` for consistency should be good as well.
> >>>>>>>> Internally, we can still provide enums for these frequently used
> >>>>>>>> classes. Similar to what we do in `FactoryUtil` for other frequently
> >>>>>>>> used options.
> >>>>>>>>
> >>>>>>>> Remove `getDescription()` and `getDetailedDescription()`. They were a
> >>>>>>>> mistake for CatalogTable and should actually be deprecated. They got
> >>>>>>>> replaced by `getComment()` which is sufficient.
> >>>>>>>>
> >>>>>>>> 2) CREATE TEMPORARY MODEL is not supported.
> >>>>>>>>
> >>>>>>>> This is an unnecessary restriction. We should support temporary versions
> >>>>>>>> of these catalog objects as well for consistency. Adding support for
> >>>>>>>> this should be straightforward.
> >>>>>>>>
> >>>>>>>> 3) { DESCRIBE | DESC } MODEL [catalog_name.][database_name.]model_name
> >>>>>>>>
> >>>>>>>> I would suggest we support `SHOW CREATE MODEL` instead. Similar to `SHOW
> >>>>>>>> CREATE TABLE`, this should show all properties. If we support `DESCRIBE
> >>>>>>>> MODEL` it should only list the input parameters similar to `DESCRIBE
> >>>>>>>> TABLE` only shows the columns (not the WITH clause).
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Timo
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 23.03.24 13:17, Ahmed Hamdy wrote:
> >>>>>>>>> Hi everyone,
> >>>>>>>>> +1 for this proposal, I believe it is very useful at the minimum. It would
> >>>>>>>>> be great even having "ML_PREDICT" and "ML_EVALUATE" as built-in PTFs in
> >>>>>>>>> this FLIP as discussed.
> >>>>>>>>> IIUC this will be included in the 1.20 roadmap?
> >>>>>>>>> Best Regards
> >>>>>>>>> Ahmed Hamdy
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Fri, 22 Mar 2024 at 23:54, Hao Li <h...@confluent.io.invalid> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Timo and Jark,
> >>>>>>>>>>
> >>>>>>>>>> I agree Oracle's syntax seems concise and more descriptive. For the
> >>>>>>>>>> built-in `ML_PREDICT` and `ML_EVALUATE` functions I agree with Jark we can
> >>>>>>>>>> support them as built-in PTF using `SqlTableFunction` for this FLIP. We can
> >>>>>>>>>> have a different FLIP discussing user defined PTF and adopt that later for
> >>>>>>>>>> model functions. To summarize, the current proposed syntax is
> >>>>>>>>>>
> >>>>>>>>>> SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
> >>>>>>>>>> `classifier_model`, f1, f2))
> >>>>>>>>>>
> >>>>>>>>>> SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`, `classifier_model`, f1,
> >>>>>>>>>> f2))
> >>>>>>>>>>
> >>>>>>>>>> Is `DESCRIPTOR` a must in the syntax?
If so, it becomes > >>>>>>>>>> > >>>>>>>>>> SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`, > >>>>>>>>>> `classifier_model`, DESCRIPTOR(f1), DESCRIPTOR(f2))) > >>>>>>>>>> > >>>>>>>>>> SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`, > >>>>> `classifier_model`, > >>>>>>>>>> DESCRIPTOR(f1), DESCRIPTOR(f2))) > >>>>>>>>>> > >>>>>>>>>> If Calcite supports dropping outer table keyword, it becomes > >>>>>>>>>> > >>>>>>>>>> SELECT f1, f2, label FROM ML_PREDICT(TABLE `my_data`, > >>>>>>>> `classifier_model`, > >>>>>>>>>> DESCRIPTOR(f1), DESCRIPTOR(f2)) > >>>>>>>>>> > >>>>>>>>>> SELECT * FROM ML_EVALUATE(TABLE `eval_data`, > >> `classifier_model`, > >>>>>>>>>> DESCRIPTOR( > >>>>>>>>>> f1), DESCRIPTOR(f2)) > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> Hao > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Fri, Mar 22, 2024 at 9:16 AM Jark Wu <imj...@gmail.com> > >>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Sorry, I mean we can bump the Calcite version if needed in > >> Flink > >>>>> 1.20. > >>>>>>>>>>> > >>>>>>>>>>> On Fri, 22 Mar 2024 at 22:19, Jark Wu <imj...@gmail.com> > >> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Timo, > >>>>>>>>>>>> > >>>>>>>>>>>> Introducing user-defined PTF is very useful in Flink, I'm +1 > >>> for > >>>>>>>> this. > >>>>>>>>>>>> But I think the ML model FLIP is not blocked by this, because > >>> we > >>>>>>>>>>>> can introduce ML_PREDICT and ML_EVALUATE as built-in PTFs > >>>>>>>>>>>> just like TUMBLE/HOP. And support user-defined ML functions > >> as > >>>>>>>>>>>> a future FLIP. > >>>>>>>>>>>> > >>>>>>>>>>>> Regarding the simplified PTF syntax which reduces the outer > >>>> TABLE() > >>>>>>>>>>>> keyword, > >>>>>>>>>>>> it seems it was just supported[1] by the Calcite community > >> last > >>>>> month > >>>>>>>>>> and > >>>>>>>>>>>> will be > >>>>>>>>>>>> released in the next version (v1.37). 
The Calcite community > >> is > >>>>>>>>>> preparing > >>>>>>>>>>>> the > >>>>>>>>>>>> 1.37 release, so we can bump the version if needed in Flink > >>> 1.19. > >>>>>>>>>>>> > >>>>>>>>>>>> Best, > >>>>>>>>>>>> Jark > >>>>>>>>>>>> > >>>>>>>>>>>> [1]: https://issues.apache.org/jira/browse/CALCITE-6254 > >>>>>>>>>>>> > >>>>>>>>>>>> On Fri, 22 Mar 2024 at 21:46, Timo Walther < > >> twal...@apache.org > >>>> > >>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi everyone, > >>>>>>>>>>>>> > >>>>>>>>>>>>> this is a very important change to the Flink SQL syntax but > >> we > >>>>> can't > >>>>>>>>>>>>> wait until the SQL standard is ready for this. So I'm +1 on > >>>>>>>>>> introducing > >>>>>>>>>>>>> the MODEL concept as a first class citizen in Flink. > >>>>>>>>>>>>> > >>>>>>>>>>>>> For your information: Over the past months I have already > >>> spent > >>>> a > >>>>>>>>>>>>> significant amount of time thinking about how we can > >> introduce > >>>>> PTFs > >>>>>>>> in > >>>>>>>>>>>>> Flink. I reserved FLIP-440[1] for this purpose and I will > >>> share > >>>> a > >>>>>>>>>>>>> version of this in the next 1-2 weeks. > >>>>>>>>>>>>> > >>>>>>>>>>>>> For a good implementation of FLIP-440 and also FLIP-437, we > >>>> should > >>>>>>>>>>>>> evolve the PTF syntax in collaboration with Apache Calcite. 
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> There are different syntax versions out there:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1) Flink
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> SELECT * FROM
> >>>>>>>>>>>>>   TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2) SQL standard
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> SELECT * FROM
> >>>>>>>>>>>>>   TABLE(TUMBLE(TABLE(Bid), DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 3) Oracle
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> SELECT * FROM
> >>>>>>>>>>>>>   TUMBLE(Bid, COLUMNS(bidtime), INTERVAL '10' MINUTES);
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As you can see above, Flink does not follow the standard correctly as it
> >>>>>>>>>>>>> would need to use `TABLE()` but this is not provided by Calcite yet.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I really like the Oracle syntax[2][3] a lot. It reduces necessary
> >>>>>>>>>>>>> keywords to a minimum. Personally, I would like to discuss this syntax
> >>>>>>>>>>>>> in a separate FLIP and hope I will find supporters for:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> SELECT * FROM
> >>>>>>>>>>>>>   TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES);
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If we go entirely with the Oracle syntax, as you can see in the example,
> >>>>>>>>>>>>> Oracle allows for passing identifiers directly.
This would > >>> solve > >>>>> our > >>>>>>>>>>>>> problems for the MODEL as well: > >>>>>>>>>>>>> > >>>>>>>>>>>>> SELECT f1, f2, label FROM ML_PREDICT( > >>>>>>>>>>>>> data => `my_data`, > >>>>>>>>>>>>> model => `classifier_model`, > >>>>>>>>>>>>> input => DESCRIPTOR(f1, f2)); > >>>>>>>>>>>>> > >>>>>>>>>>>>> Or we completely adopt the Oracle syntax: > >>>>>>>>>>>>> > >>>>>>>>>>>>> SELECT f1, f2, label FROM ML_PREDICT( > >>>>>>>>>>>>> data => `my_data`, > >>>>>>>>>>>>> model => `classifier_model`, > >>>>>>>>>>>>> input => COLUMNS(f1, f2)); > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> What do you think? > >>>>>>>>>>>>> > >>>>>>>>>>>>> Happy to create a FLIP for just this syntax question and > >>>>> collaborate > >>>>>>>>>>>>> with the Calcite community on this. Supporting the syntax of > >>>>> Oracle > >>>>>>>>>>>>> shouldn't be too hard to convince at least as parser > >>> parameter. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Regards, > >>>>>>>>>>>>> Timo > >>>>>>>>>>>>> > >>>>>>>>>>>>> [1] > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>> > >>>> > >>> > >> > https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-440%3A+User-defined+Polymorphic+Table+Functions > >>>>>>>>>>>>> [2] > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>> > >>>> > >>> > >> > https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_TF.html#GUID-0F66E239-DE77-4C0E-AC76-D5B632AB8072 > >>>>>>>>>>>>> [3] > >>>>>>>>>>> > >>>>> https://oracle-base.com/articles/18c/polymorphic-table-functions-18c > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On 20.03.24 17:22, Mingge Deng wrote: > >>>>>>>>>>>>>> Thanks Jark for all the insightful comments. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> We have updated the proposal per our offline discussions: > >>>>>>>>>>>>>> 1. Model will be treated as a new relation in FlinkSQL. > >>>>>>>>>>>>>> 2. 
Include the common ML predict and evaluate functions into the open
> >>>>>>>>>>>>>> source flink to complete the user journey.
> >>>>>>>>>>>>>> And we should be able to extend the calcite SqlTableFunction to support
> >>>>>>>>>>>>>> these two ML functions.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Mingge
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Mon, Mar 18, 2024 at 7:05 PM Jark Wu <imj...@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Hao,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I meant how the table name
> >>>>>>>>>>>>>>>> in window TVF gets translated to `SqlCallingBinding`. Probably we need to
> >>>>>>>>>>>>>>>> fetch the table definition from the catalog somewhere. Do we treat those
> >>>>>>>>>>>>>>>> window TVF specially in parser/planner so that catalog is looked up when
> >>>>>>>>>>>>>>>> they are seen?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The table names are resolved and validated by Calcite SqlValidator. We
> >>>>>>>>>>>>>>> don't need to fetch from catalog manually.
> >>>>>>>>>>>>>>> The specific checking logic of cumulate window happens in
> >>>>>>>>>>>>>>> SqlCumulateTableFunction.OperandMetadataImpl#checkOperandTypes.
> >>>>>>>>>>>>>>> The return type of SqlCumulateTableFunction is defined in
> >>>>>>>>>>>>>>> #getRowTypeInference() method.
> >>>>>>>>>>>>>>> Both are public interfaces provided by Calcite and it seems it's not
> >>>>>>>>>>>>>>> specially handled in parser/planner.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I didn't try that, but my gut feeling is that the framework is ready to
> >>>>>>>>>>>>>>> extend a customized TVF.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> For what model is, I'm wondering if it has to be datatype or relation.
> >>>>>>>>>>>>>>> Can > >>>>>>>>>>>>>>> it be another kind of citizen parallel to > >>>>>>>>>>>>> datatype/relation/function/db? > >>>>>>>>>>>>>>> Redshift also supports `show models` operation, so it > >> seems > >>>> it's > >>>>>>>>>>>>> treated > >>>>>>>>>>>>>>> specially as well? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> If it is an entity only used in catalog scope (e.g., show > >>> xxx, > >>>>>>>>>> create > >>>>>>>>>>>>> xxx, > >>>>>>>>>>>>>>> drop xxx), it is fine to introduce it. > >>>>>>>>>>>>>>> We have introduced such one before, called Module: "load > >>>>> module", > >>>>>>>>>>> "show > >>>>>>>>>>>>>>> modules" [1]. > >>>>>>>>>>>>>>> But if we want to use Model in TVF parameters, it means it > >>> has > >>>>> to > >>>>>>>>>> be > >>>>>>>>>>> a > >>>>>>>>>>>>>>> relation or datatype, because > >>>>>>>>>>>>>>> that is what it only accepts now. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Thanks for sharing the reason of preferring TVF instead of > >>>>>>>> Redshift > >>>>>>>>>>>>> way. It > >>>>>>>>>>>>>>> sounds reasonable to me. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>> Jark > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> [1]: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>> > >>>> > >>> > >> > https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/modules/ > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Fri, 15 Mar 2024 at 13:41, Hao Li > >>> <h...@confluent.io.invalid > >>>>> > >>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Hi Jark, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Thanks for the pointer. Sorry for the confusion: I meant > >>> how > >>>>> the > >>>>>>>>>>> table > >>>>>>>>>>>>>>> name > >>>>>>>>>>>>>>>> in window TVF gets translated to `SqlCallingBinding`. > >>>> Probably > >>>>> we > >>>>>>>>>>>>> need to > >>>>>>>>>>>>>>>> fetch the table definition from the catalog somewhere. 
Do > >>> we > >>>>>>>> treat > >>>>>>>>>>>>> those > >>>>>>>>>>>>>>>> window TVF specially in parser/planner so that catalog is > >>>>> looked > >>>>>>>>>> up > >>>>>>>>>>>>> when > >>>>>>>>>>>>>>>> they are seen? > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> For what model is, I'm wondering if it has to be datatype > >>> or > >>>>>>>>>>> relation. > >>>>>>>>>>>>>>> Can > >>>>>>>>>>>>>>>> it be another kind of citizen parallel to > >>>>>>>>>>>>> datatype/relation/function/db? > >>>>>>>>>>>>>>>> Redshift also supports `show models` operation, so it > >> seems > >>>>> it's > >>>>>>>>>>>>> treated > >>>>>>>>>>>>>>>> specially as well? The reasons I don't like Redshift's > >>> syntax > >>>>>>>> are: > >>>>>>>>>>>>>>>> 1. It's a bit verbose, you need to think of a model name > >> as > >>>>> well > >>>>>>>>>> as > >>>>>>>>>>> a > >>>>>>>>>>>>>>>> function name and the function name also needs to be > >>> unique. > >>>>>>>>>>>>>>>> 2. More importantly, prediction function isn't the only > >>>>> function > >>>>>>>>>>> that > >>>>>>>>>>>>> can > >>>>>>>>>>>>>>>> operate on models. There could be a set of inference > >>>> functions > >>>>>>>> [1] > >>>>>>>>>>> and > >>>>>>>>>>>>>>>> evaluation functions [2] which can operate on models. > >> It's > >>>> hard > >>>>>>>> to > >>>>>>>>>>>>>>> specify > >>>>>>>>>>>>>>>> all of them in model creation. 
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> [1]: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-predict
> >>>>>>>>>>>>>>>> [2]: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>> Hao
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Mar 14, 2024 at 8:18 PM Jark Wu <imj...@gmail.com> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Hao,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Can you send me some pointers
> >>>>>>>>>>>>>>>>>> where the function gets the table information?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Here is the code of cumulate window type checking [1].
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Also is it possible to support <query_stmt> in
> >>>>>>>>>>>>>>>>>> window functions in addition to table?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Yes. It is not allowed in TVF.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks for the syntax links of other systems. The reason I prefer the
> >>>>>>>>>>>>>>>>> Redshift way is
> >>>>>>>>>>>>>>>>> that it avoids introducing Model as a relation or datatype (referenced as a
> >>>>>>>>>>>>>>>>> parameter in TVF).
> >>>>>>>>>>>>>>>>> Model is not a relation because it can't be queried directly (e.g., SELECT *
> >>>>>>>>>>>>>>>>> FROM model).
> >>>>>>>>>>>>>>>>> I'm also confused about making Model as a datatype, > >>> because > >>>> I > >>>>>>>>>> don't > >>>>>>>>>>>>>>> know > >>>>>>>>>>>>>>>>> what class the > >>>>>>>>>>>>>>>>> model parameter of the eval method of > >>>>>>>>>> TableFunction/ScalarFunction > >>>>>>>>>>>>>>> should > >>>>>>>>>>>>>>>>> be. By defining > >>>>>>>>>>>>>>>>> the function with the model, users can directly invoke > >> the > >>>>>>>>>> function > >>>>>>>>>>>>>>>> without > >>>>>>>>>>>>>>>>> reference to the model name. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>>> Jark > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> [1]: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>> > >>>> > >>> > >> > https://github.com/apache/flink/blob/d6c7eee8243b4fe3e593698f250643534dc79cb5/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/sql/SqlCumulateTableFunction.java#L53 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Fri, 15 Mar 2024 at 02:48, Hao Li > >>>> <h...@confluent.io.invalid > >>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Hi Jark, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks for the pointers. It's very helpful. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> 1. Looks like `tumble`, `hopping` are keywords in > >> calcite > >>>>>>>>>> parser. > >>>>>>>>>>>>> And > >>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> syntax `cumulate(Table my_table, ...)` needs to get > >> table > >>>>>>>>>>>>> information > >>>>>>>>>>>>>>>>> from > >>>>>>>>>>>>>>>>>> catalog somewhere for type validation etc. Can you send > >>> me > >>>>> some > >>>>>>>>>>>>>>>> pointers > >>>>>>>>>>>>>>>>>> where the function gets the table information? > >>>>>>>>>>>>>>>>>> 2. The ideal syntax for model function I think would be > >>>>>>>>>>>>>>>> `ML_PREDICT(MODEL > >>>>>>>>>>>>>>>>>> <model_name>, {table <table_name> | (query_stmt) })`. 
I think with special handling of the `ML_PREDICT` function in parser/planner,
> >>>>>>>>>>>>>>>>>> maybe we can do
> >>>>>>>>>>>>>>>>>> this like window functions. But to support `MODEL` keyword, we need a calcite
> >>>>>>>>>>>>>>>>>> parser change I guess. Also is it possible to support <query_stmt> in
> >>>>>>>>>>>>>>>>>> window functions in addition to table?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> For the redshift syntax, I'm not sure of the purpose of defining the function
> >>>>>>>>>>>>>>>>>> name with the model. Is it to define the function input/output schema? We
> >>>>>>>>>>>>>>>>>> have the schema in our create model syntax and the `ML_PREDICT` can handle
> >>>>>>>>>>>>>>>>>> it by getting model definition. I think our syntax is more concise to have
> >>>>>>>>>>>>>>>>>> a generic prediction function. I also did some research and it's the syntax
> >>>>>>>>>>>>>>>>>> used by Databricks `ai_query` [1], Snowflake `predict` [2], Azureml
> >>>>>>>>>>>>>>>>>> `predict` [3].
> >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> [1]: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>> > >>>> > >>> > >> > https://docs.databricks.com/en/sql/language-manual/functions/ai_query.html > >>>>>>>>>>>>>>>>>> [2]: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>> > >>>> > >>> > >> > https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_inference.ipynb?_fsi=sksXUwQ0 > >>>>>>>>>>>>>>>>>> [3]: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>> > >>>> > >>> > >> > https://learn.microsoft.com/en-us/sql/machine-learning/tutorials/quickstart-python-train-score-model?view=azuresqldb-mi-current > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>> Hao > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> On Wed, Mar 13, 2024 at 8:57 PM Jark Wu < > >>> imj...@gmail.com> > >>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Hi Mingge, Hao, > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Thanks for your replies. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> PTF is actually the ideal approach for model > >> functions, > >>>> and > >>>>>>>> we > >>>>>>>>>>> do > >>>>>>>>>>>>>>>>> have > >>>>>>>>>>>>>>>>>>> the plans to use PTF for > >>>>>>>>>>>>>>>>>>> all model functions (including prediction, evaluation > >>>> etc..) > >>>>>>>>>> once > >>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>> PTF > >>>>>>>>>>>>>>>>>>> is supported in FlinkSQL > >>>>>>>>>>>>>>>>>>> confluent extension. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> It sounds that PTF is the ideal way and table function > >>> is > >>>> a > >>>>>>>>>>>>>>> temporary > >>>>>>>>>>>>>>>>>>> solution which will be dropped in the future. 
I'm not sure whether we can implement it using PTF in Flink SQL. But we have implemented window functions using PTF [1], and introduced a new window function (called CUMULATE [2]) in Flink SQL based on this. I think it might work to use PTF and implement model function syntax like this:

SELECT * FROM TABLE(ML_PREDICT(
  TABLE my_table,
  my_model,
  col1,
  col2
));

Besides, did you consider following the way of AWS Redshift [3], which defines the model function together with the model itself? IIUC, a model is a black box which defines input parameters and output parameters, which can be modeled into functions.
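
To illustrate the "model as a black box with input and output parameters" idea, here is a minimal Python sketch of binding a model to a function with a fixed signature. It is not Redshift's or Flink's implementation; `bind_model_function`, `fake_classifier`, and the column names `f1`, `f2`, `label` are hypothetical and chosen only for illustration.

```python
def bind_model_function(input_names, output_names, predict):
    """Wrap a black-box predictor as a function with a declared signature."""

    def model_function(*args):
        # The declared input parameters fix the arity of the function.
        if len(args) != len(input_names):
            raise TypeError(
                f"expected {len(input_names)} arguments "
                f"({', '.join(input_names)}), got {len(args)}"
            )
        result = predict(*args)
        # Normalise a single value to a one-element tuple.
        if not isinstance(result, tuple):
            result = (result,)
        if len(result) != len(output_names):
            raise ValueError("predictor returned an unexpected number of outputs")
        # The declared output parameters name the returned values.
        return dict(zip(output_names, result))

    return model_function


# Hypothetical stand-in for a call to a remote inference endpoint.
def fake_classifier(f1, f2):
    return "positive" if f1 + f2 > 0 else "negative"


classify = bind_model_function(("f1", "f2"), ("label",), fake_classifier)
print(classify(1, 2))
```

With this shape, the schema travels with the function itself, whereas a generic `ML_PREDICT` has to look the schema up from the model definition.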

Best,
Jark

[1]: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/window-tvf/#session
[2]: https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function#FLIP145:SupportSQLwindowingtablevaluedfunction-CumulatingWindows
[3]: https://github.com/aws-samples/amazon-redshift-ml-getting-started/blob/main/use-cases/bring-your-own-model-remote-inference/README.md#create-model

On Wed, 13 Mar 2024 at 15:00, Hao Li <h...@confluent.io.invalid> wrote:

Hi Jark,

Thanks for your questions. These are good questions!

1. The polymorphic table function I was referring to takes a table as input and outputs a table.
So the syntax would be like
```
SELECT * FROM ML_PREDICT('model', (SELECT * FROM my_table))
```
As far as I know, this is not supported yet in Flink. So before it's supported, one option for the predict function is using a table function, which can output multiple columns:
```
SELECT * FROM my_table, LATERAL TABLE(ML_PREDICT('model', col1, col2))
```

2. Good question. Type inference is hard for the `ML_PREDICT` function because it takes a model name string as input. I can think of three ways of doing type inference for it.
   1). Treat the `ML_PREDICT` function as something special: during SQL parsing or planning, if it's encountered, we look up the model in the catalog using the first argument, which is a model name. Then we can infer the input/output for the function.
   2). We can define a `model` keyword and use that in the predict function to indicate the argument refers to a model. So it's like `ML_PREDICT(model 'my_model', col1, col2)`
   3).
We can create a special type of table function, maybe called `ModelFunction`, which can resolve the model type inference through special handling during parsing or planning time.
1) is hacky, 2) isn't supported in Flink for functions, 3) might be a good option.

3. I sketched the `ML_PREDICT` function for inference. But there are limitations of the function mentioned in 1 and 2. So maybe we don't need to introduce them as built-in functions until polymorphic table functions are supported and we can properly deal with type inference. After that, defining a user-defined model function should also be straightforward.

4. For model types, do you mean 'remote', 'import', 'native' models or other things?

5. We could support popular providers such as 'azureml', 'vertexai', 'googleai' as long as we support the `ML_PREDICT` function. Users should be able to implement 3rd-party providers if they can implement a function handling the input/output for the provider.
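
As a rough illustration of the catalog-lookup approach to type inference in point 2 (options 1 and 3), here is a minimal Python sketch. It is not Flink's planner or catalog API: `ModelSchema`, `CATALOG`, and `infer_predict_row_type` are hypothetical names, the column types are made up, and the assumption that the result row is the input columns followed by the model outputs is mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSchema:
    """Hypothetical catalog entry: declared input and output columns."""
    inputs: tuple   # ((name, type), ...)
    outputs: tuple  # ((name, type), ...)


# Hypothetical in-memory stand-in for a catalog that stores model definitions.
CATALOG = {
    "my_cat.my_db.classifier_model": ModelSchema(
        inputs=(("f1", "INT"), ("f2", "STRING")),
        outputs=(("label", "STRING"),),
    ),
}


def infer_predict_row_type(model_name, table_schema, arg_columns):
    """Resolve the result row type of a predict call at planning time."""
    model = CATALOG.get(model_name)
    if model is None:
        raise LookupError(f"model not found in catalog: {model_name}")
    if len(arg_columns) != len(model.inputs):
        raise TypeError(
            f"{model_name} expects {len(model.inputs)} arguments, "
            f"got {len(arg_columns)}"
        )
    # Check each argument column against the declared model input type.
    for column, (input_name, input_type) in zip(arg_columns, model.inputs):
        if column not in table_schema:
            raise LookupError(f"unknown column: {column}")
        if table_schema[column] != input_type:
            raise TypeError(
                f"column {column} has type {table_schema[column]}, "
                f"but model input {input_name} expects {input_type}"
            )
    # Assumed result shape: input table columns followed by model outputs.
    return tuple(table_schema.items()) + model.outputs


row_type = infer_predict_row_type(
    "my_cat.my_db.classifier_model",
    {"f1": "INT", "f2": "STRING"},
    ["f1", "f2"],
)
print(row_type)
```

The point of the sketch is only that the output schema cannot be known from the function signature alone; it depends on resolving the model argument first, which is why a plain table function taking a model name string is awkward here.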

I think for the model functions, there are still dependencies or hacks we need to sort out as a built-in function. Maybe we can separate that as a follow-up if we want to have it built-in, and focus on the model syntax for this FLIP?

Thanks,
Hao

On Tue, Mar 12, 2024 at 10:33 PM Jark Wu <imj...@gmail.com> wrote:

Hi Mingge, Chris, Hao,

Thanks for proposing this interesting idea. I think this is a nice step towards the AI world for Apache Flink. I don't know much about AI/ML, so I may have some stupid questions.

1. Could you tell more about why the polymorphic table function (PTF) doesn't work, and do we have a plan to use PTF for model functions?

2. What kind of object does the model map to in SQL? A relation or a data type? It looks like a data type because we use it as a parameter of the table function.
If it is a data type, how does it cooperate with type inference [1]?

3. What built-in model functions will we support? How to define a user-defined model function?

4. What built-in model types will we support? How to define a user-defined model type?

5. Regarding the remote model, what providers will we support? Can users implement 3rd-party providers except OpenAI?

Best,
Jark

[1]: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#type-inference

On Wed, 13 Mar 2024 at 05:55, Hao Li <h...@confluent.io.invalid> wrote:

Hi, Dev

Mingge, Chris and I would like to start a discussion about FLIP-437: Support ML Models in Flink SQL.

This FLIP proposes supporting machine learning models in Flink SQL syntax so that users can CRUD models with Flink SQL and use models on Flink to do prediction with Flink data. The FLIP also proposes new model entities and changes to the catalog interface to support model CRUD operations in the catalog.

For more details, see FLIP-437 [1]. Looking forward to your feedback.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-437%3A+Support+ML+Models+in+Flink+SQL

Thanks,
Mingge, Chris & Hao