Hi Hao,

> Can you send me some pointers where the function gets the table
> information?

Here is the code of cumulate window type checking [1].

> Also is it possible to support <query_stmt> in window functions in
> addition to table?

Yes. It is not allowed in TVF.

Thanks for the syntax links of other systems. The reason I prefer the
Redshift way is that it avoids introducing Model as a relation or
datatype (referenced as a parameter in TVF). Model is not a relation
because it can't be queried directly (e.g., SELECT * FROM model). I'm
also confused about making Model a datatype, because I don't know what
class the model parameter of the eval method of
TableFunction/ScalarFunction should be. By defining the function
together with the model, users can directly invoke the function without
referencing the model name.

Best,
Jark

[1]:
https://github.com/apache/flink/blob/d6c7eee8243b4fe3e593698f250643534dc79cb5/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/sql/SqlCumulateTableFunction.java#L53

On Fri, 15 Mar 2024 at 02:48, Hao Li <h...@confluent.io.invalid> wrote:

> Hi Jark,
>
> Thanks for the pointers. It's very helpful.
>
> 1. Looks like `tumble`, `hopping` are keywords in the Calcite parser.
> And the syntax `cumulate(Table my_table, ...)` needs to get table
> information from the catalog somewhere for type validation etc. Can
> you send me some pointers where the function gets the table
> information?
> 2. The ideal syntax for the model function I think would be
> `ML_PREDICT(MODEL <model_name>, {table <table_name> | (query_stmt)})`.
> I think with special handling of the `ML_PREDICT` function in the
> parser/planner, maybe we can do this like window functions. But to
> support the `MODEL` keyword, we need a Calcite parser change I guess.
> Also is it possible to support <query_stmt> in window functions in
> addition to table?
>
> For the Redshift syntax, I'm not sure about the purpose of defining
> the function name with the model. Is it to define the function
> input/output schema?
> We have the schema in our create model syntax, and `ML_PREDICT` can
> handle it by getting the model definition. I think our syntax is more
> concise in having a generic prediction function. I also did some
> research and it's the syntax used by Databricks `ai_query` [1],
> Snowflake `predict` [2], and Azure ML `predict` [3].
>
> [1]:
> https://docs.databricks.com/en/sql/language-manual/functions/ai_query.html
> [2]:
> https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_inference.ipynb?_fsi=sksXUwQ0
> [3]:
> https://learn.microsoft.com/en-us/sql/machine-learning/tutorials/quickstart-python-train-score-model?view=azuresqldb-mi-current
>
> Thanks,
> Hao
>
> On Wed, Mar 13, 2024 at 8:57 PM Jark Wu <imj...@gmail.com> wrote:
>
> > Hi Mingge, Hao,
> >
> > Thanks for your replies.
> >
> > > PTF is actually the ideal approach for model functions, and we do
> > > have the plans to use PTF for all model functions (including
> > > prediction, evaluation etc.) once the PTF is supported in the
> > > FlinkSQL confluent extension.
> >
> > It sounds like PTF is the ideal way, and the table function is a
> > temporary solution which will be dropped in the future.
> > I'm not sure whether we can implement it using PTF in Flink SQL. But
> > we have implemented window functions using PTF [1], and introduced a
> > new window function (called CUMULATE [2]) in Flink SQL based on this.
> > I think it might work to use PTF and implement the model function
> > syntax like this:
> >
> > SELECT * FROM TABLE(ML_PREDICT(
> >   TABLE my_table,
> >   my_model,
> >   col1,
> >   col2
> > ));
> >
> > Besides, did you consider following the way of AWS Redshift [3],
> > which defines the model function together with the model itself?
> > IIUC, a model is a black box which defines input parameters and
> > output parameters, and which can be modeled into functions.
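The Redshift approach referenced here binds the prediction function to the model at creation time. A minimal sketch of that shape, with entirely made-up model, function, column, and endpoint names (the Redshift README linked as [3] has the authoritative syntax):

```sql
-- Hypothetical Redshift-style "bring your own model" definition: the
-- FUNCTION clause fixes the prediction function's name and its
-- input/output schema at CREATE MODEL time.
CREATE MODEL customer_churn
FUNCTION predict_customer_churn (varchar, int, float)
RETURNS char(1)
SAGEMAKER 'churn-endpoint'
IAM_ROLE default;

-- Inference then goes through the generated, model-specific function;
-- no model name or MODEL keyword appears at query time.
SELECT predict_customer_churn(state, account_length, daily_charge)
FROM customer_activity;
```

Because the function is model-specific, its argument and return types are known from the definition, which sidesteps the type-inference problem discussed elsewhere in the thread for a generic `ML_PREDICT`.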
> > Best,
> > Jark
> >
> > [1]:
> > https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/window-tvf/#session
> > [2]:
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function#FLIP145:SupportSQLwindowingtablevaluedfunction-CumulatingWindows
> > [3]:
> > https://github.com/aws-samples/amazon-redshift-ml-getting-started/blob/main/use-cases/bring-your-own-model-remote-inference/README.md#create-model
> >
> > On Wed, 13 Mar 2024 at 15:00, Hao Li <h...@confluent.io.invalid> wrote:
> >
> > > Hi Jark,
> > >
> > > Thanks for your questions. These are good questions!
> > >
> > > 1. The polymorphism table function I was referring to takes a
> > > table as input and outputs a table. So the syntax would be like
> > >
> > > ```
> > > SELECT * FROM ML_PREDICT('model', (SELECT * FROM my_table))
> > > ```
> > >
> > > As far as I know, this is not supported yet in Flink. So before
> > > it's supported, one option for the predict function is using a
> > > table function which can output multiple columns:
> > >
> > > ```
> > > SELECT * FROM my_table, LATERAL VIEW (ML_PREDICT('model', col1, col2))
> > > ```
> > >
> > > 2. Good question. Type inference is hard for the `ML_PREDICT`
> > > function because it takes a model name string as input. I can
> > > think of three ways of doing type inference for it.
> > > 1). Treat the `ML_PREDICT` function as something special: during
> > > SQL parsing or planning time, if it's encountered, we look up the
> > > model named by the first argument in the catalog. Then we can
> > > infer the input/output types for the function.
> > > 2). We can define a `model` keyword and use it in the predict
> > > function to indicate that the argument refers to a model, like
> > > `ML_PREDICT(model 'my_model', col1, col2)`.
> > > 3).
> > > We can create a special type of table function, maybe called
> > > `ModelFunction`, which can resolve the model type inference by
> > > special handling during parsing or planning time.
> > > 1) is hacky, 2) isn't supported in Flink for functions, 3) might
> > > be a good option.
> > >
> > > 3. I sketched the `ML_PREDICT` function for inference. But there
> > > are limitations of the function mentioned in 1 and 2. So maybe we
> > > don't need to introduce them as built-in functions until the
> > > polymorphism table function is supported and we can properly deal
> > > with type inference.
> > > After that, defining a user-defined model function should also be
> > > straightforward.
> > >
> > > 4. For model types, do you mean 'remote', 'import', 'native'
> > > models or other things?
> > >
> > > 5. We could support popular providers such as 'azureml',
> > > 'vertexai', 'googleai' as long as we support the `ML_PREDICT`
> > > function. Users should be able to implement 3rd-party providers
> > > if they can implement a function handling the input/output for
> > > the provider.
> > >
> > > I think for the model functions, there are still dependencies or
> > > hacks we need to sort out as a built-in function. Maybe we can
> > > separate that as a follow-up if we want to have it built-in, and
> > > focus on the model syntax for this FLIP?
> > >
> > > Thanks,
> > > Hao
> > >
> > > On Tue, Mar 12, 2024 at 10:33 PM Jark Wu <imj...@gmail.com> wrote:
> > >
> > > > Hi Minge, Chris, Hao,
> > > >
> > > > Thanks for proposing this interesting idea. I think this is a
> > > > nice step towards the AI world for Apache Flink. I don't know
> > > > much about AI/ML, so I may have some stupid questions.
> > > >
> > > > 1. Could you tell us more about why the polymorphism table
> > > > function (PTF) doesn't work, and do we have a plan to use PTF
> > > > as model functions?
> > > >
> > > > 2. What kind of object does the model map to in SQL? A relation
> > > > or a data type?
> > > > It looks like a data type because we use it as a parameter of
> > > > the table function.
> > > > If it is a data type, how does it cooperate with type
> > > > inference [1]?
> > > >
> > > > 3. What built-in model functions will we support? How to define
> > > > a user-defined model function?
> > > >
> > > > 4. What built-in model types will we support? How to define a
> > > > user-defined model type?
> > > >
> > > > 5. Regarding the remote model, what providers will we support?
> > > > Can users implement 3rd-party providers other than OpenAI?
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > [1]:
> > > > https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#type-inference
> > > >
> > > > On Wed, 13 Mar 2024 at 05:55, Hao Li <h...@confluent.io.invalid> wrote:
> > > >
> > > > > Hi, Dev,
> > > > >
> > > > > Mingge, Chris and I would like to start a discussion about
> > > > > FLIP-437: Support ML Models in Flink SQL.
> > > > >
> > > > > This FLIP proposes to support machine learning models in Flink
> > > > > SQL syntax so that users can CRUD models with Flink SQL and
> > > > > use models on Flink to do prediction with Flink data. The FLIP
> > > > > also proposes new model entities and changes to the catalog
> > > > > interface to support model CRUD operations in the catalog.
> > > > >
> > > > > For more details, see FLIP-437 [1]. Looking forward to your
> > > > > feedback.
> > > > >
> > > > > [1]
> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-437%3A+Support+ML+Models+in+Flink+SQL
> > > > >
> > > > > Thanks,
> > > > > Minge, Chris & Hao
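As a footnote to the table-function workaround discussed in the thread: Flink SQL invokes table functions via `LATERAL TABLE` (the `LATERAL VIEW` form in the mail is Hive syntax). Assuming a hypothetical user-defined `ML_PREDICT` table function that emits a single `prediction` column, the call would look roughly like:

```sql
-- Hypothetical: ML_PREDICT registered as a user-defined TableFunction
-- producing one column, aliased here as `prediction`. Flink SQL joins
-- each input row against a table function with LATERAL TABLE.
SELECT t.col1, t.col2, p.prediction
FROM my_table AS t,
     LATERAL TABLE(ML_PREDICT('my_model', t.col1, t.col2)) AS p(prediction);
```

The `LATERAL TABLE` form itself is standard documented Flink SQL; only the `ML_PREDICT` function and the column names are placeholders.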