Hi Dian, Thanks for the review! Please see my reply inline.
Regards, Dong On Wed, Feb 8, 2023 at 11:58 AM Dian Fu <dian0511...@gmail.com> wrote: > Hi Dong, > > > Thanks for driving this effort! This FLIP LGTM overall. I have just a few > minor comments regarding the proposed API: > > 1) For the method `DataFrame.collect()`, why is it named `collect` instead > of something else, e.g. `get`? Does it mean that the result will be > computed during this method call? > I have chosen the names to be closer to the names of similar classes or concepts in the existing popular open-source projects. The reasons for using `collect` include: - mleap is a popular machine learning serving framework and it uses LeapFrame#collect() <https://github.com/combust/mleap/blob/master/mleap-runtime/src/main/scala/ml/combust/mleap/runtime/frame/LeapFrame.scala#L13> to represent this concept. - Collect rows from table seems close to the concept of TableResult#collect() in Flink. - The method returns a *collection* of rows from this table, which makes `collect()` is bit more relevant than `get()`. Currently, I expect DataFrame.collect() to just return the values already computed before this method is called. > 2) For the method `DataFrame.addColumn`, when will this method be used? > For example, we might want to implement KMeansModelServable#transform that appends the prediction result to the input dataFrame. This is similar to the existing way that KMeansModel#transform <https://github.com/apache/flink-ml/blob/master/flink-ml-lib/src/main/java/org/apache/flink/ml/clustering/kmeans/KMeansModel.java#L84> appends the result to the input Table. Note that we probably don't want to copy the entire input DataFrame to the outpuDataFrame. We will need to use DataFrame.addColumn to append the prediction result in this case. > 3) In the example `runOnlineInferenceOnWebServer`, it uses > `output_df.getDataType("output")` to get the result type. It's not quite > intuitive. Does it make sense to add a method `getDataType()` in DataFrame? I guess you are asking whether we should just have a method getDataType() that returns types of all columns of the dataframe. Here is a scenario where users and algorithms will read only selected columns from DataFrame: - Users provided a DataFrame with columnA and columnB - The 1st algorithm will read columnA and columnB, then computes/appends columnC to the DataFrame. - The 2nd algorithm will read columnB and columnC, then computes/appends columnD to the DataFrame. - Users need to read columnD in the final DataFrame. > Regards, > Dian > > > > On Tue, Feb 7, 2023 at 12:37 PM Dong Lin <lindon...@gmail.com> wrote: > > > Hi all, > > > > If there is no question related to this FLIP, we will start the voting > > thread on 2/10. > > > > Regards, > > Dong > > > > On Wed, Feb 1, 2023 at 8:38 PM Dong Lin <lindon...@gmail.com> wrote: > > > > > Hi all, > > > > > > Fan, Jiang, Zhipeng, and I have created FLIP-289: Support online > > inference > > > (Flink ML). > > > > > > The goal of this FLIP is to enable users to use the model trained by > > Flink > > > ML to do online inference. More details can be found in the FLIP doc at > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=240881268 > > > . > > > > > > We are looking forward to your comments. > > > > > > Regards, > > > Dong > > > > > >