c4emmmm commented on a change in pull request #8402: [FLINK-12473][ml] Add the interface of ML pipeline and ML lib URL: https://github.com/apache/flink/pull/8402#discussion_r286336701
##########
File path: flink-ml/flink-ml-api/src/main/java/org/apache/flink/ml/api/misc/persist/Persistable.java
##########
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.api.misc.persist;
+
+/**
+ * An interface to allow PipelineStage persistence and reload. As of now, we are using JSON as
+ * format.
+ */
+public interface Persistable {

Review comment:
Thanks for the explanation. I found this example code in SparkML; it mainly uses the toDebugString method to get the info. But after reading the code, I found that toDebugString is provided by only a few models, most of which are tree models.
Here are all the models I found that have this method or something like it: DecisionTreeClassificationModel, DecisionTreeRegressionModel, RandomForest, DecisionTreeModel, GBTClassificationModel, GBTRegressionModel.

An example debug string returned by this method looks like this:

    RandomForestClassificationModel (uid=rfc_6c4ceb92ba78) with 20 trees
      Tree 0 (weight 1.0):
        If (feature 0 <= 1.0)
         If (feature 10 <= 0.0)
          If (feature 3 <= 6.0)
           Predict: 0.0
          Else (feature 3 > 6.0)
           Predict: 0.0
         Else (feature 10 > 0.0)
          If (feature 12 <= 63.0)
           Predict: 0.0
          Else (feature 12 > 63.0)
           Predict: 0.0
        Else (feature 0 > 1.0)
         If (feature 13 <= 1.0)
          If (feature 3 <= 3.0)
           Predict: 0.0
          Else (feature 3 > 3.0)
           Predict: 1.0
         Else (feature 13 > 1.0)
          If (feature 7 <= 1.0)
           Predict: 0.0
          Else (feature 7 > 1.0)
           Predict: 0.0
      Tree 1 ... (repeated once per tree, with different weights)

Being able to get the needed information out of a pipeline is indeed valuable. If estimators or models could restate their params with a friendlier description, they would be easier to understand. I think this is the main value that toDebugString provides. Maybe we should stipulate that all models have a friendly and well-defined toString() method, or even add a describe() method to the basic interfaces like PipelineStage.

But that is not what we should do with the JSON we are discussing. The JSON is designed to describe a pipeline completely: it should be possible to generate it from a pipeline and convert it back into a pipeline without information loss, which is what allows it to serve as a storage format. Its readability is mostly for users who want to edit a pipeline without editing the code and recompiling it. This makes it more convenient to tune or reuse a pipeline.

The other question is about the initial transformers. The plan is already that the next PR, once the API is accepted, will add some initial estimators, models and transformers. They will have well-defined toString methods that provide the information a user needs.
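To make the difference between the storage JSON and a human-friendly description concrete, here is a rough sketch. Everything in it is hypothetical: the `Stage` interface, the `ToyScaler` class and the param names are made up for illustration and are not the API proposed in this PR, and the JSON handling is deliberately naive (flat string params, no escaping) so that the example has no dependencies.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DescribeSketch {

    /** Hypothetical stand-in for a pipeline stage; NOT the interface from this PR. */
    interface Stage {
        String toJson();            // lossless, machine-readable storage format
        void loadJson(String json); // restores exactly the params that were saved
        String describe();          // human-readable summary, allowed to be lossy
    }

    /** Toy stage that only holds flat string params. */
    static class ToyScaler implements Stage {
        final Map<String, String> params = new LinkedHashMap<>();

        ToyScaler set(String key, String value) {
            params.put(key, value);
            return this;
        }

        @Override
        public String toJson() {
            StringBuilder sb = new StringBuilder("{");
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 1) {
                    sb.append(',');
                }
                sb.append('"').append(e.getKey()).append("\":\"")
                  .append(e.getValue()).append('"');
            }
            return sb.append('}').toString();
        }

        @Override
        public void loadJson(String json) {
            // Naive parser: only handles the flat, escape-free output of toJson().
            params.clear();
            Matcher m = Pattern.compile("\"([^\"]*)\":\"([^\"]*)\"").matcher(json);
            while (m.find()) {
                params.put(m.group(1), m.group(2));
            }
        }

        @Override
        public String describe() {
            return "ToyScaler: rescales column '" + params.get("inputCol")
                + "' to the range [0, " + params.get("max") + "]";
        }
    }

    public static void main(String[] args) {
        ToyScaler original = new ToyScaler().set("inputCol", "age").set("max", "1.0");

        // Round trip through JSON: the restored stage has the same params.
        ToyScaler restored = new ToyScaler();
        restored.loadJson(original.toJson());

        System.out.println(original.toJson());
        System.out.println(restored.params.equals(original.params));
        System.out.println(restored.describe());
    }
}
```

The point of the sketch is only that the two methods serve different readers: toJson()/loadJson() must round-trip every param, while describe() is free to rephrase or omit them.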
More estimators, models and transformers will be added after that PR, and hopefully even more libraries will be contributed to enrich Flink ML.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services