[GitHub] [flink] c4emmmm commented on a change in pull request #8402: [FLINK-12473][ml] Add the interface of ML pipeline and ML lib

GitBox Tue, 21 May 2019 23:54:27 -0700

c4emmmm commented on a change in pull request #8402: [FLINK-12473][ml] Add the 
interface of ML pipeline and ML lib
URL: https://github.com/apache/flink/pull/8402#discussion_r286336701


 ##########
 File path: 
flink-ml/flink-ml-api/src/main/java/org/apache/flink/ml/api/misc/persist/Persistable.java
 ##########
 @@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.api.misc.persist;
+
+/**
+ * An interface to allow PipelineStage persistence and reload. As of now, we are using JSON as
+ * format.
+ */
+public interface Persistable {
 
 Review comment:
   Thanks for your explanation. I found this example code in SparkML. It mainly 
uses the toDebugString method to acquire the info. But after reading the code, 
I found that toDebugString is provided by only a few models, most of which are 
tree models.
   
   Here are all the models I found that have this method or something like it:
   DecisionTreeClassificationModel,
   DecisionTreeRegressionModel,
   RandomForest,
   DecisionTreeModel,
   GBTClassificationModel,
   GBTRegressionModel
   
   and an example debug string returned by this method looks like this:
   RandomForestClassificationModel (uid=rfc_6c4ceb92ba78) with 20 trees
     Tree 0 (weight 1.0):
       If (feature 0 <= 1.0)
        If (feature 10 <= 0.0)
         If (feature 3 <= 6.0)
          Predict: 0.0
         Else (feature 3 > 6.0)
          Predict: 0.0
        Else (feature 10 > 0.0)
         If (feature 12 <= 63.0)
          Predict: 0.0
         Else (feature 12 > 63.0)
          Predict: 0.0
       Else (feature 0 > 1.0)
        If (feature 13 <= 1.0)
         If (feature 3 <= 3.0)
          Predict: 0.0
         Else (feature 3 > 3.0)
          Predict: 1.0
        Else (feature 13 > 1.0)
         If (feature 7 <= 1.0)
          Predict: 0.0
         Else (feature 7 > 1.0)
          Predict: 0.0
     Tree 1 ... (repeat count of trees times with different weights)
   
   The need to get useful information out of a pipeline is indeed a valuable 
one. If estimators or models could rephrase their params with a friendlier 
description, they would be easier to understand. I think this is the main 
value that toDebugString provides. Maybe we should stipulate that all models 
have a friendly and well-defined toString() method, or even add a describe() 
method to the basic interfaces such as PipelineStage.
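   
   To make the describe() idea concrete, here is a minimal sketch. It is only 
an illustration: the Stage interface, its params() accessor and the 
MinMaxScaler class are hypothetical names, not part of the API in this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class DescribeSketch {

    // Hypothetical stand-in for PipelineStage; not the actual Flink ML interface.
    interface Stage {
        // Parameters of the stage in a stable order (assumed representation).
        Map<String, Object> params();

        // Friendly one-line summary; tree models could override this with a richer dump.
        default String describe() {
            StringJoiner joiner = new StringJoiner(", ", getClass().getSimpleName() + "(", ")");
            for (Map.Entry<String, Object> e : params().entrySet()) {
                joiner.add(e.getKey() + "=" + e.getValue());
            }
            return joiner.toString();
        }
    }

    // Toy stage used only for illustration.
    static final class MinMaxScaler implements Stage {
        private final Map<String, Object> params = new LinkedHashMap<>();

        MinMaxScaler(double min, double max) {
            params.put("min", min);
            params.put("max", max);
        }

        @Override
        public Map<String, Object> params() {
            return params;
        }
    }

    public static void main(String[] args) {
        // Prints: MinMaxScaler(min=0.0, max=1.0)
        System.out.println(new MinMaxScaler(0.0, 1.0).describe());
    }
}
```

   A default method gives every stage a readable description for free, while 
models with more structure can still override it.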
   
   But that is not what we should do with the JSON we are discussing. The JSON 
is designed to describe a pipeline completely: it should be possible to 
generate it from a pipeline and convert it back to a pipeline without 
information loss, which makes it usable as a storage format. Its readability 
is mostly for users who want to edit a pipeline without editing the code and 
recompiling it. This makes it more convenient to tune or reuse a pipeline.
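   
   The round-trip property can be sketched as follows. This is a toy under 
stated assumptions: the Stage class, its flat string-only params map, and the 
toJson/loadJson method names are hypothetical, and the hand-rolled parsing 
only handles keys and values without quotes or backslashes. A real 
implementation would use a proper JSON library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JsonRoundTripSketch {

    // Hypothetical stage whose whole state is a flat map of string parameters.
    static final class Stage {
        // Matches one "key":"value" pair (no quotes or backslashes inside, by assumption).
        private static final Pattern PAIR =
                Pattern.compile("\"([^\"]*)\"\\s*:\\s*\"([^\"]*)\"");

        final Map<String, String> params = new LinkedHashMap<>();

        // Serialize every parameter, so nothing is lost on the way out.
        String toJson() {
            StringBuilder sb = new StringBuilder("{");
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 1) {
                    sb.append(',');
                }
                sb.append('"').append(e.getKey()).append("\":\"")
                        .append(e.getValue()).append('"');
            }
            return sb.append('}').toString();
        }

        // Rebuild the state from JSON, replacing whatever was there before.
        void loadJson(String json) {
            params.clear();
            Matcher m = PAIR.matcher(json);
            while (m.find()) {
                params.put(m.group(1), m.group(2));
            }
        }
    }

    public static void main(String[] args) {
        Stage original = new Stage();
        original.params.put("k", "5");
        original.params.put("metric", "euclidean");

        // Save, then reload into a fresh stage: the state must be identical.
        Stage restored = new Stage();
        restored.loadJson(original.toJson());
        System.out.println(restored.params.equals(original.params));

        // A user can also tune a value in the JSON text without recompiling.
        Stage tuned = new Stage();
        tuned.loadJson(original.toJson().replace("\"5\"", "\"7\""));
        System.out.println(tuned.params.get("k"));
    }
}
```

   The point of the sketch is the contract, not the format: loadJson(toJson()) 
must reproduce the stage exactly, and editing the text must be enough to change 
a parameter.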
   
   Another question is about the initial transformers. It's already planned 
that the next PR, once the API is accepted, will add some initial estimators, 
models and transformers. They will have well-defined toString methods that 
provide the information a user needs. More will be added after that, and 
hopefully even more libraries will be contributed to enrich Flink ML.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
