Thanks for the comments, Fan. Please see the reply inline.

On Thu, Aug 19, 2021 at 10:25 PM Fan Hong <hongfa...@gmail.com> wrote:
> Hi, Becket,
>
> Many thanks for your detailed review. I agree that it is easier to involve more people in the discussion if the fundamental differences are highlighted.
>
> Here are some of my thoughts to help other people think about these differences. (Correct me if the technical details are not right.)
>
> 1. One set of APIs or not? Maybe not that important.
>
> First of all, AlgoOperators and Pipeline / Transformer / Estimator in Proposal 2 are absolutely *NOT* independent.
>
> One may think they are independent, because Pipeline / Transformer / Estimator are already in Flink ML Lib while AlgoOperators are newly added in this proposal. But that's not true. If you check Alink [1], where the idea of Proposal 2 originated, both of them have been there for a long time, and they collaborate tightly.
>
> In terms of functionality, they are not independent either. Their relation is more like a two-level API for specifying ML tasks: AlgoOperator is a general-purpose level that can represent any ML algorithm, while Pipeline / Transformer / Estimator provides a higher-level API which enables wrapping multiple ML algorithms together in a fit-transform way.

We probably need to first clarify what "independent" means here. Sure, users can always wrap a Transformer into an AlgoOperator, but users can wrap basically any code, any class into an AlgoOperator, and we wouldn't say AlgoOperator is not independent of any class, right? In my opinion, the two APIs are independent because even if we agree that Transformers do things that are conceptually a subset of what AlgoOperators do, a Transformer cannot be used as an AlgoOperator out of the box without wrapping. Even worse, a MIMO AlgoOperator cannot be wrapped into a Transformer / Estimator at all if those two APIs are SISO. So from what I see, in Proposal 2 these two APIs are independent from an API design perspective.
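To make the wrapping argument concrete, here is a minimal sketch in Java. The interfaces are deliberately stripped down, and the names and signatures are mine for illustration only, not from either proposal:

    import org.apache.flink.table.api.Table;

    // Simplified stand-ins for the two APIs under discussion.
    interface Transformer {
        Table transform(Table input);     // SISO
    }

    interface AlgoOperator {
        Table[] compute(Table... inputs); // MIMO
    }

    // Wrapping a SISO Transformer into a MIMO AlgoOperator is trivial...
    class TransformerAlgoOperator implements AlgoOperator {
        private final Transformer transformer;

        TransformerAlgoOperator(Transformer transformer) {
            this.transformer = transformer;
        }

        @Override
        public Table[] compute(Table... inputs) {
            // Only a single input and a single output are meaningful here.
            return new Table[] {transformer.transform(inputs[0])};
        }
    }

...but the reverse direction does not work: a MIMO AlgoOperator has no place to put its extra inputs and outputs in a SISO transform() signature, which is exactly the asymmetry I am describing above.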
> One could consider Flink DataStream - Table as an analogy to AlgoOperator - Pipeline. The two levels of API provide different functionality to end users, and the higher-level API calls the lower-level API in its internal implementation. I'm not saying the two-level API design in Proposal 2 is good just because Flink already did this; I just hope to help people in the community understand the relation between AlgoOperators and Pipeline.

I am not sure it is accurate to say DataStream is a low-level API of Table. They are simply two different DSLs, one for the relational / SQL-like analytics paradigm, and the other for those who are more familiar with streaming applications. More importantly, they are designed to support conversion from one to the other out of the box, which is unlike Pipeline and AlgoOperators in Proposal 2.

> An additional usage and benefit of the Pipeline API is that a SISO PipelineModel corresponds exactly to a deployable unit for online serving.
>
> In online serving, the Flink runtime is usually avoided in order to achieve low latency, so models have to be wrapped for transmission from the Flink ecosystem to a non-Flink one. This is the place where wrapping is really needed and inevitable, because serving service providers are usually expected to be generic over one type of model. The Pipeline API in Proposal 2 targets exactly this scenario without complicated APIs.
>
> For offline or nearline inference, on the other hand, everything can be completed within the Flink ecosystem. That's where Flink ML Lib still comes in, and a loose wrapping using AlgoOperators in Proposal 2 works there without much overhead.

It seems that a MIMO Transformer can easily support all SISO use cases, right? And there is zero overhead, because users would not have to wrap AlgoOperators at all; they could just build a Pipeline directly by putting either Transformers or AlgoOperators into it, without worrying about whether they are interoperable.

> At the same time, these two levels of API are not redundant in their functionality; they have to collaborate to build ML tasks.
>
> The AlgoOperator API is self-consistent and self-complete for constructing ML tasks, but if users seek to wrap a sequence of subtasks, especially for online serving, the Pipeline / Transformer / Estimator API is inevitable. On the other side, the Pipeline / Transformer / Estimator API lacks completeness, even in the extended version plus the Graph API in Proposal 1 (the last case in [4]), so it cannot replace the AlgoOperator API.
>
> One case of their collaboration lies in my response to Mingliang's recommendation scenarios, where AlgoOperators + Pipeline can provide cleaner usage than the Graph API.

I think the link/linkFrom API is more like a convenience wrapper around fit/transform/compute. Functionality-wise, they are equivalent. The Graph/GraphBuilder API, on the other hand, is an encapsulation design on top of Estimator/Transformer/AlgoOperators. Without the GraphBuilder/Graph API, users would have to create their own class to encapsulate the code, just like without the Pipeline API users would have to create their own class to wrap the pipeline logic. So I don't think we should compare link/linkFrom with the Graph/GraphBuilder API, because they serve different purposes. Even though both of them describe a DAG, GraphBuilder describes it for encapsulation while link/linkFrom does not.
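Just to illustrate what I mean by "convenience wrapper", here is a rough sketch; the class and method signatures are hypothetical, not taken from Alink or either proposal:

    import org.apache.flink.table.api.Table;

    // A hypothetical AlgoOperator offering both styles.
    abstract class AlgoOperator {
        private Table output;

        public Table getOutput() {
            return output;
        }

        // The actual computation logic of the algorithm.
        protected abstract Table compute(Table... inputs);

        // link/linkFrom is just a fluent wrapper around compute().
        public AlgoOperator linkFrom(AlgoOperator... upstream) {
            Table[] inputs = new Table[upstream.length];
            for (int i = 0; i < upstream.length; i++) {
                inputs[i] = upstream[i].getOutput();
            }
            this.output = compute(inputs);
            return this;
        }
    }

With such a base class, c.linkFrom(a, b) is merely sugar for c.compute(a.getOutput(), b.getOutput()); no functionality is added, which is why I say the two styles are equivalent.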
> 2. What is the core semantics of Pipeline / Transformer / Estimator?
>
> I will not give my answer, because I can't. I think it would be difficult to reach an agreement on this.
>
> But I did two things, and I hope they can provide some hints.
>
> One thing was to seek answers from other ML libraries. Scikit-learn and Spark ML are well-known general-purpose ML libraries.
>
> Spark ML gives the definitions of Pipeline / Transformer / Estimator in its documentation. I quote them as follows [2]:
>
>> *Transformer* <https://spark.apache.org/docs/latest/ml-pipeline.html#transformers>: A Transformer is an algorithm which can transform *one* DataFrame into *another* DataFrame. E.g., an ML model is a Transformer which transforms *a* DataFrame with features into *a* DataFrame with predictions.
>> *Estimator* <https://spark.apache.org/docs/latest/ml-pipeline.html#estimators>: An Estimator is an algorithm which can be fit on *a* DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
>> *Pipeline* <https://spark.apache.org/docs/latest/ml-pipeline.html#pipeline>: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
>
> Spark ML clearly declares the quantity of inputs and outputs for the Estimator and Transformer APIs. Scikit-learn does not give a clear definition; instead, it presents its APIs [3]:
>
>> *Estimator:* The base object, implements a fit method to learn from data, either:
>> estimator = estimator.fit(data, targets)
>> or:
>> estimator = estimator.fit(data)
>>
>> *Transformer:* For filtering or modifying the data, in a supervised or unsupervised way, implements:
>> new_data = transformer.transform(data)
>> When fitting and transforming can be performed much more efficiently together than separately, implements:
>> new_data = transformer.fit_transform(data)
>
> In their API signatures, 1 input and 1 output are defined.
>
> Another thing I did was to look for concepts in Big Data APIs that make good analogies to Pipeline / Transformer / Estimator in ML APIs, so that non-ML developers may have a better understanding of their positions in ML APIs.
>
> In the end, I think 'map' in the MapReduce paradigm may be a fair analogy that is easy for everyone to understand. One may think of 'map' as the MapFunction or FlatMapFunction in Flink, or the Mapper in Hadoop. As far as I know, no Big Data API has tried to extend 'map' to support multiple inputs or outputs while still keeping the original name. In Flink, there exist co-map and co-flatMap, which can be considered extensions, yet they did not use the name 'map' anyway.
>
> So, is the core semantics of 'map' a conversion from data to data, or from 1 dataset to another dataset? With either answer, the fact is that no one breaks the usage convention of 'map'.

This is an interesting discussion. First of all, I don't think "map" is a good comparison here. That method is always defined on a class representing a data collection, so there is no actual data input to the method at all; the only semantic that makes sense is to operate on the data collection the `map` method was defined on. And the parameter of `map` is the processing logic, which would also be weird to have more than one of.

Regarding scikit-learn and Spark, every API design has its context, targeted use cases and design goals. I think it is more important to analyze and understand the reason WHY their APIs look like that and whether they are good designs, instead of just following WHAT they look like. In my opinion, one primary reason is that the Spark and scikit-learn Pipelines assume all the samples, whether for training or inference, are already well prepared. They basically exclude the data preparation step from the Pipeline. Take recommendation systems as an example: it is quite typical that the samples are generated from user behaviors stored in different datasets, such as exposures and clicks, and maybe also user profiles stored in relational databases. So MIMO is a must in this case. Today, data preparation is out of the scope of scikit-learn and not included in its Pipeline API; people usually use other tools such as Pandas or Spark DataFrames to prepare the data.

In order to discuss whether a MIMO Pipeline makes sense, we need to think about whether it is valuable to include the data preparation in the Pipeline as well. Personally I think it is a good extension and I don't see much harm in doing so. Apparently, a MIMO Pipeline API would also support SISO Pipelines, so it is conceptually backwards compatible. Algorithms that only make sense as SISO can stay just as they are; the only difference is that instead of just returning an output Table, they return a single-table array, as sketched below. And for those who do need MIMO support, we also make them happy.
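Here is the kind of change I have in mind. The signature is hypothetical and is only meant to show how small the difference is for a SISO algorithm:

    import org.apache.flink.table.api.Table;

    // A hypothetical MIMO transform() signature.
    interface Transformer {
        Table[] transform(Table... inputs);
    }

    // A SISO algorithm keeps its logic unchanged; only the return value is wrapped.
    class SisoTokenizer implements Transformer {
        @Override
        public Table[] transform(Table... inputs) {
            Table result = tokenize(inputs[0]); // unchanged SISO logic
            return new Table[] {result};        // returned as a single-table array
        }

        private Table tokenize(Table input) {
            // Actual processing elided; irrelevant to the API shape.
            return input;
        }
    }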
Therefore it looks like a useful feature with little cost.

BTW, I have asked a few AI practitioners, including some from industry and academia. The concept of a MIMO Pipeline itself seems well accepted. Somewhat surprisingly, although the concept of Transformer / Estimator is understood by most people I talked to, they are not familiar with what a Transformer / Estimator should look like. I think this is partly because the ML Pipeline is a well-known concept without a well-agreed API. In fact, even Spark and scikit-learn have quite different designs for Estimator / Transformer / Pipeline when it comes to the details.

> 3. About the potential inconsistent availability of algorithms
>
> Becket has mentioned that developers may be confused about how to implement the same algorithm in the two levels of API in Proposal 2.
>
> If one accepts the relation between the AlgoOperator API and the Pipeline API described before, then this is not a problem. It is natural that developers implement their algorithms as AlgoOperators, and call the AlgoOperators from Estimators/Transformers.
>
> If not, I propose a rough idea here: an abstract class AlgoOpEstimatorImpl is provided as a subclass of Estimator. It has a method named getTrainOp() which returns the AlgoOperator where the computation logic resides. The other code in AlgoOpEstimatorImpl is fixed. In this way, developers of Flink ML Lib are asked to implement Estimators by inheriting from AlgoOpEstimatorImpl.
>
> Other solutions are also possible, but they may still need some community convention.
>
> I would also like to mention that the same issue exists in Proposal 1, as there are also multiple places where developers can implement algorithms.

I am not sure I fully understand what "there are also multiple places where developers can implement algorithms" means. It is always the algorithm authors' call in terms of how to implement the interfaces. Implementation-wise, it is OK to have an abstract class such as AlgoImpl, and the algorithm authors can choose to leverage it or not. But in either case, the end users won't see the implementation class and should only rely on public interfaces such as Estimator / Transformer / AlgoOperator, etc.
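For what it's worth, here is a rough sketch of such a helper class, along the lines of the AlgoOpEstimatorImpl idea quoted above. All names and signatures are illustrative, and the createModel() hook is my own addition:

    import org.apache.flink.table.api.Table;

    interface AlgoOperator {
        Table[] compute(Table... inputs);
    }

    interface Model {
        Table[] transform(Table... inputs);
    }

    interface Estimator {
        Model fit(Table... inputs);
    }

    // The fixed plumbing lives here; an algorithm author only supplies the
    // AlgoOperator holding the training logic and a way to build the Model
    // from the resulting model data.
    abstract class AlgoOpEstimatorImpl implements Estimator {

        // The extension point: the AlgoOperator where the computation resides.
        protected abstract AlgoOperator getTrainOp();

        protected abstract Model createModel(Table[] modelData);

        @Override
        public Model fit(Table... inputs) {
            Table[] modelData = getTrainOp().compute(inputs);
            return createModel(modelData);
        }
    }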
> In summary, I think the first and second issues above are preference-related, and I hope my thoughts can give some clues. The third issue can be considered a common technical problem in both proposals. We may work together to seek better solutions.
>
> Sincerely,
>
> Fan Hong.
>
> [1] https://github.com/alibaba/Alink
> [2] https://spark.apache.org/docs/latest/ml-pipeline.html
> [3] https://scikit-learn.org/stable/developers/develop.html
> [4] https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do
>
> On Tue, Jul 20, 2021 at 11:42 AM Becket Qin <becket....@gmail.com> wrote:
>
>> Hi Dong, Zhipeng and Fan,
>>
>> Thanks for the detailed proposals. It is quite a lot of reading! Given that we are introducing a lot of stuff here, I find that it might be easier for people to discuss if we list the fundamental differences first. From what I understand, the most fundamental difference between the two proposals is the following:
>>
>> *In order to support graph structures, do we extend Transformer/Estimator, or do we introduce a new set of API?*
>>
>> Proposal 1 tries to keep one set of API, based on Transformer/Estimator/Pipeline. More specifically, it does the following:
>> - Make Transformer and Estimator multi-input and multi-output (MIMO).
>> - Introduce Graph/GraphModel as counterparts of Pipeline/PipelineModel.
>>
>> Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is. Instead, it introduces AlgoOperators to support the graph structure. The AlgoOperators are general-purpose graph nodes supporting MIMO. They are independent of Pipeline / Transformer / Estimator.
>>
>> My two cents:
>>
>> I think it is a big advantage to have a single set of API rather than two independent sets of API, if possible. But I would suggest we change the current proposal 1 a little bit, by learning from proposal 2.
>>
>> What I like about proposal 1:
>> 1. A single set of API, symmetric in Graph/GraphModel and Pipeline/PipelineModel.
>> 2. Keeping most of the benefits of Transformer/Estimator, including the fit-then-transform relation and the save/load capability.
>>
>> However, proposal 1 also introduces some changes that I am not sure about:
>>
>> 1. The most controversial part of proposal 1 is whether we should extend Transformer/Estimator/Pipeline. In fact, different projects have slightly different designs for Transformer/Estimator/Pipeline, so I think it is OK to extend them. However, there are some commonly recognized core semantics that we ideally want to keep. To me these core semantics are:
>>   1. Transformer is a Data -> Data conversion; Estimator deals with the Data -> Model conversion.
>>   2. Estimator.fit() gives a Transformer, and users can just call Transformer.transform() to perform inference.
>> As long as these core semantics are kept, an extension of the API seems acceptable to me.
>>
>> Proposal 1 extends the semantics of Transformer from a Data -> Data conversion to a generic Table -> Table conversion, and claims it is equivalent to the "AlgoOperator" in proposal 2 as a general-purpose graph node. It does change the first semantic. That said, this might just be a naming problem. One possible solution is to have a new subclass of Stage without strong conventional semantics, e.g. "AlgoOp" if we borrow the name from proposal 2, and let Transformer extend it. Just like a PipelineModel is a more specific Transformer, a Transformer would be a more specific "AlgoOp". If we do that, the processing logic that people don't feel comfortable calling a Transformer can just be put into an "AlgoOp" and can thus still be added to a Pipeline / Graph. This borrows the advantage of proposal 2. In other words, it essentially makes the "AlgoOp" equivalent to the "AlgoOperator" in proposal 2, but allows it to be added to the Graph and Pipeline if people want to.
>>
>> This also gives my thoughts on the concern that making Transformer/Estimator MIMO would break the convention of a single-input single-output (SISO) Transformer/Estimator. Since this does not change the core semantics of Transformer/Estimator, it sounds like an intuitive extension to me.
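(A side note on the "AlgoOp" suggestion in my quoted mail above: a minimal sketch of that hierarchy could look like the following. All names and signatures are illustrative only, not a committed design.)

    import org.apache.flink.table.api.Table;

    // The common base of everything that can go into a Pipeline / Graph.
    interface Stage {
        // save/load, parameter handling, etc. omitted
    }

    // A general-purpose node with no Data -> Data convention attached.
    interface AlgoOp extends Stage {
        Table[] process(Table... inputs);
    }

    // A Transformer is then just a more specific AlgoOp, the same way a
    // PipelineModel is a more specific Transformer.
    interface Transformer extends AlgoOp {
        Table[] transform(Table... inputs);

        @Override
        default Table[] process(Table... inputs) {
            return transform(inputs);
        }
    }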
>> 2. Another semantics-related case brought up was heterogeneous topologies in training and inference. In that case, the input of an Estimator would be different from the input of the Transformer returned by Estimator.fit(). The example for this case is Word2Vec, where the input of the Estimator is an article while the input of the Transformer is a single word. The well-recognized ML Pipeline doesn't seem to support this case, because it assumes the inputs of the Estimator and the corresponding Transformer are the same.
>>
>> Both proposal 1 and proposal 2 leave this case unsupported in the Pipeline. To support it:
>> - Proposal 1 adds support for such cases in the Graph/GraphModel by introducing "EstimatorInput" and "TransformerInput". The downside is that it complicates the API.
>> - Proposal 2 leaves it to users to construct two different DAGs for training and inference respectively. This means users have to construct the DAG twice even if most parts of the DAG are the same in training and inference.
>>
>> My gut feeling is that this is not a critical difference, because such a heterogeneous topology is sort of a corner case. Most users do not need to worry about it, and for those who do, either proposal 1 or proposal 2 seems acceptable to me. That said, it looks like with proposal 1 users can still choose to write the program twice without using the Graph API, just like what they do in proposal 2. So technically speaking, proposal 1 is more flexible and allows users to choose either flavor. On the other hand, one could argue that proposal 1 may confuse users with these two flavors. Although personally it seems clear to me, I am open to other ideas.
>>
>> 3. Lastly, there was a concern about proposal 1: some Estimators can no longer be added to the Pipeline, while the original Pipeline accepts any Estimator.
>>
>> It seems that users always have to make sure the input schema required by the Estimator matches the input table. So even with the existing Pipeline, people cannot naively add just any Estimator to a pipeline. Admittedly, proposal 1 adds some more requirements, namely 1) the number of inputs needs to match the number of outputs of the previous stage, and 2) the Estimator must not generate a Transformer with a different required input schema (the heterogeneous case mentioned above). However, given that these mismatches result in exceptions at compile time, just as when users put in an Estimator with a mismatched input schema, personally I find that this does not change the user experience much.
>>
>> So, to summarize my thoughts on this fundamental difference:
>> - In general, I personally prefer having one set of API.
>> - The current proposal 1 may need some improvements in some cases, by borrowing ideas from proposal 2.
>>
>> A few other differences that I consider non-fundamental:
>>
>> *Do we need a top-level encapsulation API for an algorithm?*
>>
>> Proposal 1 has the concept of a Graph, which encapsulates an entire algorithm to provide a unified API following the same semantics as Estimator/Transformer. Users can choose not to package everything into a Graph, but instead write their own program and wrap it as an ordinary function.
>>
>> Proposal 2 does not have a top-level API such as Graph. Instead, users can choose to write an arbitrary function if they want to.
>>
>> From what I understand, in proposal 1, users may still choose to ignore the Graph API and simply construct a DAG by themselves by calling transform() and fit(), or calling AlgoOp.process() if we add "AlgoOp" to proposal 1 as I suggested earlier. So Graph is just an additional way to construct a graph; people can use Graph in a similar way as they use Pipeline/PipelineModel. In other words, there is no conflict between proposal 1 and proposal 2.
>> *The ways to describe a Graph?*
>>
>> Proposal 1 gives two ways to construct a DAG:
>> 1. the raw API using Estimator/Transformer (and potentially "AlgoOp" as well);
>> 2. the GraphBuilder API.
>>
>> Proposal 2 only gives the raw API of AlgoOperator. It assumes there is a main output and some other side outputs, so it can call algoOp1.linkFrom(algoOp2) without specifying the index of the output, at the cost of wrapping all the Tables into an AlgoOperator.
>>
>> The usability argument was mostly around the raw APIs. I don't think the two APIs differ too much from each other. Under the same assumptions, proposal 1 and proposal 2 can probably achieve very similar levels of usability when describing a Graph, if not exactly the same.
>>
>> There are some other differences/arguments between the two proposals. However, I don't think they are fundamental, and just like in the cases mentioned above, the two proposals can easily learn from each other.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <lindon...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two different designs to extend the Flink ML API to support more use cases, e.g. expressing a DAG of preprocessing and training logic. These two designs have been documented in FLIP-173 <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783>.
>>>
>>> We have different opinions on the usability and ease of understanding of the proposed APIs. It will be really useful to have comments on those designs from the open source community and to learn your preferences.
>>>
>>> To facilitate the discussion, we have summarized our design principles and opinions in this Google doc <https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do>. Code snippets for a few example use cases are also provided in the doc to demonstrate the difference between the two solutions.
>>>
>>> This Flink ML API is super important to the future of the Flink ML library. Please feel free to reply to this email thread or comment in the Google doc directly.
>>>
>>> Thank you!
>>> Dong, Zhipeng, Fan