Hi everyone,

Based on the feedback received in the online/offline discussions over the past few weeks, we (Zhipeng, Fan, myself, and a few other developers at Alibaba) have reached agreement on the design to support a DAG of algorithms. We have merged the ideas from the initial two options into this FLIP-176 <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783> design doc.
If you have comments on the latest design doc, please let us know!

Cheers,
Dong

On Mon, Aug 23, 2021 at 5:07 PM Becket Qin <becket....@gmail.com> wrote:

> Thanks for the comments, Fan. Please see the reply inline.
>
> On Thu, Aug 19, 2021 at 10:25 PM Fan Hong <hongfa...@gmail.com> wrote:
>
> > Hi, Becket,
> >
> > Many thanks for your detailed review. I agree that it is easier to involve more people in the discussion if the fundamental differences are highlighted.
> >
> > Here are some of my thoughts to help others think about these differences. (Correct me if any of the technical details are not right.)
> >
> > 1. One set of API or not? Maybe not that important.
> >
> > First of all, AlgoOperators and Pipeline / Transformer / Estimator in Proposal 2 are absolutely *NOT* independent.
> >
> > One may think they are independent because Pipeline / Transformer / Estimator are already in the Flink ML lib while AlgoOperators were only recently added in this proposal. But that's not true. If you check Alink [1], where the idea of Proposal 2 originated, both of them have been present for a long time, and they collaborate tightly.
> >
> > Functionality-wise, they are also not independent. Their relation is more like a two-level API for specifying ML tasks: AlgoOperator is a general-purpose level that can represent any ML algorithm, while Pipeline / Transformer / Estimator provides a higher-level API that enables wrapping multiple ML algorithms together in a fit-transform fashion.
>
> We probably need to first clarify what "independent" means here. Sure, users can always wrap a Transformer into an AlgoOperator, but users can basically wrap any code, any class into an AlgoOperator, and we wouldn't say AlgoOperator is not independent of any class, right? In my opinion, the two APIs are independent because even if we agree that Transformers do things that are conceptually a subset of what AlgoOperators do, a Transformer cannot be used as an AlgoOperator out of the box without wrapping. And even worse, a MIMO AlgoOperator cannot be wrapped into a Transformer / Estimator if those two APIs are SISO. So from what I see, in Option 2, these two APIs are independent from an API design perspective.
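> To make the wrapping point concrete, here is a minimal sketch of adapting a SISO Transformer to an AlgoOperator. The interfaces are hypothetical, since neither proposal pins down these exact signatures; I am assuming an AlgoOperator with a MIMO compute() and a SISO Transformer.
>
> import org.apache.flink.table.api.Table;
>
> // Hypothetical adapter: exposes a SISO Transformer as an AlgoOperator.
> public class TransformerAlgoOp implements AlgoOperator {
>     private final Transformer transformer;
>
>     public TransformerAlgoOp(Transformer transformer) {
>         this.transformer = transformer;
>     }
>
>     @Override
>     public Table[] compute(Table... inputs) {
>         // Only the SISO case can be adapted; there is no such adapter
>         // in the reverse direction for a MIMO AlgoOperator.
>         return new Table[] {transformer.transform(inputs[0])};
>     }
> }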
> > One could consider Flink DataStream - Table as an analogy to AlgoOperators - Pipeline. The two levels of API provide different functionality to end users, and the higher-level API calls the lower-level API in its internal implementation. I'm not saying the two-level API design in Proposal 2 is good just because Flink already did this; I just hope to help the community understand the relation between AlgoOperators and Pipeline.
>
> I am not sure it is accurate to say DataStream is a low-level API of Table. They are simply two different DSLs, one for the relational / SQL-like analytics paradigm, and the other for those who are more familiar with streaming applications. More importantly, they are designed to support conversion from one to the other out of the box, which is unlike Pipeline and AlgoOperators in proposal 2.
>
> > An additional usage and benefit of the Pipeline API is that a SISO PipelineModel corresponds exactly to a deployable unit for online serving.
> >
> > In online serving, the Flink runtime is usually avoided in order to achieve low latency, so models have to be wrapped for transmission from the Flink ecosystem to a non-Flink one. This is where the wrapping is really needed and inevitable, because serving providers are usually expected to handle one general type of model. The Pipeline API in Proposal 2 targets exactly this scenario without complicated APIs.
> >
> > Yet offline or nearline inference can be completed within the Flink ecosystem. That is where the Flink ML lib still applies, so a loose wrapping using AlgoOperators in Proposal 2 still works without much overhead.
>
> It seems that a MIMO transformer can easily support all SISO use cases, right? And there is zero overhead, because users may not have to wrap AlgoOperators, but can just build a Pipeline directly by putting either Transformers or AlgoOperators into it, without worrying about whether they are interoperable.
>
> > At the same time, these two levels of API are not redundant in their functionality; they have to collaborate to build ML tasks.
> >
> > The AlgoOperator API is self-consistent and self-complete in constructing ML tasks, but if users are seeking to wrap a sequence of subtasks, especially for online serving, the Pipeline / Transformer / Estimator API is inevitable. On the other hand, the Pipeline / Transformer / Estimator API lacks completeness, even in the extended version plus the Graph API in Proposal 1 (last case in [4]), so it cannot replace the AlgoOperator API.
> >
> > One case of their collaboration lies in my response to Mingliang's recommendation scenarios, where AlgoOperators + Pipeline can provide cleaner usage than the Graph API.
>
> I think the link/linkFrom API is more like a convenient wrapper around fit/transform/compute. Functionality-wise, they are equivalent. The Graph/GraphBuilder API, on the other hand, is an encapsulation design on top of Estimator/Transformer/AlgoOperators. Without the GraphBuilder/Graph API, users would have to create their own class to encapsulate the code, just like without the Pipeline API users would have to create their own class to wrap the pipeline logic. So I don't think we should compare link/linkFrom with the Graph/GraphBuilder API, because they serve different purposes. Even though both of them need to describe a DAG, GraphBuilder describes it for encapsulation while link/linkFrom does not.
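> To illustrate the equivalence, here is a sketch of the same two-step logic in both flavors. The KMeans* names and the exact signatures are hypothetical, and linkFrom is assumed to return the linked operator as in proposal 2.
>
> // Proposal 2 flavor: wire AlgoOperators with linkFrom.
> AlgoOperator predict(AlgoOperator source) {
>     AlgoOperator trainOp = new KMeansTrainOp().linkFrom(source);
>     return new KMeansPredictOp().linkFrom(trainOp, source);
> }
>
> // The same logic expressed in the fit/transform flavor.
> Table predict(Table data) {
>     KMeansModel model = new KMeans().fit(data);
>     return model.transform(data);
> }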
> > 2. What are the core semantics of Pipeline / Transformer / Estimator?
> >
> > I will not give my answer, because I can't. I think it would be difficult to reach an agreement on this. But I did two things, and hope they can provide some hints.
> >
> > One thing was to seek answers from other ML libraries. Scikit-learn and Spark ML are well-known general-purpose ML libraries. Spark ML gives definitions of Pipeline / Transformer / Estimator in its documentation, which I quote as follows [2]:
> >
> >> *Transformer* <https://spark.apache.org/docs/latest/ml-pipeline.html#transformers>: A Transformer is an algorithm which can transform *one* DataFrame into *another* DataFrame. E.g., an ML model is a Transformer which transforms *a* DataFrame with features into *a* DataFrame with predictions.
> >> *Estimator* <https://spark.apache.org/docs/latest/ml-pipeline.html#estimators>: An Estimator is an algorithm which can be fit on *a* DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
> >> *Pipeline* <https://spark.apache.org/docs/latest/ml-pipeline.html#pipeline>: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
> >
> > Spark ML clearly declares the number of inputs and outputs for the Estimator and Transformer APIs. Scikit-learn does not give a clear definition; instead, it presents its APIs [3]:
> >
> >> *Estimator:* The base object, implements a fit method to learn from data, either:
> >> estimator = estimator.fit(data, targets)
> >> or:
> >> estimator = estimator.fit(data)
> >>
> >> *Transformer:* For filtering or modifying the data, in a supervised or unsupervised way, implements:
> >> new_data = transformer.transform(data)
> >> When fitting and transforming can be performed much more efficiently together than separately, implements:
> >> new_data = transformer.fit_transform(data)
> >
> > In these API signatures, one input and one output are defined.
> >
> > The other thing I did was to look for concepts in big data APIs that are analogous to Pipeline / Transformer / Estimator in ML APIs, so that non-ML developers may better understand their positions. In the end, I think 'map' in the MapReduce paradigm may be a fair analogy that is easy for everyone to understand. One may think of 'map' as the MapFunction or FlatMapFunction in Flink, or the Mapper in Hadoop. As far as I know, no big data API tries to extend 'map' to support multiple inputs or outputs while still keeping the original name. In Flink, there exist CoMap and CoFlatMap, which can be considered extensions, yet they do not use the name 'map'.
> >
> > So, are the core semantics of 'map' a conversion from data to data, or from one dataset to another dataset? With either answer, the fact is that no one breaks the usage convention of 'map'.
>
> This is an interesting discussion. First of all, I don't think "map" is a good comparison here. That method is always defined on a class representing a data collection, so there is no actual data input to the method at all. The only semantic that makes sense is to operate on the data collection the `map` method was defined on. And the parameter of `map` is the processing logic, which would also be weird to have more than one of.
>
> Regarding scikit-learn and Spark, every API design has its own context, targeted use cases, and design goals. I think it is more important to analyze and understand WHY their APIs look like that and whether they are good designs, instead of just following WHAT they look like.
>
> In my opinion, one primary reason is that the Spark and scikit-learn Pipelines assume all the samples, whether for training or inference, are well prepared. They basically exclude the data preparation step from the Pipeline. Take recommendation systems as an example: it is quite typical that the samples are generated from user behaviors stored in different datasets, such as exposures and clicks, and maybe also from user profiles stored in relational databases. So MIMO is a must in this case. Today, data preparation is out of scope for scikit-learn and not included in its Pipeline API; people usually use other tools such as Pandas or Spark DataFrame to prepare the data.
>
> In order to discuss whether the MIMO Pipeline makes sense, we need to think about whether it is valuable to include the data preparation in the Pipeline as well. Personally I think it is a good extension and I don't see much harm in doing so. A MIMO Pipeline API would also support SISO Pipelines, so it is conceptually backwards compatible. Algorithms that only make sense as SISO can stay just as they are; the only difference is that instead of returning an output Table, they return an array containing a single Table. And for those who do need MIMO support, we also make them happy. Therefore it looks like a useful feature with little cost.
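> As a concrete sketch of what that extension might look like (the signatures here are illustrative, not a final design):
>
> import org.apache.flink.table.api.Table;
>
> // Illustrative MIMO Transformer; SISO is just the special case where
> // both the input and output arrays have length 1.
> public interface Transformer {
>     Table[] transform(Table... inputs);
> }
>
> // A SISO algorithm keeps its logic unchanged and returns a
> // single-element array.
> public class Tokenizer implements Transformer {
>     @Override
>     public Table[] transform(Table... inputs) {
>         Table tokenized = tokenize(inputs[0]); // existing SISO logic
>         return new Table[] {tokenized};
>     }
>
>     private Table tokenize(Table input) {
>         return input; // placeholder for the actual tokenization logic
>     }
> }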
> BTW, I have asked a few AI practitioners, from both industry and academia. The concept of a MIMO Pipeline itself seems well accepted. Somewhat surprisingly, although the concept of Transformer / Estimator is understood by most people I talked to, they are not familiar with what a Transformer / Estimator should look like. I think this is partly because the ML Pipeline is a well-known concept without a well-agreed API. In fact, even Spark and scikit-learn have quite different designs for Estimator / Transformer / Pipeline when it comes to the details.
>
> > 3. About potentially inconsistent availability of algorithms
> >
> > Becket has mentioned that developers may be confused about how to implement the same algorithm given the two levels of API in Proposal 2.
> >
> > If one accepts the relation between the AlgoOperator API and the Pipeline API described above, then this is not a problem. It is natural that developers implement their algorithms as AlgoOperators, and call those AlgoOperators from Estimators/Transformers.
> >
> > If not, I propose a rough idea here: an abstract class AlgoOpEstimatorImpl is provided as a subclass of Estimator. It has a method named getTrainOp() which returns the AlgoOperator where the computation logic resides. The other code in AlgoOpEstimatorImpl is fixed. In this way, developers of the Flink ML lib are asked to implement Estimators by inheriting from AlgoOpEstimatorImpl.
> >
> > Other solutions are also possible, but they may still need some community convention.
> >
> > I would also like to mention that the same issue exists in Proposal 1, as there are also multiple places where developers can implement algorithms.
>
> I am not sure I fully understand what "there are also multiple places where developers can implement algorithms" means. It is always the algorithm authors' call how to implement the interfaces. Implementation-wise, it is OK to have an abstract class such as AlgoImpl, and the algorithm authors can choose whether to leverage it. But in either case, the end users won't see the implementation class and should only rely on public interfaces such as Estimator / Transformer / AlgoOperator, etc.
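> For reference, here is a minimal sketch of Fan's rough idea. AlgoOpEstimatorImpl and getTrainOp() come from the email above; everything else, including the AlgoOpModel wrapper and the fit signature, is assumed for illustration.
>
> // The fixed plumbing lives in the abstract class; subclasses only
> // provide the AlgoOperator that holds the training logic.
> public abstract class AlgoOpEstimatorImpl implements Estimator {
>     protected abstract AlgoOperator getTrainOp();
>
>     @Override
>     public Transformer fit(Table... inputs) {
>         // Run the training op and wrap its output (the model data)
>         // into a Transformer for inference.
>         Table[] modelData = getTrainOp().compute(inputs);
>         return new AlgoOpModel(modelData);
>     }
> }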
> > In summary, I think the first and second issues above are preference-related, and I hope my thoughts can give some clues. The third issue can be considered a common technical problem in both proposals. We may work together to seek better solutions.
> >
> > Sincerely,
> > Fan Hong
> >
> > [1] https://github.com/alibaba/Alink
> > [2] https://spark.apache.org/docs/latest/ml-pipeline.html
> > [3] https://scikit-learn.org/stable/developers/develop.html
> > [4] https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do
> >
> > On Tue, Jul 20, 2021 at 11:42 AM Becket Qin <becket....@gmail.com> wrote:
> >
> >> Hi Dong, Zhipeng and Fan,
> >>
> >> Thanks for the detailed proposals. It is quite a lot of reading! Given that we are introducing a lot of stuff here, I find that it might be easier for people to discuss if we list the fundamental differences first. From what I understand, the most fundamental difference between the two proposals is the following:
> >>
> >> * In order to support a graph structure, do we extend Transformer/Estimator, or do we introduce a new set of API? *
> >>
> >> Proposal 1 tries to keep one set of API, which is based on Transformer/Estimator/Pipeline. More specifically, it does the following:
> >> - Make Transformer and Estimator multi-input and multi-output (MIMO).
> >> - Introduce Graph/GraphModel as counterparts of Pipeline/PipelineModel.
> >>
> >> Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is. Instead, it introduces AlgoOperators to support the graph structure. The AlgoOperators are general-purpose graph nodes supporting MIMO. They are independent of Pipeline / Transformer / Estimator.
> >>
> >> My two cents:
> >>
> >> I think it is a big advantage to have a single set of API rather than two independent sets of API, if possible. But I would suggest we change the current proposal 1 a little bit, by learning from proposal 2.
> >>
> >> What I like about proposal 1:
> >> 1. A single set of API, symmetric in Graph/GraphModel and Pipeline/PipelineModel.
> >> 2. Keeping most of the benefits of Transformer/Estimator, including the fit-then-transform relation and the save/load capability.
> >>
> >> However, proposal 1 also introduces some changes that I am not sure about:
> >>
> >> 1. The most controversial part of proposal 1 is whether we should extend Transformer/Estimator/Pipeline. In fact, different projects have slightly different designs for Transformer/Estimator/Pipeline, so I think it is OK to extend them. However, there are some commonly recognized core semantics that we ideally want to keep. To me these core semantics are:
> >> 1. Transformer is a Data -> Data conversion; Estimator deals with the Data -> Model conversion.
> >> 2. Estimator.fit() gives a Transformer, and users can just call Transformer.transform() to perform inference.
> >> To me, as long as these core semantics are kept, an extension to the API seems acceptable.
> >>
> >> Proposal 1 extends the semantics of Transformer from a Data -> Data conversion to a generic Table -> Table conversion, and claims it is equivalent to the "AlgoOperator" in proposal 2 as a general-purpose graph node. That does change the first semantic. That said, this might just be a naming problem. One possible solution is to have a new subclass of Stage without strong conventional semantics, e.g. "AlgoOp" if we borrow the name from proposal 2, and let Transformer extend it. Just like a PipelineModel is a more specific Transformer, a Transformer would be a more specific "AlgoOp". If we do that, the processing logic that people don't feel comfortable calling a Transformer can just be put into an "AlgoOp" and thus can still be added to a Pipeline / Graph. This borrows the advantage of proposal 2. In other words, this essentially makes "AlgoOp" the equivalent of "AlgoOperator" in proposal 2, but allows it to be added to a Graph or Pipeline if people want to.
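> >> To sketch that hierarchy concretely (names and method signatures are only for illustration, not part of either proposal):
> >>
> >> // Stage is assumed to be the existing base interface for pipeline stages.
> >> // Generic graph node without the Data -> Data convention.
> >> public interface AlgoOp extends Stage {
> >>     Table[] process(Table... inputs);
> >> }
> >>
> >> // A Transformer is a more specific AlgoOp, just like a PipelineModel
> >> // is a more specific Transformer.
> >> public interface Transformer extends AlgoOp {
> >>     Table[] transform(Table... inputs);
> >>
> >>     @Override
> >>     default Table[] process(Table... inputs) {
> >>         return transform(inputs);
> >>     }
> >> }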
> >> This also gives my thoughts regarding the concern that making the Transformer/Estimator MIMO would break the convention of the single-input single-output (SISO) Transformer/Estimator. Since this does not change the core semantics of Transformer/Estimator, it sounds like an intuitive extension to me.
> >>
> >> 2. Another semantics-related case brought up was heterogeneous topologies in training and inference. In that case, the input of an Estimator would be different from the input of the Transformer returned by Estimator.fit(). The example of this case is Word2Vec, where the input of the Estimator is an article while the input to the Transformer is a single word. The well-recognized ML Pipeline doesn't seem to support this case, because it assumes the inputs of the Estimator and the corresponding Transformer are the same.
> >>
> >> Both proposal 1 and proposal 2 leave this case unsupported in the Pipeline. To support it:
> >> - Proposal 1 adds support for such cases in Graph/GraphModel by introducing "EstimatorInput" and "TransformerInput". The downside is that it complicates the API.
> >> - Proposal 2 leaves it to users to construct two different DAGs for training and inference respectively. This means users would have to construct the DAG twice even if most parts of the DAG are the same in training and inference.
> >>
> >> My gut feeling is that this is not a critical difference, because such a heterogeneous topology is sort of a corner case. Most users do not need to worry about it. For those who do need it, either proposal 1 or proposal 2 seems acceptable to me. That said, it looks like with proposal 1 users can still choose to write the program twice without using the Graph API, just like what they do in proposal 2. So technically speaking, proposal 1 is more flexible and allows users to choose either flavor. On the other hand, one could argue that proposal 1 may confuse users with these two flavors. Although personally I feel it is clear, I am open to other ideas.
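> >> For illustration, a hypothetical sketch of how the Graph flavor might express the Word2Vec case. The EstimatorInput/TransformerInput names come from proposal 1; the GraphBuilder methods here are made up to show the idea, not taken from the proposal.
> >>
> >> GraphBuilder builder = new GraphBuilder();
> >> TableId articles = builder.createTableId(); // consumed only when fitting
> >> TableId words = builder.createTableId();    // consumed only at inference
> >>
> >> // Train on articles, but let the fitted transformer consume words.
> >> TableId embeddings = builder.addStage(
> >>     new Word2Vec(),
> >>     new EstimatorInput(articles),
> >>     new TransformerInput(words));
> >>
> >> Graph graph = builder.build(
> >>     new TableId[] {articles, words}, new TableId[] {embeddings});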
> >> 3. Lastly, there was a concern about proposal 1: some Estimators can no longer be added to a Pipeline, while the original Pipeline accepts any Estimator.
> >>
> >> It seems that users always have to make sure the input schema required by the Estimator matches the input table. So even with the existing Pipeline, people cannot naively add any Estimator to a pipeline. Admittedly, proposal 1 adds some more requirements, namely 1) the number of inputs needs to match the number of outputs of the previous stage, and 2) the Estimator must not generate a transformer with a different required input schema (the heterogeneous case mentioned above). However, given that these mismatches will result in exceptions at compile time, just as when users put in an Estimator with a mismatched input schema, I personally find this does not change the user experience much.
> >>
> >> So to summarize my thoughts on this fundamental difference:
> >> - In general, I personally prefer having one set of API.
> >> - The current proposal 1 may need some improvements in some cases, by borrowing ideas from proposal 2.
> >>
> >> A few other differences that I consider non-fundamental:
> >>
> >> * Do we need a top-level encapsulation API for an algorithm? *
> >>
> >> Proposal 1 has the concept of a Graph, which encapsulates an entire algorithm to provide a unified API following the same semantics as Estimator/Transformer. Users can choose not to package everything into a Graph, and instead just write their own program and wrap it in an ordinary function.
> >>
> >> Proposal 2 does not have a top-level API such as Graph. Instead, users can write an arbitrary function if they want to.
> >>
> >> From what I understand, in proposal 1 users may still choose to ignore the Graph API and simply construct a DAG by themselves by calling transform() and fit(), or by calling AlgoOp.process() if we add "AlgoOp" to proposal 1 as I suggested earlier. So Graph is just an additional way to construct a graph; people can use Graph in a similar way as they do the Pipeline/PipelineModel. In other words, there is no conflict between proposal 1 and proposal 2.
> >>
> >> * The ways to describe a Graph? *
> >>
> >> Proposal 1 gives two ways to construct a DAG:
> >> 1. the raw API using Estimator/Transformer (potentially "AlgoOp" as well);
> >> 2. the GraphBuilder API.
> >>
> >> Proposal 2 only gives the raw API of AlgoOperator. It assumes there is a main output and some other side outputs, so it can call algoOp1.linkFrom(algoOp2) without specifying the index of the output, at the cost of wrapping all the Tables into an AlgoOperator.
> >>
> >> The usability argument was mostly around the raw APIs. I don't think the two APIs differ much from each other. Under the same assumption, proposal 1 and proposal 2 can probably achieve very similar levels of usability when describing a Graph, if not exactly the same.
> >>
> >> There are some other differences/arguments between the two proposals, but I don't think they are fundamental. And just like the cases mentioned above, the two proposals can easily learn from each other.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >> On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <lindon...@gmail.com> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two different designs to extend the Flink ML API to support more use cases, e.g. expressing a DAG of preprocessing and training logic. These two designs have been documented in FLIP-173 <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783>.
> >>>
> >>> We have different opinions on the usability and ease of understanding of the proposed APIs. It would be really useful to have comments on those designs from the open source community and to learn your preferences.
> >>>
> >>> To facilitate the discussion, we have summarized our design principles and opinions in this Google doc <https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do>. Code snippets for a few example use cases are also provided in the doc to demonstrate the difference between the two solutions.
> >>>
> >>> This Flink ML API is super important to the future of the Flink ML library. Please feel free to reply to this email thread or comment in the Google doc directly.
> >>>
> >>> Thank you!
> >>> Dong, Zhipeng, Fan