Hi Becket, Thanks for the review! I totally agree that it would be easier for people to discuss if we can list the fundamental difference between these two proposals. (So I want to make the discussion even shorter)
In my opinion, the fundamental difference between proposal-1 and proposal-2 is how they support the multi-input multi-output (MIMO) machine learning algorithms. Proposal-1 supports MIMO by extending Transformer/Estimator/Pipeline to take multiple inputs and output multiple outputs. Proposal-2 does not change the definition of Transformer/Estimator/Pipeline. Rather, to support MIMO it comes up with a new abstraction --- AlgoOperator, which is essentially an abstraction for machine learning functions. That is, proposal-2 employs well-recognized Transformer/Estimator/Pipeline to support single-input single-output (SISO) machine learning algorithms and AlgoOperator to support MIMO. In my opinion, the benefit of proposal-1 is that there is only one set of API and it is clean. However, it breaks the user habits (SISO for Transformer/Estimator/Pipeline). Users have to think more than before when using the new Transformer/Estimator/Pipeline. [1] The benefit of proposal-2 is that it does not change anything of the well-recognized Transformer/Estimator/Pipeline and existing users (e.g., Spark MLlib users) would be happy. However, as you mentioned, proposal-2 introduces a new abstraction (AlgoOperator), which may increase the burden for understanding. [1] https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do/edit#heading=h.c2qr9r64btd9 Thanks, Zhipeng Zhang Becket Qin <becket....@gmail.com> 于2021年7月20日周二 上午11:42写道: > Hi Dong, Zhipeng and Fan, > > Thanks for the detailed proposals. It is quite a lot of reading! Given > that we are introducing a lot of stuff here, I find that it might be easier > for people to discuss if we can list the fundamental differences first. > From what I understand, the very fundamental difference between the two > proposals is following: > > * In order to support graph structure, do we extend Transformer/Estimator, > or do we introduce a new set of API? * > > Proposal 1 tries to keep one set of API, which is based on > Transformer/Estimator/Pipeline. More specifically, it does the following: > - Make Transformer and Estimator multi-input and multi-output (MIMO). > - Introduce a Graph/GraphModel as counter parts of > Pipeline/PipelineModel. > > Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is. > Instead, it introduces AlgoOperators to support the graph structure. The > AlgoOperators are general-purpose graph nodes supporting MIMO. They are > independent of Pipeline / Transformer / Estimator. > > > My two cents: > > I think it is a big advantage to have a single set of API rather than two > independent sets of API, if possible. But I would suggest we change the > current proposal 1 a little bit, by learning from proposal 2. > > What I like about proposal 1: > 1. A single set of API, symmetric in Graph/GraphModel and > Pipeline/PipelineModel. > 2. Keeping most of the benefits from Transformer/Estimator, including the > fit-then-transform relation and save/load capability. > > However, proposal 1 also introduced some changes that I am not sure about: > > 1. The most controversial part of proposal 1 is whether we should extend > the Transformer/Estimator/Pipeline? In fact, different projects have > slightly different designs for Transformer/Estimator/Pipeline. So I think > it is OK to extend it. However, there are some commonly recognized core > semantics that we ideally want to keep. To me these core semantics are: > 1. Transformer is a Data -> Data conversion, Estimator deals with Data > -> Model conversion. > 2. Estimator.fit() gives a Transformer, and users can just call > Transformer.transform() to perform inference. > To me, as long as these core semantics are kept, extension to the API > seems acceptable. > > Proposal 1 extends the semantic of Transformer from Data -> Data > conversion to generic Table -> Table conversion, and claims it is > equivalent to "AlgoOperator" in proposal 2 as a general-purpose graph node. > It does change the first semantic. That said, this might just be a naming > problem, though. One possible solution to this problem is having a new > subclass of Stage without strong conventional semantics, e.g. "AlgoOp" if > we borrow the name from proposal 2, and let Transformer extend it. Just > like a PipelineModel is a more specific Transformer, a Transformer would be > a more specific "AlgoOp". If we do that, the processing logic that people > don't feel comfortable to be a Transformer can just be put into an "AlgoOp" > and thus can still be added to a Pipeline / Graph. This borrows the > advantage of proposal 2. In another word, this essentially makes the > "AlgoOp" equivalent of "AlgoOperator" in proposal 2, but allows it to be > added to the Graph and Pipeline if people want to. > > This also gives my thoughts regarding the concern that making the > Transformer/Estimator to MIMO would break the convention of single input > single output (SISO) Transformer/Estimator. Since this does not change the > core semantic of Transformer/Estimator, it sounds an intuitive extension to > me. > > 2. Another semantic related case brought up was heterogeneous topologies > in training and inference. In that case, the input of an Estimator would be > different from the input of the transformer returned by Estimator.fit(). > The example to this case is Word2Vec, where the input of the Estimator > would be an article while the input to the Transformer is a single word. > The well recognized ML Pipeline doesn't seem to support this case, because > it assumes the input of the Estimator and corresponding Transformer are the > same. > > Both proposal 1 and proposal 2 leaves this case unsupported in the > Pipeline. To support this case, > - Proposal 1 adds support to such cases in the Graph/GraphModel by > introducing "EstimatorInput" and "TransformerInput". The downside is that > it complicates the API. > - Proposal 2 leaves this to users to construct two different DAG for > training and inference respectively. This means users would have to > construct the DAG twice even if most parts of the DAG are the same in > training and inference. > > My gut feeling is that this is not a critical difference because such > heterogeneous topology is sort of a corner case. Most users do not need to > worry about this. For those who do need this, either proposal 1 and > proposal 2 seems acceptable to me. That said, it looks that with proposal > 1, users can still choose to write the program twice without using the > Graph API, just like what they do in proposal 2. So technically speaking, > proposal 1 is more flexible and allows users to choose either flavor. On > the other hand, one could argue that proposal 1 may confuse users with > these two flavors. Although personally I feel it is clear to me, I am open > to other ideas. > > 3. Lastly, there was a concern about proposal 1 is that some Estimators > can no longer be added to the Pipeline while the original Pipeline accepts > any Estimator. > > It seems that users have to always make sure the input schema required by > the Estimator matches the input table. So even for the existing Pipeline, > people cannot naively add any Estimator into a pipeline. Admittedly, > proposal 1 added some more requirements, namely 1) the number of inputs > needs to match the number of outputs of the previous stage, and 2) the > Estimator does not generate a transformer with different required input > schema (the heterogeneous case mentioned above). However, given that these > mismatches will result in exceptions at compile time, just like users put > an Estimator with mismatched input schema, personally I find it does not > change the user experience much. > > > So to summarize my thoughts on this fundamental difference. > - In general, I personally prefer having one set of API. > - The current proposal 1 may need some improvements in some cases, by > borrowing something from proposal 2. > > > > A few other differences that I consider as non-fundamental: > > * Do we need a top level encapsulation API for an Algorithm? * > > Proposal 1 has a concept of Graph which encapsulates the entire algorithm > to provide a unified API following the same semantic of > Estimator/Transformer. Users can choose not to package everything into a > Graph, but just write their own program and wrap it as an ordinary function. > > Proposal 2 does not have the top level API such as Graph. Instead, users > can choose to write an arbitrary function if they want to. > > From what I understand, in proposal 1, users may still choose to ignore > Graph API and simply construct a DAG by themselves by calling transform() > and fit(), or calling AlgoOp.process() if we add "AlgoOp" to proposal 1 as > I suggested earlier. So Graph is just an additional way to construct a > graph - people can use Graph in a similar way as they do to the > Pipeline/Pipeline model. In another word, there is no conflict between > proposal 1 and proposal 2. > > > * The ways to describe a Graph? * > > Proposal 1 gives two ways to construct a DAG. > 1. the raw API using Estimator/Transformer(potentially "AlgoOp" as well). > 2. using the GraphBuilder API. > > Proposal 2 only gives the raw API of AlgoOpertor. It assumes there is a > main output and some other side outputs, so it can call > algoOp1.linkFrom(algoOp2) without specifying the index of the output, at > the cost of wrapping all the Tables into an AlgoOperator. > > The usability argument was mostly around the raw APIs. I don't think the > two APIs differ too much from each other. With the same assumption, > proposal 1 and proposal 2 can probably achieve very similar levels of > usability when describing a Graph, if not exactly the same. > > > There are some more other differences/arguments mentioned between the two > proposals. However, I don't think they are fundamental. And just like the > cases mentioned above, the two proposals can easily learn from each other. > > Thanks, > > Jiangjie (Becket) Qin > > On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <lindon...@gmail.com> wrote: > >> Hi all, >> >> Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two >> different >> designs to extend Flink ML API to support more use-cases, e.g. expressing >> a >> DAG of preprocessing and training logics. These two designs have been >> documented in FLIP-173 >> < >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783 >> > >> 。 >> >> We have different opinions on the usability and the ease-of-understanding >> of the proposed APIs. It will be really useful to have comments of those >> designs from the open source community and to learn your preferences. >> >> To facilitate the discussion, we have summarized our design principles and >> opinions in this Google doc >> < >> https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do >> >. >> Code snippets for a few example use-cases are also provided in this doc to >> demonstrate the difference between these two solutions. >> >> This Flink ML API is super important to the future of Flink ML library. >> Please feel free to reply to this email thread or comment in the Google >> doc >> directly. >> >> Thank you! >> Dong, Zhipeng, Fan >> > -- best, Zhipeng