Hi Becket, Thank you for the detailed reply!
My understanding of your comments is that most of option-1 looks good except its change of the Transformer semantics. Please see my reply inline. On Tue, Jul 20, 2021 at 11:43 AM Becket Qin <becket....@gmail.com> wrote: > Hi Dong, Zhipeng and Fan, > > Thanks for the detailed proposals. It is quite a lot of reading! Given that > we are introducing a lot of stuff here, I find that it might be easier for > people to discuss if we can list the fundamental differences first. From > what I understand, the very fundamental difference between the two > proposals is following: > > * In order to support graph structure, do we extend Transformer/Estimator, > or do we introduce a new set of API? * > > Proposal 1 tries to keep one set of API, which is based on > Transformer/Estimator/Pipeline. More specifically, it does the following: > - Make Transformer and Estimator multi-input and multi-output (MIMO). > - Introduce a Graph/GraphModel as counter parts of > Pipeline/PipelineModel. > > Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is. > Instead, it introduces AlgoOperators to support the graph structure. The > AlgoOperators are general-purpose graph nodes supporting MIMO. They are > independent of Pipeline / Transformer / Estimator. > > > My two cents: > > I think it is a big advantage to have a single set of API rather than two > independent sets of API, if possible. But I would suggest we change the > current proposal 1 a little bit, by learning from proposal 2. > > What I like about proposal 1: > 1. A single set of API, symmetric in Graph/GraphModel and > Pipeline/PipelineModel. > 2. Keeping most of the benefits from Transformer/Estimator, including the > fit-then-transform relation and save/load capability. > > However, proposal 1 also introduced some changes that I am not sure about: > > 1. The most controversial part of proposal 1 is whether we should extend > the Transformer/Estimator/Pipeline? In fact, different projects have > slightly different designs for Transformer/Estimator/Pipeline. So I think > it is OK to extend it. However, there are some commonly recognized core > semantics that we ideally want to keep. To me these core semantics are: > 1. Transformer is a Data -> Data conversion, Estimator deals with Data -> > Model conversion. > 2. Estimator.fit() gives a Transformer, and users can just call > Transformer.transform() to perform inference. > To me, as long as these core semantics are kept, extension to the API seems > acceptable. > > Proposal 1 extends the semantic of Transformer from Data -> Data conversion > to generic Table -> Table conversion, and claims it is equivalent to > I am a bit confused by this description of the semantic change. By "from Data -> Data conversion to generic Table -> Table", do you mean "Table != Data"? I am hoping we can clearly define the semantics of Transformer and agree on its semantics definition within the community. Then we can see how option-1 changes its semantics and update option-1 as appropriate. "AlgoOperator" in proposal 2 as a general-purpose graph node. It does > change the first semantic. That said, this might just be a naming problem, > though. One possible solution to this problem is having a new subclass of > Stage without strong conventional semantics, e.g. "AlgoOp" if we borrow the > name from proposal 2, and let Transformer extend it. Just like a > Assuming that option-1 does change the meaning of Transformer (this point depends on the answer to the previous question), I am happy to update option-1 to address the concern. Here is how I would update option-1: - Add an interface named "MLOperator". This interface will be the same as the existing Transformer interface in option-1 except its name. - Make Transformer a subclass of MLOperator. Because Transformer has exactly the same API as MLOperator, no extra method is defined in Transformer. - Every other API/class in option-1 which deals with Transformer will now deal with MLOperator. This change proposed above basically renamed Transformer to address the semantic/naming concern. Does this sound good? > PipelineModel is a more specific Transformer, a Transformer would be a more > specific "AlgoOp". If we do that, the processing logic that people don't > feel comfortable to be a Transformer can just be put into an "AlgoOp" and > thus can still be added to a Pipeline / Graph. This borrows the advantage > of proposal 2. In another word, this essentially makes the "AlgoOp" > equivalent of "AlgoOperator" in proposal 2, but allows it to be added to > the Graph and Pipeline if people want to. > > This also gives my thoughts regarding the concern that making the > Transformer/Estimator to MIMO would break the convention of single input > single output (SISO) Transformer/Estimator. Since this does not change the > core semantic of Transformer/Estimator, it sounds an intuitive extension to > me. > > 2. Another semantic related case brought up was heterogeneous topologies in > training and inference. In that case, the input of an Estimator would be > different from the input of the transformer returned by Estimator.fit(). > The example to this case is Word2Vec, where the input of the Estimator > would be an article while the input to the Transformer is a single word. > The well recognized ML Pipeline doesn't seem to support this case, because > it assumes the input of the Estimator and corresponding Transformer are the > same. > > Both proposal 1 and proposal 2 leaves this case unsupported in the > Pipeline. To support this case, > - Proposal 1 adds support to such cases in the Graph/GraphModel by > introducing "EstimatorInput" and "TransformerInput". The downside is that > it complicates the API. > - Proposal 2 leaves this to users to construct two different DAG for > training and inference respectively. This means users would have to > construct the DAG twice even if most parts of the DAG are the same in > training and inference. > > My gut feeling is that this is not a critical difference because such > heterogeneous topology is sort of a corner case. Most users do not need to > worry about this. For those who do need this, either proposal 1 and > proposal 2 seems acceptable to me. That said, it looks that with proposal > 1, users can still choose to write the program twice without using the > Graph API, just like what they do in proposal 2. So technically speaking, > proposal 1 is more flexible and allows users to choose either flavor. On > the other hand, one could argue that proposal 1 may confuse users with > these two flavors. Although personally I feel it is clear to me, I am open > to other ideas. > > 3. Lastly, there was a concern about proposal 1 is that some Estimators can > no longer be added to the Pipeline while the original Pipeline accepts any > Estimator. > > It seems that users have to always make sure the input schema required by > the Estimator matches the input table. So even for the existing Pipeline, > people cannot naively add any Estimator into a pipeline. Admittedly, > proposal 1 added some more requirements, namely 1) the number of inputs > needs to match the number of outputs of the previous stage, and 2) the > Estimator does not generate a transformer with different required input > schema (the heterogeneous case mentioned above). However, given that these > mismatches will result in exceptions at compile time, just like users put > an Estimator with mismatched input schema, personally I find it does not > change the user experience much. > > > So to summarize my thoughts on this fundamental difference. > - In general, I personally prefer having one set of API. > - The current proposal 1 may need some improvements in some cases, by > borrowing something from proposal 2. > > > > A few other differences that I consider as non-fundamental: > > * Do we need a top level encapsulation API for an Algorithm? * > > Proposal 1 has a concept of Graph which encapsulates the entire algorithm > to provide a unified API following the same semantic of > Estimator/Transformer. Users can choose not to package everything into a > Graph, but just write their own program and wrap it as an ordinary > function. > > Proposal 2 does not have the top level API such as Graph. Instead, users > can choose to write an arbitrary function if they want to. > > From what I understand, in proposal 1, users may still choose to ignore > Graph API and simply construct a DAG by themselves by calling transform() > and fit(), or calling AlgoOp.process() if we add "AlgoOp" to proposal 1 as > I suggested earlier. So Graph is just an additional way to construct a > graph - people can use Graph in a similar way as they do to the > Pipeline/Pipeline model. In another word, there is no conflict between > proposal 1 and proposal 2. > > > * The ways to describe a Graph? * > > Proposal 1 gives two ways to construct a DAG. > 1. the raw API using Estimator/Transformer(potentially "AlgoOp" as well). > 2. using the GraphBuilder API. > > Proposal 2 only gives the raw API of AlgoOpertor. It assumes there is a > main output and some other side outputs, so it can call > algoOp1.linkFrom(algoOp2) without specifying the index of the output, at > the cost of wrapping all the Tables into an AlgoOperator. > > The usability argument was mostly around the raw APIs. I don't think the > two APIs differ too much from each other. With the same assumption, > proposal 1 and proposal 2 can probably achieve very similar levels of > usability when describing a Graph, if not exactly the same. > > > There are some more other differences/arguments mentioned between the two > proposals. However, I don't think they are fundamental. And just like the > cases mentioned above, the two proposals can easily learn from each other. > > Thanks, > > Jiangjie (Becket) Qin > > On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <lindon...@gmail.com> wrote: > > > Hi all, > > > > Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two > different > > designs to extend Flink ML API to support more use-cases, e.g. > expressing a > > DAG of preprocessing and training logics. These two designs have been > > documented in FLIP-173 > > < > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783 > > > > > 。 > > > > We have different opinions on the usability and the ease-of-understanding > > of the proposed APIs. It will be really useful to have comments of those > > designs from the open source community and to learn your preferences. > > > > To facilitate the discussion, we have summarized our design principles > and > > opinions in this Google doc > > < > > > https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do > > >. > > Code snippets for a few example use-cases are also provided in this doc > to > > demonstrate the difference between these two solutions. > > > > This Flink ML API is super important to the future of Flink ML library. > > Please feel free to reply to this email thread or comment in the Google > doc > > directly. > > > > Thank you! > > Dong, Zhipeng, Fan > > >