Hi Dong, Zhipeng and Fan,

Thanks for the detailed proposals. It is quite a lot of reading! Given that we are introducing a lot here, I think it will be easier for people to discuss if we list the fundamental differences first.

From what I understand, the fundamental difference between the two proposals is the following:

*In order to support graph structure, do we extend Transformer/Estimator, or do we introduce a new set of APIs?*

Proposal 1 tries to keep one set of APIs, based on Transformer/Estimator/Pipeline. More specifically, it does the following:
- Make Transformer and Estimator multi-input and multi-output (MIMO).
- Introduce Graph/GraphModel as counterparts of Pipeline/PipelineModel.

Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is. Instead, it introduces AlgoOperators to support the graph structure. The AlgoOperators are general-purpose graph nodes supporting MIMO. They are independent of Pipeline/Transformer/Estimator.

My two cents:

I think it is a big advantage to have a single set of APIs rather than two independent sets, if possible. But I would suggest we change the current proposal 1 a little, by learning from proposal 2.

What I like about proposal 1:
1. A single set of APIs, symmetric in Graph/GraphModel and Pipeline/PipelineModel.
2. It keeps most of the benefits of Transformer/Estimator, including the fit-then-transform relation and the save/load capability.

However, proposal 1 also introduces some changes that I am not sure about:

1. The most controversial part of proposal 1 is whether we should extend Transformer/Estimator/Pipeline at all. In fact, different projects have slightly different designs for Transformer/Estimator/Pipeline, so I think it is OK to extend it. However, there are some commonly recognized core semantics that we ideally want to keep. To me these core semantics are:
   1. Transformer is a Data -> Data conversion; Estimator is a Data -> Model conversion.
   2. Estimator.fit() gives a Transformer, and users can just call Transformer.transform() to perform inference.

To me, as long as these core semantics are kept, extending the API seems acceptable.
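
As a rough illustration of the two core semantics above, here is a minimal Python sketch. It is my own toy example, not code from either proposal or from Flink ML: `Table`, `MeanCenter` and `MeanCenterModel` are made-up names, and a plain list of floats stands in for a real table type.

```python
from abc import ABC, abstractmethod
from typing import List

# Assumption: a plain list of floats stands in for a real table type.
Table = List[float]


class Transformer(ABC):
    """Core semantic 1: a Data -> Data conversion."""

    @abstractmethod
    def transform(self, data: Table) -> Table:
        ...


class Estimator(ABC):
    """Core semantic 2: fit() turns Data into a Model that is a Transformer."""

    @abstractmethod
    def fit(self, data: Table) -> Transformer:
        ...


class MeanCenterModel(Transformer):
    """Hypothetical fitted model: subtracts the mean learned during fit()."""

    def __init__(self, mean: float) -> None:
        self.mean = mean

    def transform(self, data: Table) -> Table:
        return [x - self.mean for x in data]


class MeanCenter(Estimator):
    """Hypothetical estimator: learns the mean of the training data."""

    def fit(self, data: Table) -> MeanCenterModel:
        return MeanCenterModel(sum(data) / len(data))


# fit-then-transform: training data in, Transformer out, then inference.
model = MeanCenter().fit([1.0, 2.0, 3.0])
centered = model.transform([2.0, 4.0])
```

Any extension that still lets users write the last two lines in this shape keeps the semantics I care about.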

Proposal 1 extends the semantic of Transformer from Data -> Data conversion to generic Table -> Table conversion, and claims it is equivalent to "AlgoOperator" in proposal 2 as a general-purpose graph node. It does change the first semantic, although this might just be a naming problem.

One possible solution is to have a new subclass of Stage without strong conventional semantics, e.g. "AlgoOp" if we borrow the name from proposal 2, and let Transformer extend it. Just as a PipelineModel is a more specific Transformer, a Transformer would be a more specific "AlgoOp". If we do that, any processing logic that people don't feel comfortable calling a Transformer can be put into an "AlgoOp" and can still be added to a Pipeline/Graph. This borrows the advantage of proposal 2. In other words, this essentially makes "AlgoOp" the equivalent of "AlgoOperator" in proposal 2, but allows it to be added to the Graph and Pipeline if people want to.

This also covers my thoughts on the concern that making Transformer/Estimator MIMO would break the convention of single-input single-output (SISO) Transformer/Estimator. Since this does not change the core semantics of Transformer/Estimator, it sounds like an intuitive extension to me.

2. Another semantics-related case that was brought up is heterogeneous topologies in training and inference. In that case, the input of an Estimator would be different from the input of the Transformer returned by Estimator.fit(). An example of this case is Word2Vec, where the input of the Estimator would be an article while the input of the Transformer is a single word. The well-recognized ML Pipeline doesn't seem to support this case, because it assumes that the inputs of the Estimator and the corresponding Transformer are the same. Both proposal 1 and proposal 2 leave this case unsupported in the Pipeline.
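
To make the input mismatch concrete, here is a toy Python sketch. It is only an illustration under my own assumptions: `ToyWord2Vec` and `ToyWord2VecModel` are hypothetical names, and the "vector" is just a pair of counts, not real Word2Vec training.

```python
from typing import Dict, List

# Assumption: an article is modeled as a list of words.
Article = List[str]


class ToyWord2VecModel:
    """Inference side: the input is a list of single words, not articles."""

    def __init__(self, vectors: Dict[str, List[float]]) -> None:
        self.vectors = vectors

    def transform(self, words: List[str]) -> List[List[float]]:
        # Words never seen in training map to a zero vector in this toy version.
        return [self.vectors.get(word, [0.0, 0.0]) for word in words]


class ToyWord2Vec:
    """Training side: the input is a list of whole articles."""

    def fit(self, articles: List[Article]) -> ToyWord2VecModel:
        counts: Dict[str, int] = {}
        doc_freq: Dict[str, int] = {}
        for article in articles:
            for word in article:
                counts[word] = counts.get(word, 0) + 1
            for word in set(article):
                doc_freq[word] = doc_freq.get(word, 0) + 1
        # Stand-in "embedding": [total occurrences, number of articles containing the word].
        vectors = {w: [float(counts[w]), float(doc_freq[w])] for w in counts}
        return ToyWord2VecModel(vectors)


articles = [["flink", "ml", "flink"], ["ml", "api"]]
model = ToyWord2Vec().fit(articles)            # fit() consumes articles
result = model.transform(["flink", "unseen"])  # transform() consumes words
```

A Pipeline that hands the same table to fit() and then to transform() cannot express this, because the two calls need differently shaped inputs.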

To support this case:
- Proposal 1 adds support for such cases in the Graph/GraphModel by introducing "EstimatorInput" and "TransformerInput". The downside is that it complicates the API.
- Proposal 2 leaves it to users to construct two different DAGs, for training and inference respectively. This means users would have to construct the DAG twice even if most parts of the DAG are the same in training and inference.

My gut feeling is that this is not a critical difference, because such a heterogeneous topology is something of a corner case. Most users do not need to worry about it. For those who do need it, either proposal 1 or proposal 2 seems acceptable to me. That said, it looks like with proposal 1 users can still choose to write the program twice without using the Graph API, just as they would in proposal 2. So technically speaking, proposal 1 is more flexible and allows users to choose either flavor. On the other hand, one could argue that proposal 1 may confuse users with these two flavors. Although personally I find it clear, I am open to other ideas.

3. Lastly, there was a concern that with proposal 1 some Estimators can no longer be added to the Pipeline, while the original Pipeline accepts any Estimator.

It seems that users always have to make sure that the input schema required by the Estimator matches the input table. So even with the existing Pipeline, people cannot naively add any Estimator to a pipeline. Admittedly, proposal 1 adds some more requirements, namely 1) the number of inputs needs to match the number of outputs of the previous stage, and 2) the Estimator must not generate a Transformer with a different required input schema (the heterogeneous case mentioned above). However, given that these mismatches will result in exceptions at compile time, just as when users add an Estimator with a mismatched input schema, personally I find it does not change the user experience much.

So, to summarize my thoughts on this fundamental difference:
- In general, I personally prefer having one set of APIs.
- The current proposal 1 may need some improvements in some cases, by borrowing something from proposal 2.

A few other differences that I consider non-fundamental:

*Do we need a top level encapsulation API for an Algorithm?*

Proposal 1 has a concept of Graph, which encapsulates the entire algorithm to provide a unified API following the same semantics as Estimator/Transformer. Users can choose not to package everything into a Graph, but just write their own program and wrap it as an ordinary function.

Proposal 2 does not have a top level API such as Graph. Instead, users can choose to write an arbitrary function if they want to.

From what I understand, in proposal 1 users may still choose to ignore the Graph API and simply construct a DAG themselves by calling transform() and fit(), or by calling AlgoOp.process() if we add "AlgoOp" to proposal 1 as I suggested earlier. So Graph is just an additional way to construct a graph; people can use Graph in a similar way as they use Pipeline/PipelineModel. In other words, there is no conflict between proposal 1 and proposal 2.

*The ways to describe a Graph?*

Proposal 1 gives two ways to construct a DAG:
1. The raw API using Estimator/Transformer (potentially "AlgoOp" as well).
2. The GraphBuilder API.

Proposal 2 only gives the raw API of AlgoOperator. It assumes there is a main output and some other side outputs, so it can call algoOp1.linkFrom(algoOp2) without specifying the index of the output, at the cost of wrapping all the Tables into an AlgoOperator.

The usability argument was mostly around the raw APIs. I don't think the two APIs differ too much from each other. With the same assumption, proposal 1 and proposal 2 can probably achieve very similar levels of usability when describing a Graph, if not exactly the same.

There are some other differences/arguments mentioned between the two proposals. However, I don't think they are fundamental.
And just like the cases mentioned above, the two proposals can easily learn from each other.

Thanks,

Jiangjie (Becket) Qin

On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <lindon...@gmail.com> wrote:

> Hi all,
>
> Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two different
> designs to extend Flink ML API to support more use-cases, e.g. expressing a
> DAG of preprocessing and training logics. These two designs have been
> documented in FLIP-173
> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783>.
>
> We have different opinions on the usability and the ease-of-understanding
> of the proposed APIs. It will be really useful to have comments of those
> designs from the open source community and to learn your preferences.
>
> To facilitate the discussion, we have summarized our design principles and
> opinions in this Google doc
> <https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do>.
> Code snippets for a few example use-cases are also provided in this doc to
> demonstrate the difference between these two solutions.
>
> This Flink ML API is super important to the future of Flink ML library.
> Please feel free to reply to this email thread or comment in the Google doc
> directly.
>
> Thank you!
> Dong, Zhipeng, Fan
>