Re: [DISCUSS] FLIP-173: Support DAG of algorithms (Flink ML)

Zhipeng Zhang Tue, 20 Jul 2021 04:10:23 -0700

Hi Becket,

Thanks for the review! I totally agree that it would be easier for people
to discuss if we can list the fundamental difference between these two
proposals. (So I want to make the discussion even shorter)


In my opinion, the fundamental difference between proposal-1 and proposal-2
is how they support the multi-input multi-output (MIMO) machine learning
algorithms.

Proposal-1 supports MIMO by extending Transformer/Estimator/Pipeline to
take multiple inputs and output multiple outputs.

Proposal-2 does not change the definition of
Transformer/Estimator/Pipeline. Rather, to support MIMO it comes up with a
new abstraction --- AlgoOperator, which is essentially an abstraction for
machine learning functions. That is, proposal-2 employs well-recognized
Transformer/Estimator/Pipeline to support single-input single-output (SISO)
machine learning algorithms and AlgoOperator to support MIMO.

In my opinion, the benefit of proposal-1 is that there is only one set of
API and it is clean. However, it breaks the user habits (SISO for
Transformer/Estimator/Pipeline). Users have to think more than before when
using the new Transformer/Estimator/Pipeline. [1]

The benefit of proposal-2 is that it does not change anything of the
well-recognized Transformer/Estimator/Pipeline and existing users (e.g.,
Spark MLlib users) would be happy.
However, as you mentioned, proposal-2 introduces a new abstraction
(AlgoOperator), which may increase the burden for understanding.

[1]
https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do/edit#heading=h.c2qr9r64btd9


Thanks,

Zhipeng Zhang

Becket Qin <becket....@gmail.com> 于2021年7月20日周二 上午11:42写道：

> Hi Dong, Zhipeng and Fan,
>
> Thanks for the detailed proposals. It is quite a lot of reading! Given
> that we are introducing a lot of stuff here, I find that it might be easier
> for people to discuss if we can list the fundamental differences first.
> From what I understand, the very fundamental difference between the two
> proposals is following:
>
> * In order to support graph structure, do we extend Transformer/Estimator,
> or do we introduce a new set of API? *
>
> Proposal 1 tries to keep one set of API, which is based on
> Transformer/Estimator/Pipeline. More specifically, it does the following:
>     - Make Transformer and Estimator multi-input and multi-output (MIMO).
>     - Introduce a Graph/GraphModel as counter parts of
> Pipeline/PipelineModel.
>
> Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is.
> Instead, it introduces AlgoOperators to support the graph structure. The
> AlgoOperators are general-purpose graph nodes supporting MIMO. They are
> independent of Pipeline / Transformer / Estimator.
>
>
> My two cents:
>
> I think it is a big advantage to have a single set of API rather than two
> independent sets of API, if possible. But I would suggest we change the
> current proposal 1 a little bit, by learning from proposal 2.
>
> What I like about proposal 1:
> 1. A single set of API, symmetric in Graph/GraphModel and
> Pipeline/PipelineModel.
> 2. Keeping most of the benefits from Transformer/Estimator, including the
> fit-then-transform relation and save/load capability.
>
> However, proposal 1 also introduced some changes that I am not sure about:
>
> 1. The most controversial part of proposal 1 is whether we should extend
> the Transformer/Estimator/Pipeline? In fact, different projects have
> slightly different designs for Transformer/Estimator/Pipeline. So I think
> it is OK to extend it. However, there are some commonly recognized core
> semantics that we ideally want to keep. To me these core semantics are:
>   1. Transformer is a Data -> Data conversion, Estimator deals with Data
> -> Model conversion.
>   2. Estimator.fit() gives a Transformer, and users can just call
> Transformer.transform() to perform inference.
> To me, as long as these core semantics are kept, extension to the API
> seems acceptable.
>
> Proposal 1 extends the semantic of Transformer from Data -> Data
> conversion to generic Table -> Table conversion, and claims it is
> equivalent to "AlgoOperator" in proposal 2 as a general-purpose graph node.
> It does change the first semantic. That said, this might just be a naming
> problem, though. One possible solution to this problem is having a new
> subclass of Stage without strong conventional semantics, e.g. "AlgoOp" if
> we borrow the name from proposal 2, and let Transformer extend it. Just
> like a PipelineModel is a more specific Transformer, a Transformer would be
> a more specific "AlgoOp". If we do that, the processing logic that people
> don't feel comfortable to be a Transformer can just be put into an "AlgoOp"
> and thus can still be added to a Pipeline / Graph. This borrows the
> advantage of proposal 2. In another word, this essentially makes the
> "AlgoOp" equivalent of "AlgoOperator" in proposal 2, but allows it to be
> added to the Graph and Pipeline if people want to.
>
> This also gives my thoughts regarding the concern that making the
> Transformer/Estimator to MIMO would break the convention of single input
> single output (SISO) Transformer/Estimator. Since this does not change the
> core semantic of Transformer/Estimator, it sounds an intuitive extension to
> me.
>
> 2. Another semantic related case brought up was heterogeneous topologies
> in training and inference. In that case, the input of an Estimator would be
> different from the input of the transformer returned by Estimator.fit().
> The example to this case is Word2Vec, where the input of the Estimator
> would be an article while the input to the Transformer is a single word.
> The well recognized ML Pipeline doesn't seem to support this case, because
> it assumes the input of the Estimator and corresponding Transformer are the
> same.
>
> Both proposal 1 and proposal 2 leaves this case unsupported in the
> Pipeline. To support this case,
>    - Proposal 1 adds support to such cases in the Graph/GraphModel by
> introducing "EstimatorInput" and "TransformerInput". The downside is that
> it complicates the API.
>    - Proposal 2 leaves this to users to construct two different DAG for
> training and inference respectively. This means users would have to
> construct the DAG twice even if most parts of the DAG are the same in
> training and inference.
>
> My gut feeling is that this is not a critical difference because such
> heterogeneous topology is sort of a corner case. Most users do not need to
> worry about this. For those who do need this, either proposal 1 and
> proposal 2 seems acceptable to me. That said, it looks that with proposal
> 1, users can still choose to write the program twice without using the
> Graph API, just like what they do in proposal 2. So technically speaking,
> proposal 1 is more flexible and allows users to choose either flavor. On
> the other hand, one could argue that proposal 1 may confuse users with
> these two flavors. Although personally I feel it is clear to me, I am open
> to other ideas.
>
> 3. Lastly, there was a concern about proposal 1 is that some Estimators
> can no longer be added to the Pipeline while the original Pipeline accepts
> any Estimator.
>
> It seems that users have to always make sure the input schema required by
> the Estimator matches the input table. So even for the existing Pipeline,
> people cannot naively add any Estimator into a pipeline. Admittedly,
> proposal 1 added some more requirements, namely 1) the number of inputs
> needs to match the number of outputs of the previous stage, and 2) the
> Estimator does not generate a transformer with different required input
> schema (the heterogeneous case mentioned above). However, given that these
> mismatches will result in exceptions at compile time, just like users put
> an Estimator with mismatched input schema, personally I find it does not
> change the user experience much.
>
>
> So to summarize my thoughts on this fundamental difference.
>     - In general, I personally prefer having one set of API.
>     - The current proposal 1 may need some improvements in some cases, by
> borrowing something from proposal 2.
>
>
>
> A few other differences that I consider as non-fundamental:
>
> * Do we need a top level encapsulation API for an Algorithm? *
>
> Proposal 1 has a concept of Graph which encapsulates the entire algorithm
> to provide a unified API following the same semantic of
> Estimator/Transformer. Users can choose not to package everything into a
> Graph, but just write their own program and wrap it as an ordinary function.
>
> Proposal 2 does not have the top level API such as Graph. Instead, users
> can choose to write an arbitrary function if they want to.
>
> From what I understand, in proposal 1, users may still choose to ignore
> Graph API and simply construct a DAG by themselves by calling transform()
> and fit(), or calling AlgoOp.process() if we add "AlgoOp" to proposal 1 as
> I suggested earlier. So Graph is just an additional way to construct a
> graph - people can use Graph in a similar way as they do to the
> Pipeline/Pipeline model. In another word, there is no conflict between
> proposal 1 and proposal 2.
>
>
> * The ways to describe a Graph? *
>
> Proposal 1 gives two ways to construct a DAG.
> 1. the raw API using Estimator/Transformer(potentially "AlgoOp" as well).
> 2. using the GraphBuilder API.
>
> Proposal 2 only gives the raw API of AlgoOpertor. It assumes there is a
> main output and some other side outputs, so it can call
> algoOp1.linkFrom(algoOp2) without specifying the index of the output, at
> the cost of wrapping all the Tables into an AlgoOperator.
>
> The usability argument was mostly around the raw APIs. I don't think the
> two APIs differ too much from each other. With the same assumption,
> proposal 1 and proposal 2 can probably achieve very similar levels of
> usability when describing a Graph, if not exactly the same.
>
>
> There are some more other differences/arguments mentioned between the two
> proposals. However, I don't think they are fundamental. And just like the
> cases mentioned above, the two proposals can easily learn from each other.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <lindon...@gmail.com> wrote:
>
>> Hi all,
>>
>> Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two
>> different
>> designs to extend Flink ML API to support more use-cases, e.g. expressing
>> a
>> DAG of preprocessing and training logics. These two designs have been
>> documented in FLIP-173
>> <
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783
>> >
>> 。
>>
>> We have different opinions on the usability and the ease-of-understanding
>> of the proposed APIs. It will be really useful to have comments of those
>> designs from the open source community and to learn your preferences.
>>
>> To facilitate the discussion, we have summarized our design principles and
>> opinions in this Google doc
>> <
>> https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do
>> >.
>> Code snippets for a few example use-cases are also provided in this doc to
>> demonstrate the difference between these two solutions.
>>
>> This Flink ML API is super important to the future of Flink ML library.
>> Please feel free to reply to this email thread or comment in the Google
>> doc
>> directly.
>>
>> Thank you!
>> Dong, Zhipeng, Fan
>>
>

-- 
best,
Zhipeng

Re: [DISCUSS] FLIP-173: Support DAG of algorithms (Flink ML)

Reply via email to