Re: [DISCUSS] FLIP-173: Support DAG of algorithms (Flink ML)

Fan Hong Thu, 19 Aug 2021 07:02:16 -0700

Hi, Mingliang and Becket,


Mingliang, thank you for providing a real-world case of heterogeneous
topologies in the training and inference phases. Becket has given you two
options to choose from.


Personally, I think Becket's two options are described in an oversimplified
way and may be somewhat misleading.

Here, I would like to add some of my thoughts:

   1. Proposal Option-2 does *NOT* have to implement two DAGs in *ALL*
   cases.
   In most cases, the best practice in Proposal Option-2 is to put the
   common part (the inference part) into a Pipeline. In the training phase,
   the data is preprocessed by AlgoOps or another pipeline, and then fed to
   Pipeline.fit(). The output PipelineModel can be used directly in the
   inference phase. The code will be much clearer and cleaner than the
   complicated manipulation of *estimatorInputs* and *transformerInput* in
   the Graph API.
   2. Proposal Option-1 can *NOT ALWAYS* encapsulate the heterogeneous
   topology with the Graph/GraphBuilder API. In [1], we already listed some
   cases where the Graph API fails to encapsulate a complicated topology,
   and we also presented concrete scenarios we encountered. This limitation
   could cost extra effort when you develop your ML task incrementally.
   3. *EVEN IF* Mingliang's cases happen to be among the rare ones where
   Becket's two options apply, the actual differences between the two
   options are shown in the code snippets in [1]. I personally do not think
   implementing two DAGs brings much overhead. You may check those code
   snippets if you would like.
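
To make item 1 concrete, here is a minimal, self-contained Java sketch of
the pattern. It does not use the real Flink ML API: Transformer, Estimator,
Pipeline, MeanCenterer, dropInvalid and train are hypothetical stand-ins
invented for illustration, and a List<Double> stands in for a Table. The
snippets in [1] remain the reference for the actual proposals.

```java
import java.util.ArrayList;
import java.util.List;

public class Option2Sketch {
    // Hypothetical SISO abstractions; a List<Double> stands in for a Table.
    interface Transformer { List<Double> transform(List<Double> input); }
    interface Estimator { Transformer fit(List<Double> input); }

    /** Toy Estimator: learns the mean and returns a Transformer that subtracts it. */
    static class MeanCenterer implements Estimator {
        @Override
        public Transformer fit(List<Double> input) {
            double sum = 0.0;
            for (double v : input) sum += v;
            final double mean = input.isEmpty() ? 0.0 : sum / input.size();
            return rows -> {
                List<Double> out = new ArrayList<>();
                for (double v : rows) out.add(v - mean);
                return out;
            };
        }
    }

    /** The common (inference) part: fitting it yields the reusable model. */
    static class Pipeline implements Estimator {
        private final List<Estimator> stages;
        Pipeline(List<Estimator> stages) { this.stages = stages; }

        @Override
        public Transformer fit(List<Double> input) {
            List<Transformer> fitted = new ArrayList<>();
            List<Double> current = input;
            for (Estimator stage : stages) {
                Transformer model = stage.fit(current);
                fitted.add(model);
                current = model.transform(current);
            }
            // Plays the role of a PipelineModel: applies the fitted stages in order.
            return rows -> {
                List<Double> cur = rows;
                for (Transformer step : fitted) cur = step.transform(cur);
                return cur;
            };
        }
    }

    /** Training-only preprocessing (the AlgoOp role); never part of the model. */
    static List<Double> dropInvalid(List<Double> raw) {
        List<Double> out = new ArrayList<>();
        for (double v : raw) if (!Double.isNaN(v)) out.add(v);
        return out;
    }

    /** Training phase: preprocess outside the Pipeline, then call fit(). */
    static Transformer train(List<Double> rawTrainingData) {
        List<Double> cleaned = dropInvalid(rawTrainingData);
        return new Pipeline(List.of(new MeanCenterer())).fit(cleaned);
    }

    public static void main(String[] args) {
        Transformer model = train(List.of(1.0, Double.NaN, 3.0));
        // Inference phase: one dataset in, one dataset out.
        System.out.println(model.transform(List.of(2.0, 4.0))); // prints [0.0, 2.0]
    }
}
```

Only the training-specific step lives outside the Pipeline, so the fitted
model can be used for inference without constructing a second DAG.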


As far as I can see, most inference/predict pipelines are used for online
serving (in offline inference, there is no need to export models). In
online serving, the corresponding pipeline can only accept one dataset and
produce one dataset. This means item 1 above applies: Proposal Option-2
does the same thing in a clear and clean way.


So, Mingliang, if it is not too much trouble, could you share more
information about your scenarios and reconsider them in light of the
supplementary information above?


[1]
https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do


Sincerely,

Fan Hong

On Fri, Aug 6, 2021 at 3:56 PM Becket Qin <[email protected]> wrote:

> Hi Zhipeng,
>
> Yes, I agree that the key difference between the two options is how they
> support MIMO.
>
> My main concern with option 2 is the potentially inconsistent availability
> of algorithms in the two sets of API. In order to make an algorithm available
> to both sets of API, people have to implement the same algorithm with two
> different APIs. And sometimes an algorithm available as AlgoOps may not
> exist as a Transformer. This seems pretty confusing to the users.
>
> Therefore, personally speaking, I am in favor of one set of API. Also I
> feel that the MIMO extension of the Transformer/Estimator API is still
> quite intuitive. However, I understand that others may think differently.
> So it would be good to see the opinion from more people.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Tue, Jul 20, 2021 at 7:10 PM Zhipeng Zhang <[email protected]>
> wrote:
>
>> Hi Becket,
>>
>> Thanks for the review! I totally agree that it would be easier for people
>> to discuss if we can list the fundamental difference between these two
>> proposals. (So I want to make the discussion even shorter)
>>
>> In my opinion, the fundamental difference between proposal-1 and
>> proposal-2 is how they support the multi-input multi-output (MIMO) machine
>> learning algorithms.
>>
>> Proposal-1 supports MIMO by extending Transformer/Estimator/Pipeline to
>> take multiple inputs and output multiple outputs.
>>
>> Proposal-2 does not change the definition of
>> Transformer/Estimator/Pipeline. Rather, to support MIMO it comes up with a
>> new abstraction --- AlgoOperator, which is essentially an abstraction for
>> machine learning functions. That is, proposal-2 employs well-recognized
>> Transformer/Estimator/Pipeline to support single-input single-output (SISO)
>> machine learning algorithms and AlgoOperator to support MIMO.
>>
>> In my opinion, the benefit of proposal-1 is that there is only one set of
>> API and it is clean. However, it breaks the user habits (SISO for
>> Transformer/Estimator/Pipeline). Users have to think more than before when
>> using the new Transformer/Estimator/Pipeline. [1]
>>
>> The benefit of proposal-2 is that it does not change anything about the
>> well-recognized Transformer/Estimator/Pipeline and existing users (e.g.,
>> Spark MLlib users) would be happy.
>> However, as you mentioned, proposal-2 introduces a new abstraction
>> (AlgoOperator), which may increase the burden for understanding.
>>
>> [1]
>> https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do/edit#heading=h.c2qr9r64btd9
>>
>>
>> Thanks,
>>
>> Zhipeng Zhang
>>
>> On Tue, Jul 20, 2021 at 11:42 AM Becket Qin <[email protected]> wrote:
>>
>>> Hi Dong, Zhipeng and Fan,
>>>
>>> Thanks for the detailed proposals. It is quite a lot of reading! Given
>>> that we are introducing a lot of stuff here, I find that it might be easier
>>> for people to discuss if we can list the fundamental differences first.
>>> From what I understand, the most fundamental difference between the two
>>> proposals is the following:
>>>
>>> * In order to support graph structure, do we extend
>>> Transformer/Estimator, or do we introduce a new set of API? *
>>>
>>> Proposal 1 tries to keep one set of API, which is based on
>>> Transformer/Estimator/Pipeline. More specifically, it does the following:
>>>     - Make Transformer and Estimator multi-input and multi-output
>>> (MIMO).
>>>     - Introduce a Graph/GraphModel as counterparts of
>>> Pipeline/PipelineModel.
>>>
>>> Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is.
>>> Instead, it introduces AlgoOperators to support the graph structure. The
>>> AlgoOperators are general-purpose graph nodes supporting MIMO. They are
>>> independent of Pipeline / Transformer / Estimator.
>>>
>>>
>>> My two cents:
>>>
>>> I think it is a big advantage to have a single set of API rather than
>>> two independent sets of API, if possible. But I would suggest we change the
>>> current proposal 1 a little bit, by learning from proposal 2.
>>>
>>> What I like about proposal 1:
>>> 1. A single set of API, symmetric in Graph/GraphModel and
>>> Pipeline/PipelineModel.
>>> 2. Keeping most of the benefits from Transformer/Estimator, including
>>> the fit-then-transform relation and save/load capability.
>>>
>>> However, proposal 1 also introduced some changes that I am not sure
>>> about:
>>>
>>> 1. The most controversial part of proposal 1 is whether we should extend
>>> Transformer/Estimator/Pipeline. In fact, different projects have
>>> slightly different designs for Transformer/Estimator/Pipeline. So I think
>>> it is OK to extend it. However, there are some commonly recognized core
>>> semantics that we ideally want to keep. To me these core semantics are:
>>>   1. Transformer is a Data -> Data conversion, Estimator deals with Data
>>> -> Model conversion.
>>>   2. Estimator.fit() gives a Transformer, and users can just call
>>> Transformer.transform() to perform inference.
>>> To me, as long as these core semantics are kept, extension to the API
>>> seems acceptable.
>>>
>>> Proposal 1 extends the semantic of Transformer from Data -> Data
>>> conversion to generic Table -> Table conversion, and claims it is
>>> equivalent to "AlgoOperator" in proposal 2 as a general-purpose graph node.
>>> It does change the first semantic. That said, this might just be a naming
>>> problem, though. One possible solution to this problem is having a new
>>> subclass of Stage without strong conventional semantics, e.g. "AlgoOp" if
>>> we borrow the name from proposal 2, and let Transformer extend it. Just
>>> like a PipelineModel is a more specific Transformer, a Transformer would be
>>> a more specific "AlgoOp". If we do that, processing logic that people are
>>> not comfortable calling a Transformer can just be put into an "AlgoOp"
>>> and thus can still be added to a Pipeline / Graph. This borrows the
>>> advantage of proposal 2. In other words, this essentially makes the
>>> "AlgoOp" the equivalent of "AlgoOperator" in proposal 2, but allows it to
>>> be added to the Graph and Pipeline if people want to.
>>>
>>> This also covers my thoughts on the concern that making
>>> Transformer/Estimator MIMO would break the convention of single-input
>>> single-output (SISO) Transformer/Estimator. Since this does not change the
>>> core semantic of Transformer/Estimator, it sounds like an intuitive
>>> extension to me.
>>>
>>> 2. Another semantics-related case brought up was heterogeneous topologies
>>> in training and inference. In that case, the input of an Estimator would be
>>> different from the input of the Transformer returned by Estimator.fit().
>>> An example of this case is Word2Vec, where the input of the Estimator
>>> would be an article while the input to the Transformer is a single word.
>>> The well recognized ML Pipeline doesn't seem to support this case, because
>>> it assumes the input of the Estimator and corresponding Transformer are the
>>> same.
>>>
>>> Both proposal 1 and proposal 2 leave this case unsupported in the
>>> Pipeline. To support this case,
>>>    - Proposal 1 adds support to such cases in the Graph/GraphModel by
>>> introducing "EstimatorInput" and "TransformerInput". The downside is that
>>> it complicates the API.
>>>    - Proposal 2 leaves this to users to construct two different DAGs for
>>> training and inference respectively. This means users would have to
>>> construct the DAG twice even if most parts of the DAG are the same in
>>> training and inference.
>>>
>>> My gut feeling is that this is not a critical difference because such
>>> heterogeneous topology is sort of a corner case. Most users do not need to
>>> worry about this. For those who do need this, either proposal 1 or
>>> proposal 2 seems acceptable to me. That said, it looks like with proposal
>>> 1, users can still choose to write the program twice without using the
>>> Graph API, just like what they do in proposal 2. So technically speaking,
>>> proposal 1 is more flexible and allows users to choose either flavor. On
>>> the other hand, one could argue that proposal 1 may confuse users with
>>> these two flavors. Although personally I feel it is clear to me, I am open
>>> to other ideas.
>>>
>>> 3. Lastly, there was a concern about proposal 1 that some Estimators
>>> can no longer be added to the Pipeline while the original Pipeline accepts
>>> any Estimator.
>>>
>>> It seems that users have to always make sure the input schema required
>>> by the Estimator matches the input table. So even for the existing
>>> Pipeline, people cannot naively add any Estimator into a pipeline.
>>> Admittedly, proposal 1 added some more requirements, namely 1) the number
>>> of inputs needs to match the number of outputs of the previous stage, and
>>> 2) the Estimator does not generate a transformer with different required
>>> input schema (the heterogeneous case mentioned above). However, given that
>>> these mismatches will result in exceptions at compile time, just as when
>>> users add an Estimator with a mismatched input schema, personally I find it
>>> does not change the user experience much.
>>>
>>>
>>> So to summarize my thoughts on this fundamental difference:
>>>     - In general, I personally prefer having one set of API.
>>>     - The current proposal 1 may need some improvements in some cases,
>>> by borrowing something from proposal 2.
>>>
>>>
>>>
>>> A few other differences that I consider as non-fundamental:
>>>
>>> * Do we need a top level encapsulation API for an Algorithm? *
>>>
>>> Proposal 1 has a concept of Graph which encapsulates the entire
>>> algorithm to provide a unified API following the same semantic of
>>> Estimator/Transformer. Users can choose not to package everything into a
>>> Graph, but just write their own program and wrap it as an ordinary function.
>>>
>>> Proposal 2 does not have the top level API such as Graph. Instead, users
>>> can choose to write an arbitrary function if they want to.
>>>
>>> From what I understand, in proposal 1, users may still choose to ignore
>>> Graph API and simply construct a DAG by themselves by calling transform()
>>> and fit(), or calling AlgoOp.process() if we add "AlgoOp" to proposal 1 as
>>> I suggested earlier. So Graph is just an additional way to construct a
>>> graph - people can use Graph in a similar way as they do
>>> Pipeline/PipelineModel. In other words, there is no conflict between
>>> proposal 1 and proposal 2.
>>>
>>>
>>> * The ways to describe a Graph? *
>>>
>>> Proposal 1 gives two ways to construct a DAG.
>>> 1. the raw API using Estimator/Transformer(potentially "AlgoOp" as
>>> well).
>>> 2. using the GraphBuilder API.
>>>
>>> Proposal 2 only gives the raw API of AlgoOperator. It assumes there is a
>>> main output and some other side outputs, so it can call
>>> algoOp1.linkFrom(algoOp2) without specifying the index of the output, at
>>> the cost of wrapping all the Tables into an AlgoOperator.
>>>
>>> The usability argument was mostly around the raw APIs. I don't think the
>>> two APIs differ too much from each other. With the same assumption,
>>> proposal 1 and proposal 2 can probably achieve very similar levels of
>>> usability when describing a Graph, if not exactly the same.
>>>
>>>
>>> There are some other differences/arguments mentioned between the
>>> two proposals. However, I don't think they are fundamental. And just like
>>> the cases mentioned above, the two proposals can easily learn from each
>>> other.
>>>
>>> Thanks,
>>>
>>> Jiangjie (Becket) Qin
>>>
>>> On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two
>>>> different designs to extend the Flink ML API to support more use-cases,
>>>> e.g. expressing a DAG of preprocessing and training logic. These two
>>>> designs have been documented in FLIP-173
>>>> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783>.
>>>>
>>>> We have different opinions on the usability and the ease of
>>>> understanding of the proposed APIs. It would be really useful to have
>>>> comments on these designs from the open source community and to learn
>>>> your preferences.
>>>>
>>>> To facilitate the discussion, we have summarized our design principles
>>>> and opinions in this Google doc
>>>> <https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do>.
>>>> Code snippets for a few example use-cases are also provided in this doc
>>>> to demonstrate the difference between these two solutions.
>>>>
>>>> This Flink ML API is very important to the future of the Flink ML
>>>> library. Please feel free to reply to this email thread or comment in
>>>> the Google doc directly.
>>>>
>>>> Thank you!
>>>> Dong, Zhipeng, Fan
>>>>
>>>
>>
>> --
>> best,
>> Zhipeng
>>
>>
