Thanks Mingliang. It is super helpful to get your input. At this point,
the FLIP mentions two ways to support heterogeneous topologies in the
training and inference phases.

1. Create two separate DAGs (or code paths) for training and inference
respectively.
2. An encapsulation API called Graph, which takes estimatorInputs for
training and transformerInputs for inference. [1]

It is worth mentioning that the Graph API does not conflict with the first
way. It simply combines the two separate DAGs into one for the
heterogeneous case we are discussing, and users can make fit and transform
calls just like they do with a Pipeline. However, it does mean that users
have to describe the Graph rather than calling Transformer.transform() and
Estimator.fit() themselves, which introduces some complexity.
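
To make the second way concrete, here is a rough sketch of what describing
such a Graph could look like. Only the build(...) signature comes from [1];
GraphBuilder, createTableId() and addEstimator() are made-up names for
illustration rather than an agreed API:

    // Sketch only: one Graph whose training and inference paths differ.
    GraphBuilder builder = new GraphBuilder();           // hypothetical builder
    TableId estimatorInput = builder.createTableId();    // consumed by fit()
    TableId transformerInput = builder.createTableId();  // consumed by transform()
    TableId output = builder.addEstimator(               // hypothetical wiring call
        new MyEstimator(), estimatorInput, transformerInput);

    // Signature from [1]: build(estimatorInputs, transformerInputs,
    // outputs, inputStates, outputStates).
    Graph graph = builder.build(
        new TableId[] {estimatorInput},
        new TableId[] {transformerInput},
        new TableId[] {output},
        new TableId[0],
        new TableId[0]);

    // Afterwards, users call fit and transform just like with a Pipeline:
    // training consumes the estimatorInputs, inference the transformerInputs.
    GraphModel model = graph.fit(trainingTable);
    Table prediction = model.transform(servingTable)[0];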

I am curious which way you would prefer when building your system?

Thanks,

Jiangjie (Becket) Qin

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181311360&moved=true#:~:text=%C2%A0%C2%A0%C2%A0%C2%A0Graph%20build(TableId%5B%5D%20estimatorInputs%2C%20TableId%5B%5D%20transformerInputs%2C%20TableId%5B%5D%20outputs%2C%20TableId%5B%5D%20inputStates%2C%20TableId%5B%5D%20outputStates)%20%7B...%7D

On Mon, Aug 9, 2021 at 7:46 PM 青雉(祁明良) <m...@xiaohongshu.com> wrote:

> Hi all,
>
> This is Mingliang, a machine learning engineer in the recommendation area.
> I see there is a discussion about "heterogeneous topologies in training and
> inference". This is actually a very common case in recommendation systems,
> especially in CTR prediction tasks. For training, data is usually
> compressed and optimized for storage, and additional feature extraction
> and transformation are applied before the training task. For online
> inference, data is usually well prepared by the recommendation engine and
> optimized for latency.
>
> Best,
> Mingliang
>
> > On Aug 6, 2021, at 3:55 PM, Becket Qin <becket....@gmail.com> wrote:
> >
> > Hi Zhipeng,
> >
> > Yes, I agree that the key difference between the two options is how they
> > support MIMO.
> >
> > My main concern with option 2 is the potentially inconsistent availability
> > of algorithms across the two sets of APIs. To make an algorithm available
> > in both sets, people would have to implement the same algorithm twice with
> > two different APIs. And sometimes an algorithm available as an AlgoOperator
> > may not exist as a Transformer. This seems pretty confusing to users.
> >
> > Therefore, personally speaking, I am in favor of one set of APIs. I also
> > feel that the MIMO extension of the Transformer/Estimator API is still
> > quite intuitive. However, I understand that others may think differently,
> > so it would be good to hear opinions from more people.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Tue, Jul 20, 2021 at 7:10 PM Zhipeng Zhang <zhangzhipe...@gmail.com>
> > wrote:
> >
> >> Hi Becket,
> >>
> >> Thanks for the review! I totally agree that it would be easier for
> >> people to discuss if we list the fundamental difference between these two
> >> proposals. (So I want to make the discussion even shorter.)
> >>
> >> In my opinion, the fundamental difference between proposal-1 and
> >> proposal-2 is how they support multi-input multi-output (MIMO) machine
> >> learning algorithms.
> >>
> >> Proposal-1 supports MIMO by extending Transformer/Estimator/Pipeline to
> >> take multiple inputs and produce multiple outputs.
> >>
> >> Proposal-2 does not change the definition of
> >> Transformer/Estimator/Pipeline. Rather, to support MIMO it introduces a
> >> new abstraction --- AlgoOperator, which is essentially an abstraction for
> >> machine learning functions. That is, proposal-2 uses the well-recognized
> >> Transformer/Estimator/Pipeline to support single-input single-output
> >> (SISO) machine learning algorithms, and AlgoOperator to support MIMO.
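> >>
> >> To make the contrast concrete, the method shapes would roughly be as
> >> follows (illustrative signatures, not the exact proposed APIs):
> >>
> >>     // Proposal-1: extend the existing interfaces to MIMO.
> >>     Table[] outputs = transformer.transform(input1, input2);
> >>     Transformer model = estimator.fit(input1, input2);
> >>
> >>     // Proposal-2: keep Transformer/Estimator SISO, and add a new
> >>     // general-purpose node type for MIMO machine learning functions.
> >>     Table output = transformer.transform(input);
> >>     Table[] opOutputs = algoOperator.process(input1, input2);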
> >>
> >> In my opinion, the benefit of proposal-1 is that there is only one set
> >> of APIs, which is clean. However, it breaks user habits (SISO for
> >> Transformer/Estimator/Pipeline), so users have to think more than before
> >> when using the new Transformer/Estimator/Pipeline. [1]
> >>
> >> The benefit of proposal-2 is that it does not change anything about the
> >> well-recognized Transformer/Estimator/Pipeline, so existing users (e.g.,
> >> Spark MLlib users) would be happy. However, as you mentioned, proposal-2
> >> introduces a new abstraction (AlgoOperator), which may increase the
> >> burden of understanding.
> >>
> >> [1]
> >>
> >> https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do/edit#heading=h.c2qr9r64btd9
> >>
> >>
> >> Thanks,
> >>
> >> Zhipeng Zhang
> >>
On Tue, Jul 20, 2021 at 11:42 AM Becket Qin <becket....@gmail.com> wrote:
> >>
> >>> Hi Dong, Zhipeng and Fan,
> >>>
> >>> Thanks for the detailed proposals. It is quite a lot of reading! Given
> >>> that we are introducing a lot of stuff here, I find it might be easier
> >>> for people to discuss if we list the fundamental differences first.
> >>> From what I understand, the most fundamental difference between the two
> >>> proposals is the following:
> >>>
> >>> * In order to support the graph structure, do we extend
> >>> Transformer/Estimator, or do we introduce a new set of APIs? *
> >>>
> >>> Proposal 1 tries to keep one set of APIs based on
> >>> Transformer/Estimator/Pipeline. More specifically, it does the following:
> >>>    - Make Transformer and Estimator multi-input and multi-output (MIMO).
> >>>    - Introduce Graph/GraphModel as counterparts of
> >>> Pipeline/PipelineModel.
> >>>
> >>> Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is.
> >>> Instead, it introduces AlgoOperators to support the graph structure. The
> >>> AlgoOperators are general-purpose graph nodes supporting MIMO. They are
> >>> independent of Pipeline / Transformer / Estimator.
> >>>
> >>>
> >>> My two cents:
> >>>
> >>> I think it is a big advantage to have a single set of APIs rather than
> >>> two independent sets, if possible. But I would suggest we change the
> >>> current proposal 1 a little bit by learning from proposal 2.
> >>>
> >>> What I like about proposal 1:
> >>> 1. A single set of APIs, symmetric in Graph/GraphModel and
> >>> Pipeline/PipelineModel.
> >>> 2. It keeps most of the benefits of Transformer/Estimator, including the
> >>> fit-then-transform relation and the save/load capability.
> >>>
> >>> However, proposal 1 also introduces some changes that I am not sure
> >>> about:
> >>>
> >>> 1. The most controversial part of proposal 1 is whether we should extend
> >>> Transformer/Estimator/Pipeline at all. In fact, different projects have
> >>> slightly different designs for Transformer/Estimator/Pipeline, so I think
> >>> it is OK to extend them. However, there are some commonly recognized core
> >>> semantics that we ideally want to keep. To me these core semantics are:
> >>>  1. Transformer is a Data -> Data conversion; Estimator deals with the
> >>> Data -> Model conversion.
> >>>  2. Estimator.fit() gives a Transformer, and users can just call
> >>> Transformer.transform() to perform inference.
> >>> As long as these core semantics are kept, an extension to the API
> >>> seems acceptable to me.
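> >>>
> >>> In code form, these core semantics are roughly (a sketch only, not the
> >>> exact proposed signatures):
> >>>
> >>>     Transformer model = estimator.fit(trainingData);  // Data -> Model
> >>>     Table output = model.transform(inputData);        // Data -> Data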
> >>>
> >>> Proposal 1 extends the semantics of Transformer from a Data -> Data
> >>> conversion to a generic Table -> Table conversion, and claims it is
> >>> equivalent to the "AlgoOperator" in proposal 2 as a general-purpose graph
> >>> node. This does change the first core semantic. That said, it might just
> >>> be a naming problem. One possible solution is to have a new subclass of
> >>> Stage without strong conventional semantics, e.g. "AlgoOp" if we borrow
> >>> the name from proposal 2, and let Transformer extend it. Just like a
> >>> PipelineModel is a more specific Transformer, a Transformer would be a
> >>> more specific "AlgoOp". If we do that, the processing logic that people
> >>> don't feel comfortable calling a Transformer can just be put into an
> >>> "AlgoOp" and thus can still be added to a Pipeline / Graph. This borrows
> >>> the advantage of proposal 2. In other words, this essentially makes
> >>> "AlgoOp" the equivalent of "AlgoOperator" in proposal 2, but allows it
> >>> to be added to a Graph or Pipeline if people want to.
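> >>>
> >>> As a sketch of the hierarchy I have in mind (all names and signatures
> >>> here are illustrative, borrowing "AlgoOp" from proposal 2):
> >>>
> >>>     interface Stage { /* common things such as save/load */ }
> >>>
> >>>     // General-purpose node without strong conventional semantics.
> >>>     interface AlgoOp extends Stage {
> >>>         Table[] process(Table... inputs);
> >>>     }
> >>>
> >>>     // A more specific AlgoOp that keeps the Data -> Data convention.
> >>>     interface Transformer extends AlgoOp {
> >>>         Table[] transform(Table... inputs);
> >>>     }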
> >>>
> >>> This also explains my thoughts on the concern that making
> >>> Transformer/Estimator MIMO would break the convention of a single-input
> >>> single-output (SISO) Transformer/Estimator. Since this does not change
> >>> the core semantics of Transformer/Estimator, it sounds like an intuitive
> >>> extension to me.
> >>>
> >>> 2. Another semantics-related case that was brought up is heterogeneous
> >>> topologies in training and inference. In that case, the input of an
> >>> Estimator would be different from the input of the Transformer returned
> >>> by Estimator.fit(). The example of this case is Word2Vec, where the
> >>> input of the Estimator is an article while the input of the Transformer
> >>> is a single word. The well-recognized ML Pipeline doesn't seem to
> >>> support this case, because it assumes the inputs of the Estimator and
> >>> the corresponding Transformer are the same.
> >>>
> >>> Both proposal 1 and proposal 2 leave this case unsupported in the
> >>> Pipeline. To support it,
> >>>   - Proposal 1 adds support for such cases in Graph/GraphModel by
> >>> introducing "EstimatorInput" and "TransformerInput". The downside is
> >>> that it complicates the API.
> >>>   - Proposal 2 leaves it to users to construct two different DAGs for
> >>> training and inference respectively. This means users would have to
> >>> construct the DAG twice even if most parts of the DAG are the same in
> >>> training and inference.
> >>>
> >>> My gut feeling is that this is not a critical difference, because such
> >>> a heterogeneous topology is sort of a corner case. Most users do not
> >>> need to worry about it. For those who do, either proposal 1 or proposal
> >>> 2 seems acceptable to me. That said, it looks like with proposal 1,
> >>> users can still choose to write the program twice without using the
> >>> Graph API, just like what they would do in proposal 2. So technically
> >>> speaking, proposal 1 is more flexible and allows users to choose either
> >>> flavor. On the other hand, one could argue that proposal 1 may confuse
> >>> users with these two flavors. Although personally it is clear to me, I
> >>> am open to other ideas.
> >>>
> >>> 3. Lastly, there was a concern that with proposal 1 some Estimators can
> >>> no longer be added to a Pipeline, while the original Pipeline accepts
> >>> any Estimator.
> >>>
> >>> It seems that users always have to make sure the input schema required
> >>> by an Estimator matches the input table. So even with the existing
> >>> Pipeline, people cannot naively add an arbitrary Estimator to a
> >>> pipeline. Admittedly, proposal 1 adds some more requirements, namely 1)
> >>> the number of inputs needs to match the number of outputs of the
> >>> previous stage, and 2) the Estimator must not generate a Transformer
> >>> with a different required input schema (the heterogeneous case mentioned
> >>> above). However, given that these mismatches result in exceptions at
> >>> compile time, just like when users add an Estimator with a mismatched
> >>> input schema, personally I find it does not change the user experience
> >>> much.
> >>>
> >>>
> >>> So, to summarize my thoughts on this fundamental difference:
> >>>    - In general, I personally prefer having one set of APIs.
> >>>    - The current proposal 1 may need some improvements in some cases,
> >>> borrowing from proposal 2.
> >>>
> >>>
> >>>
> >>> A few other differences that I consider non-fundamental:
> >>>
> >>> * Do we need a top-level encapsulation API for an Algorithm? *
> >>>
> >>> Proposal 1 has the concept of a Graph, which encapsulates an entire
> >>> algorithm to provide a unified API following the same semantics as
> >>> Estimator/Transformer. Users can choose not to package everything into
> >>> a Graph, and instead just write their own program and wrap it in an
> >>> ordinary function.
> >>>
> >>> Proposal 2 does not have a top-level API such as Graph. Instead, users
> >>> can choose to write an arbitrary function if they want to.
> >>>
> >>> From what I understand, in proposal 1 users may still choose to ignore
> >>> the Graph API and simply construct a DAG by themselves by calling
> >>> transform() and fit(), or calling AlgoOp.process() if we add "AlgoOp" to
> >>> proposal 1 as I suggested earlier. So Graph is just an additional way to
> >>> construct a graph - people can use Graph in a similar way as they use
> >>> Pipeline/PipelineModel. In other words, there is no conflict between
> >>> proposal 1 and proposal 2.
> >>>
> >>>
> >>> * The ways to describe a Graph? *
> >>>
> >>> Proposal 1 gives two ways to construct a DAG:
> >>> 1. the raw API, using Estimator/Transformer (and potentially "AlgoOp" as
> >>> well);
> >>> 2. the GraphBuilder API.
> >>>
> >>> Proposal 2 only gives the raw API of AlgoOperator. It assumes there is
> >>> a main output and some other side outputs, so it can call
> >>> algoOp1.linkFrom(algoOp2) without specifying the index of the output, at
> >>> the cost of wrapping all the Tables into an AlgoOperator.
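> >>>
> >>> For example, the two raw-API flavors would look roughly like this
> >>> (illustrative snippets, not the exact proposed code):
> >>>
> >>>     // Proposal 1: wire Tables between stages directly.
> >>>     Table[] features = featureOp.process(rawInput);
> >>>     Table[] predictions = model.transform(features[0]);
> >>>
> >>>     // Proposal 2: wire AlgoOperators, relying on the main-output
> >>>     // assumption so no output index is needed.
> >>>     predictOp.linkFrom(featureOp);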
> >>>
> >>> The usability argument was mostly about the raw APIs. I don't think
> >>> the two APIs differ much from each other. Under the same assumptions,
> >>> proposal 1 and proposal 2 can probably achieve very similar levels of
> >>> usability when describing a Graph, if not exactly the same.
> >>>
> >>>
> >>> There are some other differences/arguments between the two proposals.
> >>> However, I don't think they are fundamental, and just like the cases
> >>> mentioned above, the two proposals can easily learn from each other.
> >>>
> >>> Thanks,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
> >>> On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <lindon...@gmail.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two
> >>>> different designs to extend the Flink ML API to support more
> >>>> use cases, e.g. expressing a DAG of preprocessing and training logic.
> >>>> These two designs are documented in FLIP-173 <
> >>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783
> >>>> >.
> >>>>
> >>>> We have different opinions on the usability and ease of understanding
> >>>> of the proposed APIs. It would be really useful to have the open source
> >>>> community's comments on these designs and to learn your preferences.
> >>>>
> >>>> To facilitate the discussion, we have summarized our design
> >>>> principles and opinions in this Google doc <
> >>>> https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do
> >>>> >. Code snippets for a few example use cases are also provided in the
> >>>> doc to demonstrate the difference between the two solutions.
> >>>>
> >>>> This Flink ML API is super important to the future of the Flink ML
> >>>> library. Please feel free to reply to this email thread or comment in
> >>>> the Google doc directly.
> >>>>
> >>>> Thank you!
> >>>> Dong, Zhipeng, Fan
> >>>>
> >>>
> >>
> >> --
> >> best,
> >> Zhipeng
> >>
> >>
>
>
>
