Hi everyone,

Based on the feedback received in the online/offline discussions over the past few weeks, we (Zhipeng, Fan, myself, and a few other developers at Alibaba) have reached agreement on the design to support a DAG of algorithms. We have merged the ideas from the initial two options into this FLIP-176 <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783> design doc.
If you have comments on the latest design doc, please let us know!

Cheers,
Dong

On Mon, Aug 23, 2021 at 5:07 PM Becket Qin <becket....@gmail.com> wrote:

> Thanks for the comments, Fan. Please see the reply inline.
>
> On Thu, Aug 19, 2021 at 10:25 PM Fan Hong <hongfa...@gmail.com> wrote:
>
> > Hi, Becket,
> >
> > Many thanks for your detailed review. I agree that it is easier to involve more people in the discussion if the fundamental differences are highlighted.
> >
> > Here are some of my thoughts to help others think about these differences. (Correct me if any of the technical details are not right.)
> >
> > 1. One set of API or not? Maybe not that important.
> >
> > First of all, AlgoOperators and Pipeline / Transformer / Estimator in Proposal 2 are absolutely *NOT* independent.
> >
> > One may think they are independent because Pipeline / Transformer / Estimator are already in the Flink ML lib while AlgoOperators were only recently added in this proposal. But that's not true. If you check Alink [1], where the idea of Proposal 2 originated, both of them have been present for a long time, and they collaborate tightly.
> >
> > Functionality-wise, they are also not independent. Their relation is more like a two-level API for specifying ML tasks: AlgoOperator is a general-purpose level that can represent any ML algorithm, while Pipeline / Transformer / Estimator provides a higher-level API that enables wrapping multiple ML algorithms together in a fit-transform fashion.
>
> We probably need to first clarify what "independent" means here. Sure, users can always wrap a Transformer into an AlgoOperator, but users can basically wrap any code, any class into an AlgoOperator, and we wouldn't say AlgoOperator is not independent of any class, right? In my opinion, the two APIs are independent because even if we agree that Transformers do things that are conceptually a subset of what AlgoOperators do, a Transformer cannot be used as an AlgoOperator out of the box without wrapping. And even worse, a MIMO AlgoOperator cannot be wrapped into a Transformer / Estimator if those two APIs are SISO. So from what I see, in Option 2, these two APIs are independent from an API design perspective.
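> To make the wrapping point concrete, here is a minimal sketch of adapting a SISO Transformer to an AlgoOperator. The interfaces are hypothetical, since neither proposal pins down these exact signatures; I am assuming an AlgoOperator with a MIMO compute() and a SISO Transformer.
>
> import org.apache.flink.table.api.Table;
>
> // Hypothetical adapter: exposes a SISO Transformer as an AlgoOperator.
> public class TransformerAlgoOp implements AlgoOperator {
>     private final Transformer transformer;
>
>     public TransformerAlgoOp(Transformer transformer) {
>         this.transformer = transformer;
>     }
>
>     @Override
>     public Table[] compute(Table... inputs) {
>         // Only the SISO case can be adapted; there is no such adapter
>         // in the reverse direction for a MIMO AlgoOperator.
>         return new Table[] {transformer.transform(inputs[0])};
>     }
> }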
> > One could consider Flink DataStream - Table as an analogy to AlgoOperators - Pipeline. The two levels of API provide different functionality to end users, and the higher-level API calls the lower-level API in its internal implementation. I'm not saying the two-level API design in Proposal 2 is good just because Flink already did this; I just hope to help the community understand the relation between AlgoOperators and Pipeline.
>
> I am not sure it is accurate to say DataStream is a low-level API of Table. They are simply two different DSLs, one for the relational / SQL-like analytics paradigm, and the other for those who are more familiar with streaming applications. More importantly, they are designed to support conversion from one to the other out of the box, which is unlike Pipeline and AlgoOperators in proposal 2.
>
> > An additional usage and benefit of the Pipeline API is that a SISO PipelineModel corresponds exactly to a deployable unit for online serving.
> >
> > In online serving, the Flink runtime is usually avoided in order to achieve low latency, so models have to be wrapped for transmission from the Flink ecosystem to a non-Flink one. This is where the wrapping is really needed and inevitable, because serving providers are usually expected to handle one general type of model. The Pipeline API in Proposal 2 targets exactly this scenario without complicated APIs.
> >
> > Yet offline or nearline inference can be completed within the Flink ecosystem. That is where the Flink ML lib still applies, so a loose wrapping using AlgoOperators in Proposal 2 still works without much overhead.
>
> It seems that a MIMO transformer can easily support all SISO use cases, right? And there is zero overhead, because users may not have to wrap AlgoOperators, but can just build a Pipeline directly by putting either Transformers or AlgoOperators into it, without worrying about whether they are interoperable.
>
> > At the same time, these two levels of API are not redundant in their functionality; they have to collaborate to build ML tasks.
> >
> > The AlgoOperator API is self-consistent and self-complete in constructing ML tasks, but if users are seeking to wrap a sequence of subtasks, especially for online serving, the Pipeline / Transformer / Estimator API is inevitable. On the other hand, the Pipeline / Transformer / Estimator API lacks completeness, even in the extended version plus the Graph API in Proposal 1 (last case in [4]), so it cannot replace the AlgoOperator API.
> >
> > One case of their collaboration lies in my response to Mingliang's recommendation scenarios, where AlgoOperators + Pipeline can provide cleaner usage than the Graph API.
>
> I think the link/linkFrom API is more like a convenient wrapper around fit/transform/compute. Functionality-wise, they are equivalent. The Graph/GraphBuilder API, on the other hand, is an encapsulation design on top of Estimator/Transformer/AlgoOperators. Without the GraphBuilder/Graph API, users would have to create their own class to encapsulate the code, just like without the Pipeline API users would have to create their own class to wrap the pipeline logic. So I don't think we should compare link/linkFrom with the Graph/GraphBuilder API, because they serve different purposes. Even though both of them need to describe a DAG, GraphBuilder describes it for encapsulation while link/linkFrom does not.
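> To illustrate the equivalence, here is a sketch of the same two-step logic in both flavors. The KMeans* names and the exact signatures are hypothetical, and linkFrom is assumed to return the linked operator as in proposal 2.
>
> // Proposal 2 flavor: wire AlgoOperators with linkFrom.
> AlgoOperator predict(AlgoOperator source) {
>     AlgoOperator trainOp = new KMeansTrainOp().linkFrom(source);
>     return new KMeansPredictOp().linkFrom(trainOp, source);
> }
>
> // The same logic expressed in the fit/transform flavor.
> Table predict(Table data) {
>     KMeansModel model = new KMeans().fit(data);
>     return model.transform(data);
> }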
> > 2. What are the core semantics of Pipeline / Transformer / Estimator?
> >
> > I will not give my answer, because I can't. I think it would be difficult to reach an agreement on this. But I did two things, and hope they can provide some hints.
> >
> > One thing was to seek answers from other ML libraries. Scikit-learn and Spark ML are well-known general-purpose ML libraries. Spark ML gives definitions of Pipeline / Transformer / Estimator in its documentation, which I quote as follows [2]:
> >
> >> *Transformer* <https://spark.apache.org/docs/latest/ml-pipeline.html#transformers>: A Transformer is an algorithm which can transform *one* DataFrame into *another* DataFrame. E.g., an ML model is a Transformer which transforms *a* DataFrame with features into *a* DataFrame with predictions.
> >> *Estimator* <https://spark.apache.org/docs/latest/ml-pipeline.html#estimators>: An Estimator is an algorithm which can be fit on *a* DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
> >> *Pipeline* <https://spark.apache.org/docs/latest/ml-pipeline.html#pipeline>: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
> >
> > Spark ML clearly declares the number of inputs and outputs for the Estimator and Transformer APIs. Scikit-learn does not give a clear definition; instead, it presents its APIs [3]:
> >
> >> *Estimator:* The base object, implements a fit method to learn from data, either:
> >> estimator = estimator.fit(data, targets)
> >> or:
> >> estimator = estimator.fit(data)
> >>
> >> *Transformer:* For filtering or modifying the data, in a supervised or unsupervised way, implements:
> >> new_data = transformer.transform(data)
> >> When fitting and transforming can be performed much more efficiently together than separately, implements:
> >> new_data = transformer.fit_transform(data)
> >
> > In these API signatures, one input and one output are defined.
> >
> > The other thing I did was to look for concepts in big data APIs that are analogous to Pipeline / Transformer / Estimator in ML APIs, so that non-ML developers may better understand their positions. In the end, I think 'map' in the MapReduce paradigm may be a fair analogy that is easy for everyone to understand. One may think of 'map' as the MapFunction or FlatMapFunction in Flink, or the Mapper in Hadoop. As far as I know, no big data API tries to extend 'map' to support multiple inputs or outputs while still keeping the original name. In Flink, there exist CoMap and CoFlatMap, which can be considered extensions, yet they do not use the name 'map'.
> >
> > So, are the core semantics of 'map' a conversion from data to data, or from one dataset to another dataset? With either answer, the fact is that no one breaks the usage convention of 'map'.
>
> This is an interesting discussion. First of all, I don't think "map" is a good comparison here. That method is always defined on a class representing a data collection, so there is no actual data input to the method at all. The only semantic that makes sense is to operate on the data collection the `map` method was defined on. And the parameter of `map` is the processing logic, which would also be weird to have more than one of.
>
> Regarding scikit-learn and Spark, every API design has its own context, targeted use cases, and design goals. I think it is more important to analyze and understand WHY their APIs look like that and whether they are good designs, instead of just following WHAT they look like.
>
> In my opinion, one primary reason is that the Spark and scikit-learn Pipelines assume all the samples, whether for training or inference, are well prepared. They basically exclude the data preparation step from the Pipeline. Take recommendation systems as an example: it is quite typical that the samples are generated from user behaviors stored in different datasets, such as exposures and clicks, and maybe also from user profiles stored in relational databases. So MIMO is a must in this case. Today, data preparation is out of scope for scikit-learn and not included in its Pipeline API; people usually use other tools such as Pandas or Spark DataFrame to prepare the data.
>
> In order to discuss whether the MIMO Pipeline makes sense, we need to think about whether it is valuable to include the data preparation in the Pipeline as well. Personally I think it is a good extension and I don't see much harm in doing so. A MIMO Pipeline API would also support SISO Pipelines, so it is conceptually backwards compatible. Algorithms that only make sense as SISO can stay just as they are; the only difference is that instead of returning an output Table, they return an array containing a single Table. And for those who do need MIMO support, we also make them happy. Therefore it looks like a useful feature with little cost.
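> As a concrete sketch of what that extension might look like (the signatures here are illustrative, not a final design):
>
> import org.apache.flink.table.api.Table;
>
> // Illustrative MIMO Transformer; SISO is just the special case where
> // both the input and output arrays have length 1.
> public interface Transformer {
>     Table[] transform(Table... inputs);
> }
>
> // A SISO algorithm keeps its logic unchanged and returns a
> // single-element array.
> public class Tokenizer implements Transformer {
>     @Override
>     public Table[] transform(Table... inputs) {
>         Table tokenized = tokenize(inputs[0]); // existing SISO logic
>         return new Table[] {tokenized};
>     }
>
>     private Table tokenize(Table input) {
>         return input; // placeholder for the actual tokenization logic
>     }
> }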
> BTW, I have asked a few AI practitioners, from both industry and academia. The concept of a MIMO Pipeline itself seems well accepted. Somewhat surprisingly, although the concept of Transformer / Estimator is understood by most people I talked to, they are not familiar with what a Transformer / Estimator should look like. I think this is partly because the ML Pipeline is a well-known concept without a well-agreed API. In fact, even Spark and scikit-learn have quite different designs for Estimator / Transformer / Pipeline when it comes to the details.
>
> > 3. About potentially inconsistent availability of algorithms
> >
> > Becket has mentioned that developers may be confused about how to implement the same algorithm given the two levels of API in Proposal 2.
> >
> > If one accepts the relation between the AlgoOperator API and the Pipeline API described above, then this is not a problem. It is natural that developers implement their algorithms as AlgoOperators, and call those AlgoOperators from Estimators/Transformers.
> >
> > If not, I propose a rough idea here: an abstract class AlgoOpEstimatorImpl is provided as a subclass of Estimator. It has a method named getTrainOp() which returns the AlgoOperator where the computation logic resides. The other code in AlgoOpEstimatorImpl is fixed. In this way, developers of the Flink ML lib are asked to implement Estimators by inheriting from AlgoOpEstimatorImpl.
> >
> > Other solutions are also possible, but they may still need some community convention.
> >
> > I would also like to mention that the same issue exists in Proposal 1, as there are also multiple places where developers can implement algorithms.
>
> I am not sure I fully understand what "there are also multiple places where developers can implement algorithms" means. It is always the algorithm authors' call how to implement the interfaces. Implementation-wise, it is OK to have an abstract class such as AlgoImpl, and the algorithm authors can choose whether to leverage it. But in either case, the end users won't see the implementation class and should only rely on public interfaces such as Estimator / Transformer / AlgoOperator, etc.
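> For reference, here is a minimal sketch of Fan's rough idea. AlgoOpEstimatorImpl and getTrainOp() come from the email above; everything else, including the AlgoOpModel wrapper and the fit signature, is assumed for illustration.
>
> // The fixed plumbing lives in the abstract class; subclasses only
> // provide the AlgoOperator that holds the training logic.
> public abstract class AlgoOpEstimatorImpl implements Estimator {
>     protected abstract AlgoOperator getTrainOp();
>
>     @Override
>     public Transformer fit(Table... inputs) {
>         // Run the training op and wrap its output (the model data)
>         // into a Transformer for inference.
>         Table[] modelData = getTrainOp().compute(inputs);
>         return new AlgoOpModel(modelData);
>     }
> }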
> > In summary, I think the first and second issues above are preference-related, and I hope my thoughts can give some clues. The third issue can be considered a common technical problem in both proposals. We may work together to seek better solutions.
> >
> > Sincerely,
> > Fan Hong
> >
> > [1] https://github.com/alibaba/Alink
> > [2] https://spark.apache.org/docs/latest/ml-pipeline.html
> > [3] https://scikit-learn.org/stable/developers/develop.html
> > [4] https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do
> >
> > On Tue, Jul 20, 2021 at 11:42 AM Becket Qin <becket....@gmail.com> wrote:
> >
> >> Hi Dong, Zhipeng and Fan,
> >>
> >> Thanks for the detailed proposals. It is quite a lot of reading! Given that we are introducing a lot of stuff here, I find that it might be easier for people to discuss if we list the fundamental differences first. From what I understand, the most fundamental difference between the two proposals is the following:
> >>
> >> * In order to support a graph structure, do we extend Transformer/Estimator, or do we introduce a new set of API? *
> >>
> >> Proposal 1 tries to keep one set of API, which is based on Transformer/Estimator/Pipeline. More specifically, it does the following:
> >> - Make Transformer and Estimator multi-input and multi-output (MIMO).
> >> - Introduce Graph/GraphModel as counterparts of Pipeline/PipelineModel.
> >>
> >> Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is. Instead, it introduces AlgoOperators to support the graph structure. The AlgoOperators are general-purpose graph nodes supporting MIMO. They are independent of Pipeline / Transformer / Estimator.
> >>
> >> My two cents:
> >>
> >> I think it is a big advantage to have a single set of API rather than two independent sets of API, if possible. But I would suggest we change the current proposal 1 a little bit, by learning from proposal 2.
> >>
> >> What I like about proposal 1:
> >> 1. A single set of API, symmetric in Graph/GraphModel and Pipeline/PipelineModel.
> >> 2. Keeping most of the benefits of Transformer/Estimator, including the fit-then-transform relation and the save/load capability.
> >>
> >> However, proposal 1 also introduces some changes that I am not sure about:
> >>
> >> 1. The most controversial part of proposal 1 is whether we should extend Transformer/Estimator/Pipeline. In fact, different projects have slightly different designs for Transformer/Estimator/Pipeline, so I think it is OK to extend them. However, there are some commonly recognized core semantics that we ideally want to keep. To me these core semantics are:
> >> 1. Transformer is a Data -> Data conversion; Estimator deals with the Data -> Model conversion.
> >> 2. Estimator.fit() gives a Transformer, and users can just call Transformer.transform() to perform inference.
> >> To me, as long as these core semantics are kept, an extension to the API seems acceptable.
> >>
> >> Proposal 1 extends the semantics of Transformer from a Data -> Data conversion to a generic Table -> Table conversion, and claims it is equivalent to the "AlgoOperator" in proposal 2 as a general-purpose graph node. That does change the first semantic. That said, this might just be a naming problem. One possible solution is to have a new subclass of Stage without strong conventional semantics, e.g. "AlgoOp" if we borrow the name from proposal 2, and let Transformer extend it. Just like a PipelineModel is a more specific Transformer, a Transformer would be a more specific "AlgoOp". If we do that, the processing logic that people don't feel comfortable calling a Transformer can just be put into an "AlgoOp" and thus can still be added to a Pipeline / Graph. This borrows the advantage of proposal 2. In other words, this essentially makes "AlgoOp" the equivalent of "AlgoOperator" in proposal 2, but allows it to be added to a Graph or Pipeline if people want to.
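> >> To sketch that hierarchy concretely (names and method signatures are only for illustration, not part of either proposal):
> >>
> >> // Stage is assumed to be the existing base interface for pipeline stages.
> >> // Generic graph node without the Data -> Data convention.
> >> public interface AlgoOp extends Stage {
> >>     Table[] process(Table... inputs);
> >> }
> >>
> >> // A Transformer is a more specific AlgoOp, just like a PipelineModel
> >> // is a more specific Transformer.
> >> public interface Transformer extends AlgoOp {
> >>     Table[] transform(Table... inputs);
> >>
> >>     @Override
> >>     default Table[] process(Table... inputs) {
> >>         return transform(inputs);
> >>     }
> >> }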
> >> This also gives my thoughts regarding the concern that making the Transformer/Estimator MIMO would break the convention of the single-input single-output (SISO) Transformer/Estimator. Since this does not change the core semantics of Transformer/Estimator, it sounds like an intuitive extension to me.
> >>
> >> 2. Another semantics-related case brought up was heterogeneous topologies in training and inference. In that case, the input of an Estimator would be different from the input of the Transformer returned by Estimator.fit(). The example of this case is Word2Vec, where the input of the Estimator is an article while the input to the Transformer is a single word. The well-recognized ML Pipeline doesn't seem to support this case, because it assumes the inputs of the Estimator and the corresponding Transformer are the same.
> >>
> >> Both proposal 1 and proposal 2 leave this case unsupported in the Pipeline. To support it:
> >> - Proposal 1 adds support for such cases in Graph/GraphModel by introducing "EstimatorInput" and "TransformerInput". The downside is that it complicates the API.
> >> - Proposal 2 leaves it to users to construct two different DAGs for training and inference respectively. This means users would have to construct the DAG twice even if most parts of the DAG are the same in training and inference.
> >>
> >> My gut feeling is that this is not a critical difference, because such a heterogeneous topology is sort of a corner case. Most users do not need to worry about it. For those who do need it, either proposal 1 or proposal 2 seems acceptable to me. That said, it looks like with proposal 1 users can still choose to write the program twice without using the Graph API, just like what they do in proposal 2. So technically speaking, proposal 1 is more flexible and allows users to choose either flavor. On the other hand, one could argue that proposal 1 may confuse users with these two flavors. Although personally I feel it is clear, I am open to other ideas.
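> >> For illustration, a hypothetical sketch of how the Graph flavor might express the Word2Vec case. The EstimatorInput/TransformerInput names come from proposal 1; the GraphBuilder methods here are made up to show the idea, not taken from the proposal.
> >>
> >> GraphBuilder builder = new GraphBuilder();
> >> TableId articles = builder.createTableId(); // consumed only when fitting
> >> TableId words = builder.createTableId();    // consumed only at inference
> >>
> >> // Train on articles, but let the fitted transformer consume words.
> >> TableId embeddings = builder.addStage(
> >>     new Word2Vec(),
> >>     new EstimatorInput(articles),
> >>     new TransformerInput(words));
> >>
> >> Graph graph = builder.build(
> >>     new TableId[] {articles, words}, new TableId[] {embeddings});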
> >> 3. Lastly, there was a concern about proposal 1: some Estimators can no longer be added to a Pipeline, while the original Pipeline accepts any Estimator.
> >>
> >> It seems that users always have to make sure the input schema required by the Estimator matches the input table. So even with the existing Pipeline, people cannot naively add any Estimator to a pipeline. Admittedly, proposal 1 adds some more requirements, namely 1) the number of inputs needs to match the number of outputs of the previous stage, and 2) the Estimator must not generate a transformer with a different required input schema (the heterogeneous case mentioned above). However, given that these mismatches will result in exceptions at compile time, just as when users put in an Estimator with a mismatched input schema, I personally find this does not change the user experience much.
> >>
> >> So to summarize my thoughts on this fundamental difference:
> >> - In general, I personally prefer having one set of API.
> >> - The current proposal 1 may need some improvements in some cases, by borrowing ideas from proposal 2.
> >>
> >> A few other differences that I consider non-fundamental:
> >>
> >> * Do we need a top-level encapsulation API for an algorithm? *
> >>
> >> Proposal 1 has the concept of a Graph, which encapsulates an entire algorithm to provide a unified API following the same semantics as Estimator/Transformer. Users can choose not to package everything into a Graph, and instead just write their own program and wrap it in an ordinary function.
> >>
> >> Proposal 2 does not have a top-level API such as Graph. Instead, users can write an arbitrary function if they want to.
> >>
> >> From what I understand, in proposal 1 users may still choose to ignore the Graph API and simply construct a DAG by themselves by calling transform() and fit(), or by calling AlgoOp.process() if we add "AlgoOp" to proposal 1 as I suggested earlier. So Graph is just an additional way to construct a graph; people can use Graph in a similar way as they do the Pipeline/PipelineModel. In other words, there is no conflict between proposal 1 and proposal 2.
> >>
> >> * The ways to describe a Graph? *
> >>
> >> Proposal 1 gives two ways to construct a DAG:
> >> 1. the raw API using Estimator/Transformer (potentially "AlgoOp" as well);
> >> 2. the GraphBuilder API.
> >>
> >> Proposal 2 only gives the raw API of AlgoOperator. It assumes there is a main output and some other side outputs, so it can call algoOp1.linkFrom(algoOp2) without specifying the index of the output, at the cost of wrapping all the Tables into an AlgoOperator.
> >>
> >> The usability argument was mostly around the raw APIs. I don't think the two APIs differ much from each other. Under the same assumption, proposal 1 and proposal 2 can probably achieve very similar levels of usability when describing a Graph, if not exactly the same.
> >>
> >> There are some other differences/arguments between the two proposals, but I don't think they are fundamental. And just like the cases mentioned above, the two proposals can easily learn from each other.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >> On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <lindon...@gmail.com> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two different designs to extend the Flink ML API to support more use cases, e.g. expressing a DAG of preprocessing and training logic. These two designs have been documented in FLIP-173 <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783>.
> >>>
> >>> We have different opinions on the usability and ease of understanding of the proposed APIs. It would be really useful to have comments on those designs from the open source community and to learn your preferences.
> >>>
> >>> To facilitate the discussion, we have summarized our design principles and opinions in this Google doc <https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do>. Code snippets for a few example use cases are also provided in the doc to demonstrate the difference between the two solutions.
> >>>
> >>> This Flink ML API is super important to the future of the Flink ML library. Please feel free to reply to this email thread or comment in the Google doc directly.
> >>>
> >>> Thank you!
> >>> Dong, Zhipeng, Fan