Hi everyone,

Just FYI, if there are no further suggestions on the FLIP, we plan to start
the voting thread this Friday, 9/10.

Thanks,
Dong


On Fri, Aug 27, 2021 at 10:32 AM Zhipeng Zhang <zhangzhipe...@gmail.com>
wrote:

> Thanks for the post, Dong :)
>
> We welcome everyone to drop us an email about Flink ML. Let's work
> together to build machine learning on Flink :)
>
>
> Dong Lin <lindon...@gmail.com> wrote on Wed, Aug 25, 2021 at 8:58 PM:
>
>> Hi everyone,
>>
>> Based on the feedback received in the online/offline discussions over the
>> past few weeks, we (Zhipeng, Fan, myself and a few other developers at
>> Alibaba) have reached agreement on the design to support a DAG of
>> algorithms. We have merged the ideas from the initial two options into the
>> FLIP-176
>> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783>
>> design doc.
>>
>> If you have comments on the latest design doc, please let us know!
>>
>> Cheers,
>> Dong
>>
>>
>> On Mon, Aug 23, 2021 at 5:07 PM Becket Qin <becket....@gmail.com> wrote:
>>
>>> Thanks for the comments, Fan. Please see the reply inline.
>>>
>>> On Thu, Aug 19, 2021 at 10:25 PM Fan Hong <hongfa...@gmail.com> wrote:
>>>
>>> > Hi, Becket,
>>> >
>>> > Many thanks for your detailed review. I agree that it is easier to
>>> > involve more people in the discussion if the fundamental differences
>>> > are highlighted.
>>> >
>>> >
>>> > Here are some of my thoughts to help others think about these
>>> > differences. (Correct me if any of the technical details are not right.)
>>> >
>>> >
>>> > 1. One set of APIs or not? Maybe not that important.
>>> >
>>> >
>>> > First of all, AlgoOperators and Pipeline / Transformer / Estimator in
>>> > Proposal 2 are absolutely *NOT* independent.
>>> >
>>> >
>>> > One may think they are independent, because Pipeline / Transformer /
>>> > Estimator are already in the Flink ML lib while AlgoOperators were only
>>> > added recently in this proposal. But that's not true. If you check
>>> > Alink [1], where the idea of Proposal 2 originated, both of them have
>>> > been present for a long time, and they collaborate tightly.
>>> >
>>>
>>> > In terms of functionality, they are also not independent. Their
>>> > relation is more like a two-level API for specifying ML tasks:
>>> > AlgoOperator is a general-purpose level that can represent any ML
>>> > algorithm, while Pipeline / Transformer / Estimator provides a
>>> > higher-level API which enables wrapping multiple ML algorithms together
>>> > in a fit-transform fashion.
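>>> >
>>> > To make this concrete, here is a minimal sketch of how the two levels
>>> > could relate. The interface and class names are hypothetical stand-ins,
>>> > not the actual Alink or FLIP code:
>>> >
>>> >     import org.apache.flink.table.api.Table;
>>> >
>>> >     // Hypothetical interfaces standing in for the two API levels.
>>> >     interface AlgoOperator { Table[] compute(Table... inputs); }
>>> >     interface Transformer { Table[] transform(Table... inputs); }
>>> >     interface Estimator<T extends Transformer> { T fit(Table... inputs); }
>>> >
>>> >     // Lower level: the general-purpose operator holding the actual
>>> >     // computation logic.
>>> >     class KMeansTrainOp implements AlgoOperator {
>>> >         public Table[] compute(Table... inputs) {
>>> >             return inputs; // stands in for the real training logic
>>> >         }
>>> >     }
>>> >
>>> >     // Higher level: the Estimator simply delegates to the AlgoOperator.
>>> >     class KMeansEstimator implements Estimator<KMeansModel> {
>>> >         public KMeansModel fit(Table... inputs) {
>>> >             return new KMeansModel(new KMeansTrainOp().compute(inputs));
>>> >         }
>>> >     }
>>> >
>>> >     class KMeansModel implements Transformer {
>>> >         private final Table[] modelData;
>>> >         KMeansModel(Table[] modelData) { this.modelData = modelData; }
>>> >         public Table[] transform(Table... inputs) {
>>> >             return inputs; // stands in for the real inference logic
>>> >         }
>>> >     }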
>>> >
>>>
>>> We probably need to first clarify what "independent" means here. Sure,
>>> users can always wrap a Transformer into an AlgoOperator, but users can
>>> basically wrap any code, any class into an AlgoOperator, and we wouldn't
>>> say AlgoOperator is not independent of any class, right? In my opinion,
>>> the two APIs are independent because even if we agree that Transformers
>>> are doing things that are conceptually a subset of what AlgoOperators do,
>>> a Transformer cannot be used as an AlgoOperator out of the box without
>>> wrapping. And even worse, a MIMO AlgoOperator cannot be wrapped into a
>>> Transformer / Estimator if these two APIs are SISO. So from what I see,
>>> in Proposal 2, these two APIs are independent from an API design
>>> perspective.
>>>
>>> > One could consider Flink DataStream - Table as an analogy to
>>> > AlgoOperator - Pipeline. The two API levels provide different
>>> > functionalities to end users, and the higher-level API calls the
>>> > lower-level API in its internal implementation. I'm not saying the
>>> > two-level API design in Proposal 2 is good just because Flink already
>>> > did this. I just hope to help the community understand the relation
>>> > between AlgoOperators and Pipeline.
>>> >
>>>
>>> I am not sure it is accurate to say DataStream is a low-level API of
>>> Table. They are simply two different DSLs, one for the relational /
>>> SQL-like analytics paradigm, and the other for those who are more
>>> familiar with streaming applications. More importantly, they are designed
>>> to support conversion from one to the other out of the box, which is
>>> unlike Pipeline and AlgoOperators in Proposal 2.
>>>
>>>
>>> > An additional usage and benefit of the Pipeline API is that a SISO
>>> > PipelineModel corresponds exactly to a deployable unit for online
>>> > serving.
>>> >
>>> > In online serving, the Flink runtime is usually avoided to achieve low
>>> > latency. So models have to be wrapped for transmission from the Flink
>>> > ecosystem to a non-Flink one. This is the place where wrapping is
>>> > really needed and inevitable, because serving service providers are
>>> > usually expected to be generic over one type of model. The Pipeline API
>>> > in Proposal 2 targets exactly this scenario without complicating the
>>> > APIs.
>>> >
>>> > Yet offline or nearline inference can be completed within the Flink
>>> > ecosystem. That's where the Flink ML lib still applies, so a loose
>>> > wrapping using AlgoOperators in Proposal 2 still works without much
>>> > overhead.
>>> >
>>>
>>> It seems that a MIMO Transformer can easily support all SISO use cases,
>>> right? And there is zero overhead, because users do not have to wrap
>>> AlgoOperators; they can just build a Pipeline directly by putting either
>>> Transformers or AlgoOperators into it, without worrying about whether
>>> they are interoperable.
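>>>
>>> As a sketch of what I mean (Pipeline#appendStage and the concrete stage
>>> classes are hypothetical names, not the final FLIP API; trainTables and
>>> servingTables are Table[] inputs prepared elsewhere):
>>>
>>>     // Hypothetical usage: a Pipeline accepts Transformers, Estimators
>>>     // and AlgoOperators alike, so no wrapping is needed.
>>>     Pipeline pipeline = new Pipeline()
>>>         .appendStage(new FeatureJoinOp())        // a MIMO AlgoOperator
>>>         .appendStage(new StandardScaler())       // a SISO Transformer
>>>         .appendStage(new LogisticRegression());  // an Estimator
>>>     PipelineModel model = pipeline.fit(trainTables);
>>>     Table[] predictions = model.transform(servingTables);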
>>>
>>>
>>> > At the same time, these two levels of APIs are not redundant in their
>>> > functionalities; they have to collaborate to build ML tasks.
>>> >
>>> > The AlgoOperator API is self-consistent and complete for constructing
>>> > ML tasks, but if users are seeking to wrap a sequence of subtasks,
>>> > especially for online serving, the Pipeline / Transformer / Estimator
>>> > API is inevitable. On the other hand, the Pipeline / Transformer /
>>> > Estimator API lacks completeness, even in the extended version plus the
>>> > Graph API in Proposal 1 (the last case in [4]), so it cannot replace
>>> > the AlgoOperator API.
>>> >
>>> > One case of their collaboration lies in my response to Mingliang's
>>> > recommendation scenarios, where AlgoOperators + Pipeline can provide
>>> > cleaner usage than the Graph API.
>>> >
>>>
>>> I think the link/linkFrom API is more like a convenience wrapper around
>>> fit/transform/compute. Functionality-wise, they are equivalent. The
>>> Graph/GraphBuilder API, on the other hand, is an encapsulation design on
>>> top of Estimator/Transformer/AlgoOperators. Without the
>>> GraphBuilder/Graph API, users would have to create their own class to
>>> encapsulate the code, just like without the Pipeline API, users would
>>> have to create their own class to wrap the pipeline logic. So I don't
>>> think we should compare link/linkFrom with the Graph/GraphBuilder API,
>>> because they serve different purposes. Even though both of them need to
>>> describe a DAG, GraphBuilder describes it for encapsulation while
>>> link/linkFrom does not.
>>>
>>>
>>> >
>>> > 2. What are the core semantics of Pipeline / Transformer / Estimator?
>>> >
>>> >
>>> > I will not give my own answer, because I can't. I think it would be
>>> > difficult to reach an agreement on this.
>>> >
>>> > But I did two things, and I hope they can provide some hints.
>>> >
>>> >
>>> > One thing is to seek answers from other ML libraries. Scikit-learn and
>>> > Spark ML are well-known general-purpose ML libraries.
>>> >
>>> > Spark ML gives the definition of Pipeline / Transformer / Estimator in
>>> > its documentation. I quote it as follows [2]:
>>> >
>>> >
>>> > *Transformer*
>>> >> <https://spark.apache.org/docs/latest/ml-pipeline.html#transformers>:
>>> >> A Transformer is an algorithm which can transform *one* DataFrame into
>>> >> *another* DataFrame. E.g., an ML model is a Transformer which
>>> transforms
>>> >> *a* DataFrame with features into *a* DataFrame with predictions.
>>> >> *Estimator*
>>> >> <https://spark.apache.org/docs/latest/ml-pipeline.html#estimators>:
>>> >> An Estimator is an algorithm which can be fit on *a* DataFrame to
>>> >> produce a Transformer. E.g., a learning algorithm is an Estimator
>>> which
>>> >> trains on a DataFrame and produces a model.
>>> >> *Pipeline*
>>> >> <https://spark.apache.org/docs/latest/ml-pipeline.html#pipeline>:
>>> >> A Pipeline chains multiple Transformers and Estimators together to
>>> specify
>>> >> an ML workflow.
>>> >
>>> >
>>> > Spark ML clearly declares the number of inputs and outputs for the
>>> > Estimator and Transformer APIs. Scikit-learn does not give a clear
>>> > definition; instead, it presents its APIs [3]:
>>> >
>>> >
>>> >
>>> >> *Estimator:* The base object, implements a fit method to learn from
>>> data,
>>> >> either:
>>> >> estimator = estimator.fit(data, targets)
>>> >> or:
>>> >> estimator = estimator.fit(data)
>>> >>
>>> >> *Transformer:* For filtering or modifying the data, in a supervised or
>>> >> unsupervised way, implements:
>>> >> new_data = transformer.transform(data)
>>> >> When fitting and transforming can be performed much more efficiently
>>> >> together than separately, implements:
>>> >> new_data = transformer.fit_transform(data)
>>> >
>>> >
>>> > In these API signatures, one input and one output are defined.
>>> >
>>> >
>>> > Another thing I did was to look for concepts in Big Data APIs that can
>>> > serve as analogies to Pipeline / Transformer / Estimator, so that
>>> > non-ML developers may better understand their positions in ML APIs.
>>> >
>>> > In the end, I think 'map' in the MapReduce paradigm may be a fair
>>> > analogy that is easy for everyone to understand. One may think of 'map'
>>> > as the MapFunction or FlatMapFunction in Flink, or the Mapper in
>>> > Hadoop. As far as I know, no Big Data API has tried to extend 'map' to
>>> > support multiple inputs or outputs while keeping the original name. In
>>> > Flink, there exist CoMapFunction and CoFlatMapFunction, which can be
>>> > considered extensions, yet they do not use the name 'map'.
>>> >
>>> >
>>> > So, is the core semantic of 'map' a conversion from data to data, or
>>> > from one dataset to another dataset? With either answer, the fact is
>>> > that no one breaks the usage convention of 'map'.
>>> >
>>> >
>>> >
>>> This is an interesting discussion. First of all, I don't think "map" is a
>>> good comparison here. This method is always defined on a class
>>> representing a data collection, so there is no actual data input to the
>>> method at all. The only semantic that makes sense is to operate on the
>>> data collection the `map` method was defined on. And the parameter of
>>> `map` is the processing logic, which it would also be weird to have more
>>> than one of.
>>>
>>> Regarding scikit-learn and Spark, every API design has its own context,
>>> targeted use cases and design goals. I think it is more important to
>>> analyze and understand the reason WHY their APIs look like that and
>>> whether they are good designs, instead of just following WHAT they look
>>> like.
>>>
>>> In my opinion, one primary reason is that the Spark and scikit-learn
>>> Pipelines assume that all the samples, whether for training or inference,
>>> are well prepared. They basically exclude the data preparation step from
>>> the Pipeline. Take recommendation systems as an example: it is quite
>>> typical that the samples are generated from user behaviors stored in
>>> different datasets, such as exposures and clicks, and maybe also user
>>> profiles stored in relational databases. So MIMO is a must in this case.
>>> Today, data preparation is out of the scope of scikit-learn and not
>>> included in its Pipeline API; people usually use other tools such as
>>> Pandas or Spark DataFrames to prepare the data.
>>>
>>> In order to discuss whether a MIMO Pipeline makes sense, we need to
>>> consider whether it is valuable to include the data preparation in the
>>> Pipeline as well. Personally I think it is a good extension and I don't
>>> see much harm in doing so. Apparently, a MIMO Pipeline API would also
>>> support SISO Pipelines, so it is conceptually backwards compatible. Those
>>> algorithms that only make sense as SISO can stay just as they are; the
>>> only difference is that instead of returning an output Table, they return
>>> a single-table array. And for those who do need MIMO support, we also
>>> make them happy. Therefore it looks like a useful feature with little
>>> cost.
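>>>
>>> Concretely, the signature change I have in mind is something like the
>>> following sketch (illustrative only, not the final interface):
>>>
>>>     import org.apache.flink.table.api.Table;
>>>
>>>     // Sketch: a MIMO transform() subsumes the SISO case at no cost.
>>>     interface Transformer {
>>>         Table[] transform(Table... inputs);
>>>     }
>>>
>>>     // A SISO algorithm stays as it is; the only difference is that it
>>>     // returns a single-table array instead of a bare Table.
>>>     class StandardScaler implements Transformer {
>>>         public Table[] transform(Table... inputs) {
>>>             Table scaled = inputs[0]; // stands in for the scaling logic
>>>             return new Table[] { scaled };
>>>         }
>>>     }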
>>>
>>> BTW, I have asked a few AI practitioners, from both industry and
>>> academia. The concept of a MIMO Pipeline itself seems well accepted.
>>> Somewhat surprisingly, although the concept of Transformer / Estimator is
>>> understood by most people I talked to, they are not familiar with what
>>> exactly a Transformer / Estimator should look like. I think this is
>>> partly because the ML Pipeline is a well-known concept without a
>>> well-agreed API. In fact, even Spark and scikit-learn have quite
>>> different designs for Estimator / Transformer / Pipeline when it comes to
>>> the details.
>>>
>>>
>>> > 3. About potentially inconsistent availability of algorithms
>>> >
>>> >
>>> > Becket has mentioned that developers may be confused about how to
>>> > implement the same algorithm in the two API levels of Proposal 2.
>>> >
>>> > If one accepts the relation between the AlgoOperator API and the
>>> > Pipeline API described above, then this is not a problem. It is natural
>>> > for developers to implement their algorithms as AlgoOperators, and to
>>> > call those AlgoOperators from Estimators/Transformers.
>>> >
>>> >
>>> > If not, I propose a rough idea here:
>>> >
>>> > An abstract class AlgoOpEstimatorImpl is provided as a subclass of
>>> > Estimator. It has a method named getTrainOp() which returns the
>>> > AlgoOperator where the computation logic resides. The rest of the code
>>> > in AlgoOpEstimatorImpl is fixed. In this way, developers of the Flink
>>> > ML lib are asked to implement Estimators by inheriting from
>>> > AlgoOpEstimatorImpl.
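>>> >
>>> > A minimal sketch of this idea, reusing the hypothetical AlgoOperator /
>>> > Transformer / Estimator interfaces from the sketch earlier in this mail
>>> > (all names are illustrative):
>>> >
>>> >     import org.apache.flink.table.api.Table;
>>> >
>>> >     // Hypothetical base class: a subclass only supplies the training
>>> >     // AlgoOperator and the model wrapper; the fit() glue code is fixed.
>>> >     abstract class AlgoOpEstimatorImpl<T extends Transformer>
>>> >             implements Estimator<T> {
>>> >
>>> >         // Returns the AlgoOperator where the computation logic resides.
>>> >         protected abstract AlgoOperator getTrainOp();
>>> >
>>> >         // Wraps the produced model data into a Transformer.
>>> >         protected abstract T wrapModelData(Table[] modelData);
>>> >
>>> >         @Override
>>> >         public final T fit(Table... inputs) {
>>> >             Table[] modelData = getTrainOp().compute(inputs);
>>> >             return wrapModelData(modelData);
>>> >         }
>>> >     }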
>>> >
>>> > Other solutions are also possible, but they would still need some
>>> > community convention.
>>> >
>>> >
>>> > I would also like to mention that the same issue exists in Proposal 1,
>>> > as there are likewise multiple places where developers can implement
>>> > algorithms.
>>> >
>>> >
>>> I am not sure I fully understand what "there are also multiple places
>>> where developers can implement algorithms" means. It is always the
>>> algorithm authors' call how to implement the interfaces.
>>> Implementation-wise, it is OK to have an abstract class such as AlgoImpl;
>>> the algorithm authors can choose to leverage it or not. But in either
>>> case, the end users won't see the implementation class and should only
>>> rely on public interfaces such as Estimator / Transformer / AlgoOperator,
>>> etc.
>>>
>>>
>>> >
>>>
>>> > In summary, I think the first and second issues above are matters of
>>> > preference, and I hope my thoughts can give some clues. The third issue
>>> > can be considered a common technical problem in both proposals; we can
>>> > work together to seek better solutions.
>>> >
>>> >
>>> > Sincerely,
>>> >
>>> > Fan Hong.
>>> >
>>> >
>>> >
>>> > [1] https://github.com/alibaba/Alink
>>> >
>>> > [2] https://spark.apache.org/docs/latest/ml-pipeline.html
>>> >
>>> > [3] https://scikit-learn.org/stable/developers/develop.html
>>> >
>>> > [4]
>>> >
>>> https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do
>>> >
>>> > On Tue, Jul 20, 2021 at 11:42 AM Becket Qin <becket....@gmail.com>
>>> wrote:
>>> >
>>> >> Hi Dong, Zhipeng and Fan,
>>> >>
>>> >> Thanks for the detailed proposals. It is quite a lot of reading! Given
>>> >> that we are introducing a lot of stuff here, I find that it might be
>>> >> easier for people to discuss if we list the fundamental differences
>>> >> first. From what I understand, the most fundamental difference between
>>> >> the two proposals is the following:
>>> >>
>>> >> * In order to support the graph structure, do we extend
>>> >> Transformer/Estimator, or do we introduce a new set of APIs? *
>>> >>
>>> >> Proposal 1 tries to keep one set of APIs, based on
>>> >> Transformer/Estimator/Pipeline. More specifically, it does the
>>> >> following:
>>> >>     - Make Transformer and Estimator multi-input and multi-output
>>> >> (MIMO).
>>> >>     - Introduce Graph/GraphModel as counterparts of
>>> >> Pipeline/PipelineModel.
>>> >>
>>> >> Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is.
>>> >> Instead, it introduces AlgoOperators to support the graph structure.
>>> The
>>> >> AlgoOperators are general-purpose graph nodes supporting MIMO. They
>>> are
>>> >> independent of Pipeline / Transformer / Estimator.
>>> >>
>>> >>
>>> >> My two cents:
>>> >>
>>> >> I think it is a big advantage to have a single set of APIs rather than
>>> >> two independent sets, if possible. But I would suggest we change the
>>> >> current proposal 1 a little bit, by learning from proposal 2.
>>> >>
>>> >> What I like about proposal 1:
>>> >> 1. A single set of APIs, symmetric in Graph/GraphModel and
>>> >> Pipeline/PipelineModel.
>>> >> 2. Keeping most of the benefits of Transformer/Estimator, including
>>> >> the fit-then-transform relation and the save/load capability.
>>> >>
>>> >> However, proposal 1 also introduces some changes that I am not sure
>>> >> about:
>>> >>
>>> >> 1. The most controversial part of proposal 1 is whether we should
>>> >> extend Transformer/Estimator/Pipeline at all. In fact, different
>>> >> projects have slightly different designs for
>>> >> Transformer/Estimator/Pipeline, so I think it is OK to extend it.
>>> >> However, there are some commonly recognized core semantics that we
>>> >> ideally want to keep. To me these core semantics are:
>>> >>   1. Transformer is a Data -> Data conversion; Estimator deals with
>>> >> the Data -> Model conversion.
>>> >>   2. Estimator.fit() gives a Transformer, and users can just call
>>> >> Transformer.transform() to perform inference.
>>> >> As long as these core semantics are kept, an extension of the API
>>> >> seems acceptable to me.
>>> >>
>>> >> Proposal 1 extends the semantics of Transformer from a Data -> Data
>>> >> conversion to a generic Table -> Table conversion, and claims it is
>>> >> equivalent to the "AlgoOperator" in proposal 2 as a general-purpose
>>> >> graph node. That does change the first core semantic. That said, this
>>> >> might just be a naming problem. One possible solution is to have a new
>>> >> subclass of Stage without strong conventional semantics, e.g. "AlgoOp"
>>> >> if we borrow the name from proposal 2, and let Transformer extend it.
>>> >> Just like a PipelineModel is a more specific Transformer, a
>>> >> Transformer would be a more specific "AlgoOp". If we do that,
>>> >> processing logic that people don't feel comfortable treating as a
>>> >> Transformer can just be put into an "AlgoOp" and still be added to a
>>> >> Pipeline / Graph. This borrows the advantage of proposal 2. In other
>>> >> words, this essentially makes "AlgoOp" the equivalent of
>>> >> "AlgoOperator" in proposal 2, but allows it to be added to a Graph or
>>> >> Pipeline if people want to.
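>>> >>
>>> >> A rough sketch of the hierarchy I am suggesting (names borrowed from
>>> >> this discussion; signatures are illustrative, not a finished design):
>>> >>
>>> >>     import org.apache.flink.table.api.Table;
>>> >>
>>> >>     // Common base of all graph nodes, e.g. for save/load and params.
>>> >>     interface Stage {}
>>> >>
>>> >>     // General-purpose MIMO graph node without strong conventional
>>> >>     // semantics.
>>> >>     interface AlgoOp extends Stage {
>>> >>         Table[] process(Table... inputs);
>>> >>     }
>>> >>
>>> >>     // A Transformer is a more specific AlgoOp (a Data -> Data
>>> >>     // conversion), so it can go wherever an AlgoOp is accepted.
>>> >>     interface Transformer extends AlgoOp {
>>> >>         Table[] transform(Table... inputs);
>>> >>
>>> >>         @Override
>>> >>         default Table[] process(Table... inputs) {
>>> >>             return transform(inputs);
>>> >>         }
>>> >>     }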
>>> >>
>>> >> This also covers my thoughts regarding the concern that making
>>> >> Transformer/Estimator MIMO would break the convention of the
>>> >> single-input single-output (SISO) Transformer/Estimator. Since this
>>> >> does not change the core semantics of Transformer/Estimator, it sounds
>>> >> like an intuitive extension to me.
>>> >>
>>> >> 2. Another semantics-related case brought up was heterogeneous
>>> >> topologies in training and inference. In that case, the input of an
>>> >> Estimator would be different from the input of the Transformer
>>> >> returned by Estimator.fit(). The example for this case is Word2Vec,
>>> >> where the input of the Estimator would be an article while the input
>>> >> to the Transformer is a single word. The widely recognized ML Pipeline
>>> >> doesn't seem to support this case, because it assumes the inputs of
>>> >> the Estimator and the corresponding Transformer are the same.
>>> >>
>>> >> Both proposal 1 and proposal 2 leave this case unsupported in the
>>> >> Pipeline. To support it,
>>> >>    - Proposal 1 adds support for such cases in Graph/GraphModel by
>>> >> introducing "EstimatorInput" and "TransformerInput". The downside is
>>> >> that this complicates the API (see the rough sketch after this list).
>>> >>    - Proposal 2 leaves it to users to construct two different DAGs for
>>> >> training and inference respectively. This means users would have to
>>> >> construct the DAG twice even if most parts of it are the same in
>>> >> training and inference.
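>>> >>
>>> >> For reference, the Graph flavor of proposal 1 might look roughly like
>>> >> the following pseudocode for the Word2Vec case (method names such as
>>> >> setEstimatorInput() are purely illustrative):
>>> >>
>>> >>     // Pseudocode sketch, not the FLIP API: one DAG, two input kinds.
>>> >>     GraphBuilder builder = new GraphBuilder();
>>> >>     TableId articles = builder.createTableId(); // used only by fit()
>>> >>     TableId words = builder.createTableId();    // used by transform()
>>> >>     TableId embeddings = builder
>>> >>         .addStage(new Word2Vec())
>>> >>         .setEstimatorInput(articles)
>>> >>         .setTransformerInput(words)
>>> >>         .getOutput();
>>> >>     Graph graph = builder.buildGraph();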
>>> >>
>>> >> My gut feeling is that this is not a critical difference, because such
>>> >> a heterogeneous topology is sort of a corner case. Most users do not
>>> >> need to worry about it. For those who do need it, either proposal 1 or
>>> >> proposal 2 seems acceptable to me. That said, it looks like with
>>> >> proposal 1, users can still choose to write the program twice without
>>> >> using the Graph API, just like what they would do in proposal 2. So
>>> >> technically speaking, proposal 1 is more flexible and allows users to
>>> >> choose either flavor. On the other hand, one could argue that proposal
>>> >> 1 may confuse users with these two flavors. Although personally it
>>> >> seems clear to me, I am open to other ideas.
>>> >>
>>> >> 3. Lastly, there was a concern about proposal 1: some Estimators can
>>> >> no longer be added to a Pipeline, while the original Pipeline accepts
>>> >> any Estimator.
>>> >>
>>> >> It seems that users always have to make sure the input schema required
>>> >> by the Estimator matches the input table. So even for the existing
>>> >> Pipeline, people cannot naively add any Estimator into a pipeline.
>>> >> Admittedly, proposal 1 adds some more requirements, namely 1) the
>>> >> number of inputs needs to match the number of outputs of the previous
>>> >> stage, and 2) the Estimator must not generate a Transformer with a
>>> >> different required input schema (the heterogeneous case mentioned
>>> >> above). However, given that these mismatches will result in exceptions
>>> >> at compile time, just as when users put in an Estimator with a
>>> >> mismatched input schema, personally I find that this does not change
>>> >> the user experience much.
>>> >>
>>> >>
>>> >> So to summarize my thoughts on this fundamental difference:
>>> >>     - In general, I personally prefer having one set of APIs.
>>> >>     - The current proposal 1 may need some improvements in a few
>>> >> cases, by borrowing ideas from proposal 2.
>>> >>
>>> >>
>>> >>
>>> >> A few other differences that I consider non-fundamental:
>>> >>
>>> >> * Do we need a top-level encapsulation API for an algorithm? *
>>> >>
>>> >> Proposal 1 has the concept of a Graph, which encapsulates an entire
>>> >> algorithm to provide a unified API following the same semantics as
>>> >> Estimator/Transformer. Users can choose not to package everything into
>>> >> a Graph, and instead just write their own program and wrap it as an
>>> >> ordinary function.
>>> >>
>>> >> Proposal 2 does not have a top-level API such as Graph. Instead, users
>>> >> can choose to write an arbitrary function if they want to.
>>> >>
>>> >> From what I understand, in proposal 1, users may still choose to
>>> >> ignore the Graph API and simply construct a DAG by themselves by
>>> >> calling transform() and fit(), or calling AlgoOp.process() if we add
>>> >> "AlgoOp" to proposal 1 as I suggested earlier. So Graph is just an
>>> >> additional way to construct a graph - people can use Graph in a
>>> >> similar way as they use Pipeline/PipelineModel. In other words, there
>>> >> is no conflict between proposal 1 and proposal 2 here.
>>> >>
>>> >>
>>> >> * The ways to describe a Graph? *
>>> >>
>>> >> Proposal 1 gives two ways to construct a DAG:
>>> >> 1. the raw API using Estimator/Transformer (and potentially "AlgoOp");
>>> >> 2. the GraphBuilder API.
>>> >>
>>> >> Proposal 2 only gives the raw AlgoOperator API. It assumes there is a
>>> >> main output and possibly some side outputs, so it can call
>>> >> algoOp1.linkFrom(algoOp2) without specifying the index of the output,
>>> >> at the cost of wrapping all the Tables into an AlgoOperator.
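>>> >>
>>> >> To illustrate the two raw styles on the same tiny DAG (operator and
>>> >> class names are hypothetical; tEnv is assumed to be a
>>> >> TableEnvironment):
>>> >>
>>> >>     // Proposal 2 (Alink-style): the main output is implicit in the
>>> >>     // operator, so no output index needs to be specified.
>>> >>     AlgoOperator source = new CsvSourceOp("/path/to/train.csv");
>>> >>     AlgoOperator trainOp = new KMeansTrainOp().linkFrom(source);
>>> >>
>>> >>     // Proposal 1 (raw Estimator/Transformer style): Tables are
>>> >>     // passed around explicitly.
>>> >>     Table data = tEnv.from("train_data");
>>> >>     KMeansModel model = new KMeans().fit(data);
>>> >>     Table[] predictions = model.transform(data);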
>>> >>
>>> >> The usability argument was mostly about the raw APIs. I don't think
>>> >> the two APIs differ much from each other. Under the same assumptions,
>>> >> proposal 1 and proposal 2 can probably achieve very similar levels of
>>> >> usability when describing a Graph, if not exactly the same.
>>> >>
>>> >>
>>> >> There are some other differences/arguments between the two proposals.
>>> >> However, I don't think they are fundamental, and just like the cases
>>> >> mentioned above, the two proposals can easily learn from each other.
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Jiangjie (Becket) Qin
>>> >>
>>> >> On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <lindon...@gmail.com> wrote:
>>> >>
>>> >>> Hi all,
>>> >>>
>>> >>> Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two
>>> >>> different designs to extend the Flink ML API to support more
>>> >>> use-cases, e.g. expressing a DAG of preprocessing and training logic.
>>> >>> These two designs have been documented in FLIP-173
>>> >>> <
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783
>>> >>> >.
>>> >>>
>>> >>> We have different opinions on the usability and ease of understanding
>>> >>> of the proposed APIs. It would be really useful to have comments on
>>> >>> these designs from the open source community and to learn your
>>> >>> preferences.
>>> >>>
>>> >>> To facilitate the discussion, we have summarized our design
>>> >>> principles and opinions in this Google doc
>>> >>> <
>>> https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do
>>> >>> >.
>>> >>> Code snippets for a few example use-cases are also provided in this
>>> >>> doc to demonstrate the difference between these two solutions.
>>> >>>
>>> >>> This Flink ML API is super important to the future of the Flink ML
>>> >>> library. Please feel free to reply to this email thread or comment in
>>> >>> the Google doc directly.
>>> >>>
>>> >>> Thank you!
>>> >>> Dong, Zhipeng, Fan
>>> >>>
>>> >>
>>>
>>
>
> --
> best,
> Zhipeng
>
>
