Re: [DISCUSS] FLIP-173: Support DAG of algorithms (Flink ML)

Fan Hong Thu, 19 Aug 2021 07:26:01 -0700

Hi, Becket,

Many thanks for your detailed review. I agree that it is easier to involve
more people in the discussion if the fundamental differences are highlighted.

Here are some of my thoughts to help other people think about these
differences. (Correct me if the technical details are not right.)

1. One set of APIs or not? Maybe not that important.

First of all, AlgoOperators and Pipeline / Transformer / Estimator in
Proposal 2 are absolutely *NOT* independent.

One may think they are independent because Pipeline / Transformer /
Estimator are already in Flink ML Lib while AlgoOperators are newly added
in this proposal. But that's not true. If you check Alink[1], where the idea
of Proposal 2 originated, both of them have been present for a long time,
and they collaborate tightly.

In terms of functionality, they are also not independent. Their relation is
more like a two-level API for specifying ML tasks: AlgoOperators is a
general-purpose level that can represent any ML algorithm, while Pipeline /
Transformer / Estimator provides a higher-level API that enables wrapping
multiple ML algorithms together in a fit-transform way.

One could consider Flink DataStream - Table as an analogy to AlgoOperators
- Pipeline. The two levels of API provide different functionality to end
users, and the higher-level API calls the lower-level API in its internal
implementation. I'm not saying the two-level API design in Proposal 2 is
good because Flink already did this. I just hope to help people in the
community understand the relation between AlgoOperators and Pipeline.
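
To make the two-level relation concrete, here is a minimal sketch in plain
Java. Every type in it (Table, AlgoOperator, Transformer, Estimator, MeanOp,
SubtractOp, CenteringEstimator) is a hypothetical stand-in written for this
mail, not the real Flink ML or Alink API. It only shows the shape: the
lower-level operators accept any number of inputs, and the higher-level
Estimator is implemented by calling them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: every type here is a hypothetical stand-in,
// not the real Flink ML or Alink API.
final class TwoLevelApiSketch {

    /** Stand-in for a Flink Table: a single column of doubles. */
    static final class Table {
        final List<Double> values;

        Table(List<Double> values) {
            this.values = values;
        }
    }

    /** Lower level: a general-purpose node with any number of inputs and outputs. */
    interface AlgoOperator {
        Table[] process(Table... inputs);
    }

    /** Higher level: single-input, single-output stages in fit-transform style. */
    interface Transformer {
        Table transform(Table input);
    }

    interface Estimator {
        Transformer fit(Table input);
    }

    /** Lower-level operator: one input, one output holding the mean. */
    static final class MeanOp implements AlgoOperator {
        @Override
        public Table[] process(Table... inputs) {
            double sum = 0.0;
            for (double v : inputs[0].values) {
                sum += v;
            }
            double mean = sum / inputs[0].values.size();
            return new Table[] {new Table(Collections.singletonList(mean))};
        }
    }

    /** Lower-level operator: two inputs (data, one-row offset), one output. */
    static final class SubtractOp implements AlgoOperator {
        @Override
        public Table[] process(Table... inputs) {
            double offset = inputs[1].values.get(0);
            List<Double> out = new ArrayList<>();
            for (double v : inputs[0].values) {
                out.add(v - offset);
            }
            return new Table[] {new Table(out)};
        }
    }

    /** Higher-level Estimator whose implementation calls the lower-level operators. */
    static final class CenteringEstimator implements Estimator {
        @Override
        public Transformer fit(Table input) {
            Table model = new MeanOp().process(input)[0];
            // The returned Transformer is SISO; the model table is captured inside it.
            return data -> new SubtractOp().process(data, model)[0];
        }
    }

    public static void main(String[] args) {
        Table train = new Table(Arrays.asList(1.0, 2.0, 3.0));
        Transformer model = new CenteringEstimator().fit(train);
        System.out.println(model.transform(train).values); // prints [-1.0, 0.0, 1.0]
    }
}
```

Users who need the two-input SubtractOp directly can still call it, while
users who only want fit-transform never see it.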

An additional use and benefit of the Pipeline API is that a SISO
PipelineModel corresponds exactly to a deployable unit for online serving.

In online serving, the Flink runtime is usually avoided to achieve low
latency. So models have to be wrapped for transfer from the Flink ecosystem
to a non-Flink one. This is where wrapping is really needed and inevitable,
because serving service providers are usually expected to support one
general type of model. The Pipeline API in Proposal 2 targets exactly this
scenario without complicated APIs.

Yet offline or nearline inference can be completed within the Flink
ecosystem. Flink ML Lib is still available there, so a loose wrapping
using AlgoOperators in Proposal 2 still works without much overhead.

At the same time, these two levels of APIs are not redundant in
functionality; they have to collaborate to build ML tasks.

The AlgoOperator API is self-consistent and self-complete for constructing
ML tasks, but if users want to wrap a sequence of subtasks, especially for
online serving, the Pipeline / Transformer / Estimator API is indispensable.
On the other hand, the Pipeline / Transformer / Estimator API lacks
completeness, even in the extended version plus the Graph API in Proposal 1
(last case in [4]), so it cannot replace the AlgoOperator API.

One example of their collaboration is in my response to Mingliang's
recommendation scenarios, where AlgoOperators + Pipeline can provide
cleaner usage than the Graph API.

2. What are the core semantics of Pipeline / Transformer / Estimator?

I will not give my answer because I can't. I think it would be difficult to
reach an agreement on this.

But I did two things, and hope they can provide some hints.

One thing is to seek answers from other ML libraries. Scikit-learn and
Spark ML are well-known general-purpose ML libraries.

Spark ML gives the definitions of Pipeline / Transformer / Estimator in its
documentation. I quote them here [2]:

> *Transformer*
> <https://spark.apache.org/docs/latest/ml-pipeline.html#transformers>:
> A Transformer is an algorithm which can transform *one* DataFrame into
> *another* DataFrame. E.g., an ML model is a Transformer which transforms
> *a* DataFrame with features into *a* DataFrame with predictions.
>
> *Estimator*
> <https://spark.apache.org/docs/latest/ml-pipeline.html#estimators>:
> An Estimator is an algorithm which can be fit on *a* DataFrame to produce
> a Transformer. E.g., a learning algorithm is an Estimator which trains on
> a DataFrame and produces a model.
>
> *Pipeline*
> <https://spark.apache.org/docs/latest/ml-pipeline.html#pipeline>:
> A Pipeline chains multiple Transformers and Estimators together to specify
> an ML workflow.

Spark ML clearly declares the number of inputs and outputs for the Estimator
and Transformer APIs. Scikit-learn does not give a clear definition, but
instead presents its APIs [3]:

> *Estimator:* The base object, implements a fit method to learn from data,
> either:
> estimator = estimator.fit(data, targets)
> or:
> estimator = estimator.fit(data)
>
> *Transformer:* For filtering or modifying the data, in a supervised or
> unsupervised way, implements:
> new_data = transformer.transform(data)
> When fitting and transforming can be performed much more efficiently
> together than separately, implements:
> new_data = transformer.fit_transform(data)

Their API signatures define exactly one input and one output.

The other thing I did was to look for concepts in Big Data APIs that are
analogous to Pipeline / Transformer / Estimator in ML APIs, so that non-ML
developers may better understand their positions in ML APIs.

In the end, I think 'map' in the MapReduce paradigm may be a fair analogy
that is easy for everyone to understand. One may think of 'map' as the
MapFunction or FlatMapFunction in Flink, or the Mapper in Hadoop. As far as
I know, no Big Data API tries to extend 'map' to support multiple inputs or
outputs while still keeping the original name. In Flink, there are co-Map
and co-FlatMap, which can be considered extensions, yet they do not use the
plain name 'map'.

So, are the core semantics of 'map' a conversion from data to data, or from
one dataset to another dataset? Whichever the answer, the fact is that no
one breaks the usage convention of 'map'.
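
The naming point can be sketched as follows. The two interfaces below are
simplified local mirrors of the shapes of Flink's MapFunction and
CoMapFunction, redefined here so that the snippet compiles without Flink on
the classpath (the real interfaces also declare exceptions and extend other
Flink types). The helpers applyMap and applyCoMap are hypothetical and exist
only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified local mirrors of Flink's MapFunction / CoMapFunction shapes.
// The helper methods are hypothetical and not part of any Flink API.
final class MapNamingSketch {

    /** One input type, one output type: the conventional 'map'. */
    interface MapFunction<T, O> {
        O map(T value);
    }

    /** Two input types: the extension has its own name and its own methods. */
    interface CoMapFunction<IN1, IN2, OUT> {
        OUT map1(IN1 value);

        OUT map2(IN2 value);
    }

    static <T, O> List<O> applyMap(List<T> input, MapFunction<T, O> f) {
        List<O> out = new ArrayList<>();
        for (T value : input) {
            out.add(f.map(value));
        }
        return out;
    }

    /**
     * Processes the first input and then the second. A real connected stream
     * would interleave elements; the order here is only for determinism.
     */
    static <A, B, O> List<O> applyCoMap(
            List<A> first, List<B> second, CoMapFunction<A, B, O> f) {
        List<O> out = new ArrayList<>();
        for (A value : first) {
            out.add(f.map1(value));
        }
        for (B value : second) {
            out.add(f.map2(value));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> mapped = applyMap(Arrays.asList(1, 2), n -> "n" + n);
        List<String> coMapped = applyCoMap(
                Arrays.asList(1, 2),
                Arrays.asList(true),
                new CoMapFunction<Integer, Boolean, String>() {
                    @Override
                    public String map1(Integer value) {
                        return "int:" + value;
                    }

                    @Override
                    public String map2(Boolean value) {
                        return "bool:" + value;
                    }
                });
        System.out.println(mapped);
        System.out.println(coMapped);
    }
}
```

The single-input 'map' signature stays untouched, and the two-input variant
lives under a different name.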

3. About potentially inconsistent availability of algorithms

Becket has mentioned that developers may be confused about how to implement
the same algorithm in the two levels of APIs in Proposal 2.

If one accepts the relation between the AlgoOperator API and the Pipeline
API described before, then it is not a problem. It is natural that
developers implement their algorithms in AlgoOperators, and call
AlgoOperators in Estimators/Transformers.

If not, I propose a rough idea here:

An abstract class AlgoOpEstimatorImpl is provided as a subclass of
Estimator. It has a method named getTrainOp() which returns the
AlgoOperator where the computation logic resides. The rest of the code in
AlgoOpEstimatorImpl is fixed. In this way, developers of Flink ML Lib are
asked to implement an Estimator by inheriting from AlgoOpEstimatorImpl.
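
The rough idea above could be sketched as follows, in plain Java. Only the
names AlgoOpEstimatorImpl and getTrainOp() come from the idea itself. The
Table stand-in, the interface signatures, the getPredictOp() hook and the
MaxAbsScaler example are all assumptions of this sketch, added so that the
fixed fit() code has something to build a Transformer from.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch under stated assumptions; not the real Flink ML or Alink API.
final class AlgoOpEstimatorSketch {

    /** Stand-in for a Flink Table: a single column of doubles. */
    static final class Table {
        final List<Double> values;

        Table(List<Double> values) {
            this.values = values;
        }
    }

    interface AlgoOperator {
        Table[] process(Table... inputs);
    }

    interface Transformer {
        Table transform(Table input);
    }

    interface Estimator {
        Transformer fit(Table input);
    }

    /** Fixed glue code: subclasses only say which operators hold the logic. */
    abstract static class AlgoOpEstimatorImpl implements Estimator {

        /** Named in the idea above: the operator where the training logic resides. */
        protected abstract AlgoOperator getTrainOp();

        /** Assumption of this sketch: an operator that takes (data, model). */
        protected abstract AlgoOperator getPredictOp();

        @Override
        public final Transformer fit(Table input) {
            Table model = getTrainOp().process(input)[0];
            AlgoOperator predictOp = getPredictOp();
            return data -> predictOp.process(data, model)[0];
        }
    }

    /** Hypothetical library algorithm: scales values by the largest absolute value. */
    static final class MaxAbsScaler extends AlgoOpEstimatorImpl {
        @Override
        protected AlgoOperator getTrainOp() {
            return inputs -> {
                double maxAbs = 0.0;
                for (double v : inputs[0].values) {
                    maxAbs = Math.max(maxAbs, Math.abs(v));
                }
                return new Table[] {new Table(Collections.singletonList(maxAbs))};
            };
        }

        @Override
        protected AlgoOperator getPredictOp() {
            return inputs -> {
                // inputs[0] is the data, inputs[1] is the one-row model table.
                double scale = inputs[1].values.get(0);
                List<Double> out = new ArrayList<>();
                for (double v : inputs[0].values) {
                    out.add(v / scale);
                }
                return new Table[] {new Table(out)};
            };
        }
    }

    public static void main(String[] args) {
        Table train = new Table(Arrays.asList(2.0, -4.0, 1.0));
        Transformer model = new MaxAbsScaler().fit(train);
        System.out.println(model.transform(train).values); // prints [0.5, -1.0, 0.25]
    }
}
```

With this shape the algorithm is written once, in the AlgoOperators, and the
Estimator becomes a thin and uniform wrapper around them.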

Other solutions are also possible, but may still need some community
convention.

I would also like to mention that the same issue exists in Proposal 1, as
there are also multiple places where developers can implement algorithms.

In summary, I think the first and second issues above are
preference-related, and I hope my thoughts can give some clues. The third
issue can be considered a common technical problem in both proposals. We
may work together to seek better solutions.

Sincerely,

Fan Hong.

[1] https://github.com/alibaba/Alink

[2] https://spark.apache.org/docs/latest/ml-pipeline.html

[3] https://scikit-learn.org/stable/developers/develop.html

[4] https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do

On Tue, Jul 20, 2021 at 11:42 AM Becket Qin <[email protected]> wrote:

> Hi Dong, Zhipeng and Fan,
>
> Thanks for the detailed proposals. It is quite a lot of reading! Given
> that we are introducing a lot of stuff here, I find that it might be easier
> for people to discuss if we can list the fundamental differences first.
> From what I understand, the very fundamental difference between the two
> proposals is the following:
>
> * In order to support graph structure, do we extend Transformer/Estimator,
> or do we introduce a new set of API? *
>
> Proposal 1 tries to keep one set of API, which is based on
> Transformer/Estimator/Pipeline. More specifically, it does the following:
>     - Make Transformer and Estimator multi-input and multi-output (MIMO).
>     - Introduce a Graph/GraphModel as counterparts of
> Pipeline/PipelineModel.
>
> Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is.
> Instead, it introduces AlgoOperators to support the graph structure. The
> AlgoOperators are general-purpose graph nodes supporting MIMO. They are
> independent of Pipeline / Transformer / Estimator.
>
>
> My two cents:
>
> I think it is a big advantage to have a single set of API rather than two
> independent sets of API, if possible. But I would suggest we change the
> current proposal 1 a little bit, by learning from proposal 2.
>
> What I like about proposal 1:
> 1. A single set of API, symmetric in Graph/GraphModel and
> Pipeline/PipelineModel.
> 2. Keeping most of the benefits from Transformer/Estimator, including the
> fit-then-transform relation and save/load capability.
>
> However, proposal 1 also introduced some changes that I am not sure about:
>
> 1. The most controversial part of proposal 1 is whether we should extend
> the Transformer/Estimator/Pipeline? In fact, different projects have
> slightly different designs for Transformer/Estimator/Pipeline. So I think
> it is OK to extend it. However, there are some commonly recognized core
> semantics that we ideally want to keep. To me these core semantics are:
>   1. Transformer is a Data -> Data conversion, Estimator deals with Data
> -> Model conversion.
>   2. Estimator.fit() gives a Transformer, and users can just call
> Transformer.transform() to perform inference.
> To me, as long as these core semantics are kept, extension to the API
> seems acceptable.
>
> Proposal 1 extends the semantic of Transformer from Data -> Data
> conversion to generic Table -> Table conversion, and claims it is
> equivalent to "AlgoOperator" in proposal 2 as a general-purpose graph node.
> It does change the first semantic. That said, this might just be a naming
> problem, though. One possible solution to this problem is having a new
> subclass of Stage without strong conventional semantics, e.g. "AlgoOp" if
> we borrow the name from proposal 2, and let Transformer extend it. Just
> like a PipelineModel is a more specific Transformer, a Transformer would be
> a more specific "AlgoOp". If we do that, the processing logic that people
> don't feel comfortable to be a Transformer can just be put into an "AlgoOp"
> and thus can still be added to a Pipeline / Graph. This borrows the
> advantage of proposal 2. In other words, this essentially makes the
> "AlgoOp" equivalent to "AlgoOperator" in proposal 2, but allows it to be
> added to the Graph and Pipeline if people want to.
>
> This also gives my thoughts regarding the concern that making the
> Transformer/Estimator to MIMO would break the convention of single input
> single output (SISO) Transformer/Estimator. Since this does not change the
> core semantic of Transformer/Estimator, it sounds like an intuitive extension to
> me.
>
> 2. Another semantic related case brought up was heterogeneous topologies
> in training and inference. In that case, the input of an Estimator would be
> different from the input of the transformer returned by Estimator.fit().
> The example to this case is Word2Vec, where the input of the Estimator
> would be an article while the input to the Transformer is a single word.
> The well recognized ML Pipeline doesn't seem to support this case, because
> it assumes the input of the Estimator and corresponding Transformer are the
> same.
>
> Both proposal 1 and proposal 2 leave this case unsupported in the
> Pipeline. To support this case,
>    - Proposal 1 adds support to such cases in the Graph/GraphModel by
> introducing "EstimatorInput" and "TransformerInput". The downside is that
> it complicates the API.
>    - Proposal 2 leaves this to users to construct two different DAG for
> training and inference respectively. This means users would have to
> construct the DAG twice even if most parts of the DAG are the same in
> training and inference.
>
> My gut feeling is that this is not a critical difference because such
> heterogeneous topology is sort of a corner case. Most users do not need to
> worry about this. For those who do need this, either proposal 1 or
> proposal 2 seems acceptable to me. That said, it looks that with proposal
> 1, users can still choose to write the program twice without using the
> Graph API, just like what they do in proposal 2. So technically speaking,
> proposal 1 is more flexible and allows users to choose either flavor. On
> the other hand, one could argue that proposal 1 may confuse users with
> these two flavors. Although personally I feel it is clear to me, I am open
> to other ideas.
>
> 3. Lastly, there was a concern about proposal 1 that some Estimators
> can no longer be added to the Pipeline while the original Pipeline accepts
> any Estimator.
>
> It seems that users have to always make sure the input schema required by
> the Estimator matches the input table. So even for the existing Pipeline,
> people cannot naively add any Estimator into a pipeline. Admittedly,
> proposal 1 added some more requirements, namely 1) the number of inputs
> needs to match the number of outputs of the previous stage, and 2) the
> Estimator does not generate a transformer with different required input
> schema (the heterogeneous case mentioned above). However, given that these
> mismatches will result in exceptions at compile time, just like users put
> an Estimator with mismatched input schema, personally I find it does not
> change the user experience much.
>
>
> So to summarize my thoughts on this fundamental difference.
>     - In general, I personally prefer having one set of API.
>     - The current proposal 1 may need some improvements in some cases, by
> borrowing something from proposal 2.
>
>
>
> A few other differences that I consider as non-fundamental:
>
> * Do we need a top level encapsulation API for an Algorithm? *
>
> Proposal 1 has a concept of Graph which encapsulates the entire algorithm
> to provide a unified API following the same semantic of
> Estimator/Transformer. Users can choose not to package everything into a
> Graph, but just write their own program and wrap it as an ordinary function.
>
> Proposal 2 does not have the top level API such as Graph. Instead, users
> can choose to write an arbitrary function if they want to.
>
> From what I understand, in proposal 1, users may still choose to ignore
> Graph API and simply construct a DAG by themselves by calling transform()
> and fit(), or calling AlgoOp.process() if we add "AlgoOp" to proposal 1 as
> I suggested earlier. So Graph is just an additional way to construct a
> graph - people can use Graph in a similar way as they do to the
> Pipeline/PipelineModel. In other words, there is no conflict between
> proposal 1 and proposal 2.
>
>
> * The ways to describe a Graph? *
>
> Proposal 1 gives two ways to construct a DAG.
> 1. the raw API using Estimator/Transformer(potentially "AlgoOp" as well).
> 2. using the GraphBuilder API.
>
> Proposal 2 only gives the raw API of AlgoOperator. It assumes there is a
> main output and some other side outputs, so it can call
> algoOp1.linkFrom(algoOp2) without specifying the index of the output, at
> the cost of wrapping all the Tables into an AlgoOperator.
>
> The usability argument was mostly around the raw APIs. I don't think the
> two APIs differ too much from each other. With the same assumption,
> proposal 1 and proposal 2 can probably achieve very similar levels of
> usability when describing a Graph, if not exactly the same.
>
>
> There are some more other differences/arguments mentioned between the two
> proposals. However, I don't think they are fundamental. And just like the
> cases mentioned above, the two proposals can easily learn from each other.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <[email protected]> wrote:
>
>> Hi all,
>>
>> Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two different
>> designs to extend Flink ML API to support more use-cases, e.g. expressing a
>> DAG of preprocessing and training logics. These two designs have been
>> documented in FLIP-173
>> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783>.
>>
>> We have different opinions on the usability and the ease-of-understanding
>> of the proposed APIs. It will be really useful to have comments of those
>> designs from the open source community and to learn your preferences.
>>
>> To facilitate the discussion, we have summarized our design principles and
>> opinions in this Google doc
>> <https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do>.
>> Code snippets for a few example use-cases are also provided in this doc to
>> demonstrate the difference between these two solutions.
>>
>> This Flink ML API is super important to the future of Flink ML library.
>> Please feel free to reply to this email thread or comment in the Google doc
>> directly.
>>
>> Thank you!
>> Dong, Zhipeng, Fan
>>
>
