Re: [DISCUSS] FLIP-173: Support DAG of algorithms (Flink ML)

Becket Qin Mon, 19 Jul 2021 20:42:58 -0700

Hi Dong, Zhipeng and Fan,

Thanks for the detailed proposals. It is quite a lot of reading! Given that
we are introducing a lot of stuff here, I find that it might be easier for
people to discuss if we can list the fundamental differences first. From
what I understand, the very fundamental difference between the two
proposals is following:

* In order to support graph structure, do we extend Transformer/Estimator,
or do we introduce a new set of API? *

Proposal 1 tries to keep one set of API, which is based on
Transformer/Estimator/Pipeline. More specifically, it does the following:
    - Make Transformer and Estimator multi-input and multi-output (MIMO).
    - Introduce a Graph/GraphModel as counter parts of
Pipeline/PipelineModel.

Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is.
Instead, it introduces AlgoOperators to support the graph structure. The
AlgoOperators are general-purpose graph nodes supporting MIMO. They are
independent of Pipeline / Transformer / Estimator.

My two cents:

I think it is a big advantage to have a single set of API rather than two
independent sets of API, if possible. But I would suggest we change the
current proposal 1 a little bit, by learning from proposal 2.

What I like about proposal 1:
1. A single set of API, symmetric in Graph/GraphModel and
Pipeline/PipelineModel.
2. Keeping most of the benefits from Transformer/Estimator, including the
fit-then-transform relation and save/load capability.

However, proposal 1 also introduced some changes that I am not sure about:

1. The most controversial part of proposal 1 is whether we should extend
the Transformer/Estimator/Pipeline? In fact, different projects have
slightly different designs for Transformer/Estimator/Pipeline. So I think
it is OK to extend it. However, there are some commonly recognized core
semantics that we ideally want to keep. To me these core semantics are:
  1. Transformer is a Data -> Data conversion, Estimator deals with Data ->
Model conversion.
  2. Estimator.fit() gives a Transformer, and users can just call
Transformer.transform() to perform inference.
To me, as long as these core semantics are kept, extension to the API seems
acceptable.

Proposal 1 extends the semantic of Transformer from Data -> Data conversion
to generic Table -> Table conversion, and claims it is equivalent to
"AlgoOperator" in proposal 2 as a general-purpose graph node. It does
change the first semantic. That said, this might just be a naming problem,
though. One possible solution to this problem is having a new subclass of
Stage without strong conventional semantics, e.g. "AlgoOp" if we borrow the
name from proposal 2, and let Transformer extend it. Just like a
PipelineModel is a more specific Transformer, a Transformer would be a more
specific "AlgoOp". If we do that, the processing logic that people don't
feel comfortable to be a Transformer can just be put into an "AlgoOp" and
thus can still be added to a Pipeline / Graph. This borrows the advantage
of proposal 2. In another word, this essentially makes the "AlgoOp"
equivalent of "AlgoOperator" in proposal 2, but allows it to be added to
the Graph and Pipeline if people want to.

This also gives my thoughts regarding the concern that making the
Transformer/Estimator to MIMO would break the convention of single input
single output (SISO) Transformer/Estimator. Since this does not change the
core semantic of Transformer/Estimator, it sounds an intuitive extension to
me.

2. Another semantic related case brought up was heterogeneous topologies in
training and inference. In that case, the input of an Estimator would be
different from the input of the transformer returned by Estimator.fit().
The example to this case is Word2Vec, where the input of the Estimator
would be an article while the input to the Transformer is a single word.
The well recognized ML Pipeline doesn't seem to support this case, because
it assumes the input of the Estimator and corresponding Transformer are the
same.

Both proposal 1 and proposal 2 leaves this case unsupported in the
Pipeline. To support this case,
   - Proposal 1 adds support to such cases in the Graph/GraphModel by
introducing "EstimatorInput" and "TransformerInput". The downside is that
it complicates the API.
   - Proposal 2 leaves this to users to construct two different DAG for
training and inference respectively. This means users would have to
construct the DAG twice even if most parts of the DAG are the same in
training and inference.

My gut feeling is that this is not a critical difference because such
heterogeneous topology is sort of a corner case. Most users do not need to
worry about this. For those who do need this, either proposal 1 and
proposal 2 seems acceptable to me. That said, it looks that with proposal
1, users can still choose to write the program twice without using the
Graph API, just like what they do in proposal 2. So technically speaking,
proposal 1 is more flexible and allows users to choose either flavor. On
the other hand, one could argue that proposal 1 may confuse users with
these two flavors. Although personally I feel it is clear to me, I am open
to other ideas.

3. Lastly, there was a concern about proposal 1 is that some Estimators can
no longer be added to the Pipeline while the original Pipeline accepts any
Estimator.

It seems that users have to always make sure the input schema required by
the Estimator matches the input table. So even for the existing Pipeline,
people cannot naively add any Estimator into a pipeline. Admittedly,
proposal 1 added some more requirements, namely 1) the number of inputs
needs to match the number of outputs of the previous stage, and 2) the
Estimator does not generate a transformer with different required input
schema (the heterogeneous case mentioned above). However, given that these
mismatches will result in exceptions at compile time, just like users put
an Estimator with mismatched input schema, personally I find it does not
change the user experience much.

So to summarize my thoughts on this fundamental difference.
    - In general, I personally prefer having one set of API.
    - The current proposal 1 may need some improvements in some cases, by
borrowing something from proposal 2.

A few other differences that I consider as non-fundamental:

* Do we need a top level encapsulation API for an Algorithm? *

Proposal 1 has a concept of Graph which encapsulates the entire algorithm
to provide a unified API following the same semantic of
Estimator/Transformer. Users can choose not to package everything into a
Graph, but just write their own program and wrap it as an ordinary function.

Proposal 2 does not have the top level API such as Graph. Instead, users
can choose to write an arbitrary function if they want to.

>From what I understand, in proposal 1, users may still choose to ignore
Graph API and simply construct a DAG by themselves by calling transform()
and fit(), or calling AlgoOp.process() if we add "AlgoOp" to proposal 1 as
I suggested earlier. So Graph is just an additional way to construct a
graph - people can use Graph in a similar way as they do to the
Pipeline/Pipeline model. In another word, there is no conflict between
proposal 1 and proposal 2.

* The ways to describe a Graph? *

Proposal 1 gives two ways to construct a DAG.
1. the raw API using Estimator/Transformer(potentially "AlgoOp" as well).
2. using the GraphBuilder API.

Proposal 2 only gives the raw API of AlgoOpertor. It assumes there is a
main output and some other side outputs, so it can call
algoOp1.linkFrom(algoOp2) without specifying the index of the output, at
the cost of wrapping all the Tables into an AlgoOperator.

The usability argument was mostly around the raw APIs. I don't think the
two APIs differ too much from each other. With the same assumption,
proposal 1 and proposal 2 can probably achieve very similar levels of
usability when describing a Graph, if not exactly the same.

There are some more other differences/arguments mentioned between the two
proposals. However, I don't think they are fundamental. And just like the
cases mentioned above, the two proposals can easily learn from each other.

Thanks,

Jiangjie (Becket) Qin

On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <lindon...@gmail.com> wrote:

> Hi all,
>
> Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two different
> designs to extend Flink ML API to support more use-cases, e.g. expressing a
> DAG of preprocessing and training logics. These two designs have been
> documented in FLIP-173
> <
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783
> >
> 。
>
> We have different opinions on the usability and the ease-of-understanding
> of the proposed APIs. It will be really useful to have comments of those
> designs from the open source community and to learn your preferences.
>
> To facilitate the discussion, we have summarized our design principles and
> opinions in this Google doc
> <
> https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do
> >.
> Code snippets for a few example use-cases are also provided in this doc to
> demonstrate the difference between these two solutions.
>
> This Flink ML API is super important to the future of Flink ML library.
> Please feel free to reply to this email thread or comment in the Google doc
> directly.
>
> Thank you!
> Dong, Zhipeng, Fan
>

Re: [DISCUSS] FLIP-173: Support DAG of algorithms (Flink ML)

Reply via email to