Re: [DISCUSS] FLIP-173: Support DAG of algorithms (Flink ML)

Dong Lin Tue, 20 Jul 2021 06:55:21 -0700

Hi Becket,

Thank you for the detailed reply!


My understanding of your comments is that most of option-1 looks good
except its change of the Transformer semantics. Please see my reply inline.


On Tue, Jul 20, 2021 at 11:43 AM Becket Qin <[email protected]> wrote:

> Hi Dong, Zhipeng and Fan,
>
> Thanks for the detailed proposals. It is quite a lot of reading! Given that
> we are introducing a lot of stuff here, I find that it might be easier for
> people to discuss if we can list the fundamental differences first. From
> what I understand, the very fundamental difference between the two
> proposals is following:
>
> * In order to support graph structure, do we extend Transformer/Estimator,
> or do we introduce a new set of API? *
>
> Proposal 1 tries to keep one set of API, which is based on
> Transformer/Estimator/Pipeline. More specifically, it does the following:
>     - Make Transformer and Estimator multi-input and multi-output (MIMO).
>     - Introduce a Graph/GraphModel as counter parts of
> Pipeline/PipelineModel.
>
> Proposal 2 leaves the existing Transformer/Estimator/Pipeline as is.
> Instead, it introduces AlgoOperators to support the graph structure. The
> AlgoOperators are general-purpose graph nodes supporting MIMO. They are
> independent of Pipeline / Transformer / Estimator.
>
>
> My two cents:
>
> I think it is a big advantage to have a single set of API rather than two
> independent sets of API, if possible. But I would suggest we change the
> current proposal 1 a little bit, by learning from proposal 2.
>
> What I like about proposal 1:
> 1. A single set of API, symmetric in Graph/GraphModel and
> Pipeline/PipelineModel.
> 2. Keeping most of the benefits from Transformer/Estimator, including the
> fit-then-transform relation and save/load capability.
>
> However, proposal 1 also introduced some changes that I am not sure about:
>
> 1. The most controversial part of proposal 1 is whether we should extend
> the Transformer/Estimator/Pipeline? In fact, different projects have
> slightly different designs for Transformer/Estimator/Pipeline. So I think
> it is OK to extend it. However, there are some commonly recognized core
> semantics that we ideally want to keep. To me these core semantics are:
>   1. Transformer is a Data -> Data conversion, Estimator deals with Data ->
> Model conversion.
>   2. Estimator.fit() gives a Transformer, and users can just call
> Transformer.transform() to perform inference.
> To me, as long as these core semantics are kept, extension to the API seems
> acceptable.
>
> Proposal 1 extends the semantic of Transformer from Data -> Data conversion
> to generic Table -> Table conversion, and claims it is equivalent to
>

I am a bit confused by this description of the semantic change. By "from
Data -> Data conversion to generic Table -> Table", do you mean "Table !=
Data"?

I am hoping we can clearly define the semantics of Transformer and agree on
its semantics definition within the community. Then we can see how option-1
changes its semantics and update option-1 as appropriate.

"AlgoOperator" in proposal 2 as a general-purpose graph node. It does
> change the first semantic. That said, this might just be a naming problem,
> though. One possible solution to this problem is having a new subclass of
> Stage without strong conventional semantics, e.g. "AlgoOp" if we borrow the
> name from proposal 2, and let Transformer extend it. Just like a
>

Assuming that option-1 does change the meaning of Transformer (this point
depends on the answer to the previous question), I am happy to update
option-1 to address the concern.

Here is how I would update option-1:
- Add an interface named "MLOperator". This interface will be the same as
the existing Transformer interface in option-1 except its name.
- Make Transformer a subclass of MLOperator. Because Transformer has
exactly the same API as MLOperator, no extra method is defined in
Transformer.
- Every other API/class in option-1 which deals with Transformer will now
deal with MLOperator.

This change proposed above basically renamed Transformer to address the
semantic/naming concern. Does this sound good?


> PipelineModel is a more specific Transformer, a Transformer would be a more
> specific "AlgoOp". If we do that, the processing logic that people don't
> feel comfortable to be a Transformer can just be put into an "AlgoOp" and
> thus can still be added to a Pipeline / Graph. This borrows the advantage
> of proposal 2. In another word, this essentially makes the "AlgoOp"
> equivalent of "AlgoOperator" in proposal 2, but allows it to be added to
> the Graph and Pipeline if people want to.
>
> This also gives my thoughts regarding the concern that making the
> Transformer/Estimator to MIMO would break the convention of single input
> single output (SISO) Transformer/Estimator. Since this does not change the
> core semantic of Transformer/Estimator, it sounds an intuitive extension to
> me.
>
> 2. Another semantic related case brought up was heterogeneous topologies in
> training and inference. In that case, the input of an Estimator would be
> different from the input of the transformer returned by Estimator.fit().
> The example to this case is Word2Vec, where the input of the Estimator
> would be an article while the input to the Transformer is a single word.
> The well recognized ML Pipeline doesn't seem to support this case, because
> it assumes the input of the Estimator and corresponding Transformer are the
> same.
>
> Both proposal 1 and proposal 2 leaves this case unsupported in the
> Pipeline. To support this case,
>    - Proposal 1 adds support to such cases in the Graph/GraphModel by
> introducing "EstimatorInput" and "TransformerInput". The downside is that
> it complicates the API.
>    - Proposal 2 leaves this to users to construct two different DAG for
> training and inference respectively. This means users would have to
> construct the DAG twice even if most parts of the DAG are the same in
> training and inference.
>
> My gut feeling is that this is not a critical difference because such
> heterogeneous topology is sort of a corner case. Most users do not need to
> worry about this. For those who do need this, either proposal 1 and
> proposal 2 seems acceptable to me. That said, it looks that with proposal
> 1, users can still choose to write the program twice without using the
> Graph API, just like what they do in proposal 2. So technically speaking,
> proposal 1 is more flexible and allows users to choose either flavor. On
> the other hand, one could argue that proposal 1 may confuse users with
> these two flavors. Although personally I feel it is clear to me, I am open
> to other ideas.
>
> 3. Lastly, there was a concern about proposal 1 is that some Estimators can
> no longer be added to the Pipeline while the original Pipeline accepts any
> Estimator.
>
> It seems that users have to always make sure the input schema required by
> the Estimator matches the input table. So even for the existing Pipeline,
> people cannot naively add any Estimator into a pipeline. Admittedly,
> proposal 1 added some more requirements, namely 1) the number of inputs
> needs to match the number of outputs of the previous stage, and 2) the
> Estimator does not generate a transformer with different required input
> schema (the heterogeneous case mentioned above). However, given that these
> mismatches will result in exceptions at compile time, just like users put
> an Estimator with mismatched input schema, personally I find it does not
> change the user experience much.
>
>
> So to summarize my thoughts on this fundamental difference.
>     - In general, I personally prefer having one set of API.
>     - The current proposal 1 may need some improvements in some cases, by
> borrowing something from proposal 2.
>
>
>
> A few other differences that I consider as non-fundamental:
>
> * Do we need a top level encapsulation API for an Algorithm? *
>
> Proposal 1 has a concept of Graph which encapsulates the entire algorithm
> to provide a unified API following the same semantic of
> Estimator/Transformer. Users can choose not to package everything into a
> Graph, but just write their own program and wrap it as an ordinary
> function.
>
> Proposal 2 does not have the top level API such as Graph. Instead, users
> can choose to write an arbitrary function if they want to.
>
> From what I understand, in proposal 1, users may still choose to ignore
> Graph API and simply construct a DAG by themselves by calling transform()
> and fit(), or calling AlgoOp.process() if we add "AlgoOp" to proposal 1 as
> I suggested earlier. So Graph is just an additional way to construct a
> graph - people can use Graph in a similar way as they do to the
> Pipeline/Pipeline model. In another word, there is no conflict between
> proposal 1 and proposal 2.
>
>
> * The ways to describe a Graph? *
>
> Proposal 1 gives two ways to construct a DAG.
> 1. the raw API using Estimator/Transformer(potentially "AlgoOp" as well).
> 2. using the GraphBuilder API.
>
> Proposal 2 only gives the raw API of AlgoOpertor. It assumes there is a
> main output and some other side outputs, so it can call
> algoOp1.linkFrom(algoOp2) without specifying the index of the output, at
> the cost of wrapping all the Tables into an AlgoOperator.
>
> The usability argument was mostly around the raw APIs. I don't think the
> two APIs differ too much from each other. With the same assumption,
> proposal 1 and proposal 2 can probably achieve very similar levels of
> usability when describing a Graph, if not exactly the same.
>
>
> There are some more other differences/arguments mentioned between the two
> proposals. However, I don't think they are fundamental. And just like the
> cases mentioned above, the two proposals can easily learn from each other.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Thu, Jul 1, 2021 at 7:29 PM Dong Lin <[email protected]> wrote:
>
> > Hi all,
> >
> > Zhipeng, Fan (cc'ed) and I are opening this thread to discuss two
> different
> > designs to extend Flink ML API to support more use-cases, e.g.
> expressing a
> > DAG of preprocessing and training logics. These two designs have been
> > documented in FLIP-173
> > <
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=184615783
> > >
> > 。
> >
> > We have different opinions on the usability and the ease-of-understanding
> > of the proposed APIs. It will be really useful to have comments of those
> > designs from the open source community and to learn your preferences.
> >
> > To facilitate the discussion, we have summarized our design principles
> and
> > opinions in this Google doc
> > <
> >
> https://docs.google.com/document/d/1L3aI9LjkcUPoM52liEY6uFktMnFMNFQ6kXAjnz_11do
> > >.
> > Code snippets for a few example use-cases are also provided in this doc
> to
> > demonstrate the difference between these two solutions.
> >
> > This Flink ML API is super important to the future of Flink ML library.
> > Please feel free to reply to this email thread or comment in the Google
> doc
> > directly.
> >
> > Thank you!
> > Dong, Zhipeng, Fan
> >
>

Re: [DISCUSS] FLIP-173: Support DAG of algorithms (Flink ML)

Reply via email to