Re: [DISCUSS] FLIP-173: Support DAG of algorithms (Flink ML)

Becket Qin Wed, 11 Aug 2021 02:24:40 -0700

Hi Zhipeng,

It looks like there are three different but potentially related things
here.
1. How to describe multiple outputs of a node in the DAG.
2. How to construct / describe the DAG.
3. Do we need an encapsulation class of a DAG, e.g. the Graph class in
option 1?


It is much easier to discuss them if we agree on the fundamental issue I
mentioned above. So the following discussion assumes that we follow the
one-set-of-API approach.

Regarding 1, there are two independent suggestions. Timo suggested named
outputs, and you suggested distinguishing main output from side outputs.
Even with the main output / side outputs, named output may still be useful
because there might be multiple side outputs. If we only look at the node
itself, distinguishing main output from side outputs introduces another
concept, but only avoids the index on the main output. So purely from the
perspective of describing multiple outputs, this doesn't seem very useful.
The main benefit of doing this is that it seems to make the DAG construction
/ description easier, which brings us to the second point.
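
To make the two suggestions concrete, here is a minimal sketch of a result
type that combines them: one main output plus side outputs retrieved by
name. The names getOutputTable() and getSideOutputs() come from this
thread; OperatorResult, getSideOutput(String) and the Table stand-in are
assumptions made only for illustration and are not part of the FLIP.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for Flink's Table so the sketch compiles without Flink (assumption).
final class Table {
    final String name;

    Table(String name) {
        this.name = name;
    }
}

// Hypothetical result type: one main output plus named side outputs.
final class OperatorResult {
    private final Table mainOutput;
    private final Map<String, Table> sideOutputs;

    OperatorResult(Table mainOutput, Map<String, Table> sideOutputs) {
        this.mainOutput = mainOutput;
        // Defensive copy so callers cannot change the outputs afterwards.
        this.sideOutputs =
                Collections.unmodifiableMap(new LinkedHashMap<>(sideOutputs));
    }

    // The common single-output case needs neither an index nor a name.
    Table getOutputTable() {
        return mainOutput;
    }

    // Name-based lookup keeps several side outputs distinguishable.
    Table getSideOutput(String name) {
        Table table = sideOutputs.get(name);
        if (table == null) {
            throw new IllegalArgumentException("No side output named '" + name + "'");
        }
        return table;
    }

    Map<String, Table> getSideOutputs() {
        return sideOutputs;
    }
}
```

With such a type, a caller would write result.getOutputTable() in the
common case and result.getSideOutput("statistics") only when a side output
is needed.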

Regarding 2, the way proposed in option 1 to describe a DAG is indeed a
little verbose, partly due to its generalized multiple output API. So if
introducing the side outputs helps make it simpler, it looks like a good
improvement to me.

Regarding 3, first of all, users can still write programs by calling fit(),
transform() or compute() by themselves even without the encapsulation class
of DAG. It is just like users can just call Estimator.fit() and
Transformer.transform() without using the Pipeline. However, if users want
to reuse the DAG and potentially connect the DAG to another bigger DAG as
an Estimator or Transformer, encapsulation is necessary. In this case, the
DAG description API mentioned in (2) needs to work well with the
encapsulation as well.
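
As a rough illustration of what encapsulation buys, the sketch below wraps
a small DAG (two branches feeding a join stage) in a class that is itself a
Transformer, so it can be plugged into a bigger DAG as a single node. Only
the signature Table[] transform(Table... inputs) is taken from this thread;
TwoBranchGraph, NamedStage and the Table stand-in are hypothetical and are
not the Graph class proposed in option 1.

```java
// Stand-in for Flink's Table so the sketch compiles without Flink (assumption).
final class Table {
    final String name;

    Table(String name) {
        this.name = name;
    }
}

// Signature as quoted in this thread.
interface Transformer {
    Table[] transform(Table... inputs);
}

// Toy stage: records which stage was applied to which inputs.
final class NamedStage implements Transformer {
    private final String tag;

    NamedStage(String tag) {
        this.tag = tag;
    }

    @Override
    public Table[] transform(Table... inputs) {
        StringBuilder sb = new StringBuilder(tag).append('(');
        for (int i = 0; i < inputs.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(inputs[i].name);
        }
        sb.append(')');
        return new Table[] {new Table(sb.toString())};
    }
}

// Hypothetical encapsulation of a fixed DAG shape: left and right branches
// each consume one input, and the join stage consumes both branch outputs.
final class TwoBranchGraph implements Transformer {
    private final Transformer left;
    private final Transformer right;
    private final Transformer join;

    TwoBranchGraph(Transformer left, Transformer right, Transformer join) {
        this.left = left;
        this.right = right;
        this.join = join;
    }

    @Override
    public Table[] transform(Table... inputs) {
        if (inputs.length != 2) {
            throw new IllegalArgumentException("Expected exactly two inputs");
        }
        Table leftOut = left.transform(inputs[0])[0];
        Table rightOut = right.transform(inputs[1])[0];
        return join.transform(leftOut, rightOut);
    }
}
```

Because TwoBranchGraph implements Transformer, an instance can be passed as
the join stage of another TwoBranchGraph, which is the reuse scenario
described above.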

To sum up, I think distinguishing main output and side outputs, which
enables link / linkFrom, can help reduce the verbosity when describing the
DAG. So I am open to this option. However, what's unclear to me is how
link/linkFrom would work with the encapsulation case in (3). Do you have
some ideas for that?
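
For reference, this is how I read the link / linkFrom semantics from your
notes, written as a minimal sketch: linkFrom() runs the computation on the
main output of each input, and A.link(B) is the same as B.linkFrom(A). The
class bodies, SourceOp, NamedOp and the Table stand-in are my own
assumptions, not the Option-2 code, and the sketch deliberately leaves the
encapsulation question open.

```java
// Stand-in for Flink's Table so the sketch compiles without Flink (assumption).
final class Table {
    final String name;

    Table(String name) {
        this.name = name;
    }
}

abstract class AlgoOperator {
    private Table output;

    // Computation over the main output tables of the linked inputs.
    protected abstract Table compute(Table... inputs);

    protected final void setOutput(Table output) {
        this.output = output;
    }

    // Only the main output of each input is used, as stated in note (1).
    final AlgoOperator linkFrom(AlgoOperator... inputs) {
        Table[] tables = new Table[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            tables[i] = inputs[i].getOutputTable();
        }
        setOutput(compute(tables));
        return this;
    }

    // A.link(B) is B.linkFrom(A), as stated in note (2).
    final AlgoOperator link(AlgoOperator next) {
        return next.linkFrom(this);
    }

    final Table getOutputTable() {
        if (output == null) {
            throw new IllegalStateException("Operator has not been linked yet");
        }
        return output;
    }
}

// Hypothetical operator that exposes an existing table as its main output.
final class SourceOp extends AlgoOperator {
    SourceOp(Table table) {
        setOutput(table);
    }

    @Override
    protected Table compute(Table... inputs) {
        throw new UnsupportedOperationException("A source takes no inputs");
    }
}

// Toy operator: records which stage was applied to which inputs.
final class NamedOp extends AlgoOperator {
    private final String tag;

    NamedOp(String tag) {
        this.tag = tag;
    }

    @Override
    protected Table compute(Table... inputs) {
        StringBuilder sb = new StringBuilder(tag).append('(');
        for (int i = 0; i < inputs.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(inputs[i].name);
        }
        return new Table(sb.append(')').toString());
    }
}
```

In this reading each operator instance holds the result of its last
linkFrom() call, so it is not obvious how the same chain would be reused as
a node inside a larger DAG.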

Thanks,

Jiangjie (Becket) Qin


On Wed, Aug 11, 2021 at 10:04 AM Zhipeng Zhang <zhangzhipe...@gmail.com>
wrote:

> Hi Timo, Becket,
>
> Thanks for the feedback.
>
> I agree that having named tables can help make the code more readable. No
> matter whether there is one output table or multiple output tables, users
> have to access an output table by a magic index (when there is only one
> output table, we need to use index zero), which is somewhat hard to read.
>
> My point is: can we adopt the idea in Option-2 of distinguishing the
> main output from the side outputs via getOutputTable() and getSideOutputs()
> in the AlgoOperator API?
> As an Alink developer (Alibaba's machine learning library on Flink,
> https://github.com/alibaba/Alink), I do find that many machine learning
> algorithms have only one output table, and getOutputTable() is used more
> frequently than accessing other output tables.
>
>
> ```
> Table output =
>    transformer7.transform(
>    transformer6.transform(
>    transformer5.transform(
>    transformer4.transform(
>    transformer3.transform(
>      transformer2.transform(input2)[0], transformer1.transform(input1)[0]
>    )[0])[0])[0])[0])[0];
> ```
>
> For example, with the getOutputTable() and getSideOutputs() API, the above
> case would be written as:
>
>    Table output1 = op1.compute(input1).getOutputTable();
>    Table output2 = op2.compute(input2).getOutputTable();
>    Table output3 = op3.compute(output1, output2).getOutputTable();
>    Table output4 = op4.compute(output3).getOutputTable();
>    Table output5 = op5.compute(output4).getOutputTable();
>    Table output6 = op6.compute(output5).getOutputTable();
>    Table output = op7.compute(output6).getOutputTable();
>
> BTW, in Option-2, we proposed AlgoOperator::linkFrom() and
> AlgoOperator::link() to better support users building machine learning
> DAGs. In the AlgoOperator case, the above code can be simply written as:
>
> AlgoOperator output = stage3
>      .linkFrom(input1.link(stage1), input2.link(stage2))
>      .link(stage4)
>      .link(stage5)
>      .link(stage6)
>      .link(stage7);
>
> Table outputTable = output.getOutputTable();
>
> Note:
> (1) linkFrom() encapsulates the computation logic of this AlgoOperator.
> Only the first output table of each input will be used in the computation.
> (2) A.link(B) is equivalent to B.linkFrom(A).
>
>
>
> Becket Qin <becket....@gmail.com> wrote on Wed, Aug 11, 2021 at 8:49 AM:
>
> > Thanks for the feedback, Mingliang.
> >
> > Dong, I think what Mingliang meant by option-2 is the second way
> > mentioned in my email, i.e. having a Graph encapsulation. It does not
> > mean the option 2 in the FLIP. So he actually meant option 1 of the FLIP.
> > Mingliang can correct me if I misunderstood.
> >
> > Hi Timo,
> >
> > Thanks for taking a look at the FLIP and giving the feedback.
> >
> > Having named output tables could be helpful. It would make the code more
> > readable. That said, we might want to keep both index-based retrieval and
> > name-based retrieval. This is because the usefulness of indexes and named
> > tables may depend on the number of outputs we have. For example, if most
> > Transformers / Estimators have only one output, indexes are probably more
> > concise. Asking users to always get the output by name could be a little
> > verbose, plus users would also have to first find out the name of the
> > output. On the other hand, if a stage has a lot of output tables, named
> > outputs would help.
> >
> > Another thing is that users can always assign an output table to a
> > variable, which is equivalent to the named output except the name is user
> > defined. For example,
> >    Table transformerOutput1 = transformer1.transform(input1)[0];
> >    Table transformerOutput2 =
> >        transformer2.transform(input2, transformerOutput1)[0];
> >    Table transformerOutput3 = transformer3.transform(transformerOutput2)[0];
> >    Table transformerOutput4 = transformer4.transform(transformerOutput3)[0];
> >    Table transformerOutput5 = transformer5.transform(transformerOutput4)[0];
> >    Table transformerOutput6 = transformer6.transform(transformerOutput5)[0];
> >    Table output = transformer7.transform(transformerOutput6)[0];
> >
> > Does this give users an experience similar to named outputs?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Tue, Aug 10, 2021 at 4:04 PM Timo Walther <twal...@apache.org> wrote:
> >
> > > Hi everyone,
> > >
> > > I'm not deeply involved in the discussion but I quickly checked out the
> > > proposed interfaces because it seems they are using Table API heavily
> > > and would like to leave some feedback here:
> > >
> > > I have the feeling that the proposed interfaces are a bit too
> > > simplified.
> > >
> > > Methods like `Table[] transform(Table... inputs)` are very difficult to
> > > handle because they involve a lot of array index magic for implementers
> > > and users. Also the examples are hard to read because of all the index
> > > arithmetic going on:
> > >
> > >
> > > Table output =
> > >    transformer7.transform(
> > >    transformer6.transform(
> > >    transformer5.transform(
> > >    transformer4.transform(
> > >    transformer3.transform(
> > >      transformer2.transform(input2)[0],
> > >      transformer1.transform(input1)[0]
> > >    )[0])[0])[0])[0])[0];
> > >
> > >
> > >
> > > Table[] compute(Table... inputs) {
> > >     Table output1 = new AOp(...).compute(inputs[0])[0];
> > >     Table output2 = new AOp(...).compute(inputs[1])[0];
> > >     return new BTrainOp(...).compute(output1, output2);
> > > }
> > >
> > >
> > > Especially for larger pipelines, it will be difficult to distinguish
> > > between main output, statistics and other side outputs.
> > >
> > > Wouldn't it be better to introduce a new concept (maybe even on Table
> > > API level) to express a modular API operator that takes and returns
> > > multiple tables? Ideally, those parameters and results would be named
> > > and/or tagged such that the following operator can easily distinguish
> > > the different result tables and pick what is needed.
> > >
> > > That would make the interfaces a bit more complicated but help
> > > standardize the communication between modular operators.
> > >
> > > Of course this would need a separate design discussion, but non-ML
> > > users of the Table API could also benefit from it.
> > >
> > > Regards,
> > > Timo
> > >
> > >
> > > On 10.08.21 07:28, Dong Lin wrote:
> > > > Thank you Mingliang for providing the comments.
> > > >
> > > > Currently option-1 proposes Graph/GraphModel/GraphBuilder to build an
> > > > Estimator from a graph of Estimator/Transformer, where Estimator could
> > > > generate the model (as a Transformer) directly. On the other hand,
> > > > option-2 proposes AlgoOperator that can be linked into a graph of
> > > > AlgoOperator.
> > > >
> > > > It seems that option-1 is closer to what TF does than option-2. Could
> > > > you double check whether you mean option-1 or option-2?
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Aug 10, 2021 at 11:29 AM 青雉（祁明良） <m...@xiaohongshu.com> wrote:
> > > >
> > > >> Vote for option 2.
> > > >> It is similar to what we are doing with Tensorflow.
> > > >> 1. Define the graph in training phase
> > > >> 2. Export model with different input/output spec for online inference
> > > >>
> > > >> Thanks,
> > > >> Mingliang
> > > >>
> > > >> On Aug 10, 2021, at 9:39 AM, Becket Qin <becket....@gmail.com> wrote:
> > > >>
> > > >> estimatorInputs
> > > >>
> > > >>
> > > >>
> > > >>
> > >
> >
> > > >>
> > >
> > >
> >
>
>
> --
> best,
> Zhipeng
>
