Re: [DISCUSS] FLIP-173: Support DAG of algorithms (Flink ML)

Zhipeng Zhang Tue, 10 Aug 2021 19:03:55 -0700

Hi Timo, Becket,

Thanks for the feedback.


I agree that named tables can make the code more readable. Whether there is
one output table or several, users currently have to access an output table
by a magic index (even when there is only one output table, they need to
use index zero), which is hard to read.

My question is: can we adopt the idea in Option-2 and distinguish the main
output from the side outputs via getOutputTable() and getSideOutputs() in
the AlgoOperator API?
As developers of Alink (Alibaba's machine learning library on Flink,
https://github.com/alibaba/Alink), we find that many machine learning
algorithms have only one output table, and that getOutputTable() is used
more frequently than access to the other output tables.
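
To make the main-output / side-output split concrete, here is a minimal
sketch. It does not use the real Flink classes: StubTable, StubOperatorResult
and OutputApiSketch are hypothetical stand-ins invented for illustration, and
only the method names getOutputTable() and getSideOutputs() come from the
proposal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for Flink's Table: it only carries a name.
final class StubTable {
    final String name;

    StubTable(String name) {
        this.name = name;
    }
}

// Hypothetical result holder: one main output plus optional side outputs.
final class StubOperatorResult {
    private final StubTable mainOutput;
    private final List<StubTable> sideOutputs;

    StubOperatorResult(StubTable mainOutput, List<StubTable> sideOutputs) {
        this.mainOutput = mainOutput;
        this.sideOutputs = Collections.unmodifiableList(new ArrayList<>(sideOutputs));
    }

    // The common case: callers ask for the single main output, no index needed.
    StubTable getOutputTable() {
        return mainOutput;
    }

    // The less common case: statistics, models, or other auxiliary tables.
    List<StubTable> getSideOutputs() {
        return sideOutputs;
    }
}

class OutputApiSketch {
    // A toy operator that yields a main output and one statistics table.
    static StubOperatorResult compute(StubTable input) {
        StubTable main = new StubTable(input.name + "_scaled");
        StubTable stats = new StubTable(input.name + "_stats");
        return new StubOperatorResult(main, Arrays.asList(stats));
    }

    public static void main(String[] args) {
        StubOperatorResult result = compute(new StubTable("input"));
        System.out.println(result.getOutputTable().name);
        System.out.println(result.getSideOutputs().get(0).name);
    }
}
```

With this shape, the single-output case reads without any index, while the
side outputs stay reachable for the operators that produce them.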


```
Table output =
   transformer7.transform(
   transformer6.transform(
   transformer5.transform(
   transformer4.transform(
   transformer3.transform(
     transformer2.transform(input2)[0], transformer1.transform(input1)[0]
   )[0])[0])[0])[0])[0];
```

For example, with the getOutputTable() and getSideOutputs() API, the case
above would be written as:

   Table output1 = op1.compute(input1).getOutputTable();
   Table output2 = op2.compute(input2).getOutputTable();
   Table output3 = op3.compute(output1, output2).getOutputTable();
   Table output4 = op4.compute(output3).getOutputTable();
   Table output5 = op5.compute(output4).getOutputTable();
   Table output6 = op6.compute(output5).getOutputTable();
   Table output = op7.compute(output6).getOutputTable();

BTW, in Option-2 we proposed AlgoOperator::linkFrom() and
AlgoOperator::link() to help users build machine learning DAGs. With
AlgoOperator, the code above can be written simply as:

AlgoOperator output = stage3
     .linkFrom(input1.link(stage1), input2.link(stage2))
     .link(stage4)
     .link(stage5)
     .link(stage6)
     .link(stage7);

Table outputTable = output.getOutputTable();

Note:
(1) linkFrom() encapsulates the computation logic of this AlgoOperator.
Only the first output table of each input will be used in the computation.
(2) A.link(B) is equivalent to B.linkFrom(A).
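
The relationship between link() and linkFrom() can be sketched as follows.
This is only an illustration under stated assumptions: StubOp and LinkSketch
are hypothetical names, the operator records a textual plan instead of
computing Flink tables, and getOutputPlan() is an invented helper rather than
part of the proposed API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an AlgoOperator; it records a textual plan
// describing how its main output is derived, instead of real tables.
class StubOp {
    private final String name;
    private String plan;

    StubOp(String name) {
        this.name = name;
        this.plan = name;
    }

    // linkFrom: consume the main (first) output of each input operator.
    StubOp linkFrom(StubOp... inputs) {
        List<String> parts = new ArrayList<>();
        for (StubOp in : inputs) {
            parts.add(in.getOutputPlan());
        }
        this.plan = name + "(" + String.join(", ", parts) + ")";
        return this;
    }

    // link: A.link(B) delegates to B.linkFrom(A) and returns B so calls chain.
    StubOp link(StubOp next) {
        return next.linkFrom(this);
    }

    // Invented helper for this sketch: shows what the operator would compute.
    String getOutputPlan() {
        return plan;
    }
}

class LinkSketch {
    // Builds the same seven-stage DAG as the example in this email.
    static String buildPlan() {
        StubOp input1 = new StubOp("input1");
        StubOp input2 = new StubOp("input2");
        StubOp stage1 = new StubOp("stage1");
        StubOp stage2 = new StubOp("stage2");
        StubOp stage3 = new StubOp("stage3");
        StubOp stage4 = new StubOp("stage4");
        StubOp stage5 = new StubOp("stage5");
        StubOp stage6 = new StubOp("stage6");
        StubOp stage7 = new StubOp("stage7");

        StubOp output = stage3
                .linkFrom(input1.link(stage1), input2.link(stage2))
                .link(stage4)
                .link(stage5)
                .link(stage6)
                .link(stage7);
        return output.getOutputPlan();
    }

    public static void main(String[] args) {
        System.out.println(buildPlan());
    }
}
```

Because link() returns the downstream operator, a linear chain reads left to
right, and linkFrom() is only needed where several inputs join.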



On Wed, Aug 11, 2021 at 8:49 AM, Becket Qin <[email protected]> wrote:

> Thanks for the feedback, Mingliang.
>
> Dong, I think what Mingliang meant by option-2 is the second way mentioned
> in my email, i.e. having a Graph encapsulation. It does not mean the option
> 2 in the FLIP. So he actually meant option 1 of the FLIP. Mingliang can
> correct me if I misunderstood.
>
> Hi Timo,
>
> Thanks for taking a look at the FLIP and giving the feedback.
>
> Having named output tables could be helpful. It would make the code more
> readable. That said, we might want to keep both index-based retrieval and
> name-based retrieval. This is because the usefulness of index and named
> tables may depend on the number of outputs we have. For example, if most of
> the Transformers / Estimators have only one output, indexes are probably more
> concise. Asking users to always get the output by name could be a little
> verbose, plus users would also have to first find out the name of the output. On
> the other hand, in some other cases, if a stage has a lot of output tables,
> named output would help.
>
> Another thing is that users can always assign an output table to a
> variable, which is equivalent to the named output except the name is user
> defined. For example,
>    Table transformerOutput1 = transformer1.transform(input1)[0];
>    Table transformerOutput2 = transformer2.transform(input2[0],
> transformerOutput1)[0];
>    Table transformerOutput3 =
> transformer3.transform(transformerOutput2)[0];
>    Table transformerOutput4 =
> transformer4.transform(transformerOutput3)[0];
>    Table transformerOutput5 =
> transformer5.transform(transformerOutput4)[0];
>    Table transformerOutput6 =
> transformer6.transform(transformerOutput5)[0];
>    Table output = transformer7.transform(transformerOutput6)[0];
>
> Does this provide a similar experience as the named output to the users?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Tue, Aug 10, 2021 at 4:04 PM Timo Walther <[email protected]> wrote:
>
> > Hi everyone,
> >
> > I'm not deeply involved in the discussion but I quickly checked out the
> > proposed interfaces because it seems they are using Table API heavily
> > and would like to leave some feedback here:
> >
> > I have the feeling that the proposed interfaces are a bit too simplified.
> >
> > Methods like `Table[] transform(Table... inputs)` are very difficult to
> > handle because they involve a lot of array index magic for implementers
> > and users. Also the examples are hard to read because of all the index
> > arithmetic going on:
> >
> >
> > Table output =
> >    transformer7.transform(
> >    transformer6.transform(
> >    transformer5.transform(
> >    transformer4.transform(
> >    transformer3.transform(
> >      transformer2.transform(input2)[0], transformer1.transform(input1)[0]
> >    )[0])[0])[0])[0])[0];
> >
> >
> >
> > Table[] compute(Table... inputs) {
> >          Table output1 = new AOp(...).compute(inputs[0])[0];
> >          Table output2 = new AOp(...).compute(inputs[1])[0];
> >          return new BTrainOp(...).compute(output1, output2);
> >      }
> >
> >
> > Especially for larger pipelines, it will be difficult to distinguish
> > between main output, statistics and other side outputs.
> >
> > Wouldn't it be better to introduce a new concept (maybe even on Table
> > API level), to express a modular API operator that takes and returns
> > multiple tables? Ideally, those parameters and results would be named
> > and/or tagged such that the following operator can easily distinguish
> > the different result tables and pick what is needed.
> >
> > That would make the interfaces a bit more complicated but help
> > standardize the communication between modular operators.
> >
> > Of course this would need a separate design discussion, but non-ML
> > users of the Table API could also benefit from it.
> >
> > Regards,
> > Timo
> >
> >
> > On 10.08.21 07:28, Dong Lin wrote:
> > > Thank you Mingliang for providing the comments.
> > >
> > > Currently option-1 proposes Graph/GraphModel/GraphBuilder to build an
> > > Estimator from a graph of Estimator/Transformer, where Estimator could
> > > generate the model (as a Transformer) directly. On the other hand, option-2
> > > proposes AlgoOperator that can be linked into a graph of AlgoOperator.
> > >
> > > It seems that option-1 is closer to what TF does than option-2. Could you
> > > double check whether you mean option-1 or option-2?
> > >
> > >
> > >
> > >
> > > On Tue, Aug 10, 2021 at 11:29 AM 青雉（祁明良） <[email protected]> wrote:
> > >
> > >> Vote for option 2.
> > >> It is similar to what we are doing with Tensorflow.
> > >> 1. Define the graph in training phase
> > >> 2. Export model with different input/output spec for online inference
> > >>
> > >> Thanks,
> > >> Mingliang
> > >>
> > >> On Aug 10, 2021, at 9:39 AM, Becket Qin <[email protected]> wrote:
> > >>
> > >> estimatorInputs
> > >>
> > >>
> > >>
> > >>
> >
> > >>
> >
> >
>


-- 
best,
Zhipeng
