Hi,

And by the way, I was planning on writing another article to compare the performance of the DataSet, DataStream and SQL APIs over TPC-DS query 3. I thought that I could run the pipelines on an Amazon EMR cluster with different data sizes: 1 GB, 100 GB and 1 TB.

Would it be worth it? What do you think?

Best

Etienne

On 09/11/2022 at 10:04, Etienne Chauchot wrote:

Hi Yun Gao,

thanks for your email and your review!

My comments are inline

On 08/11/2022 at 06:51, Yun Gao wrote:
Hi Etienne,

Many thanks for the article! Flink is indeed continuing to increase its ability to unify batch / stream processing with the same API, and it's a great pleasure that more and more users are trying this functionality. But I also have some questions regarding some details.

First, IMO, as a whole in the long run Flink will have two unified APIs, namely the Table / SQL API and the DataStream API. Users can express their computation logic with these two APIs for both bounded and unbounded data processing.


Yes, that is also what I understood throughout the discussions and JIRAs. And IMHO, reducing the number of APIs to two was the right move.


Underneath, Flink provides two execution modes: the streaming mode works with both bounded and unbounded data and executes with incremental, state-based processing; the batch mode works only with bounded data and executes level by level, similar to traditional batch processing frameworks. Users can switch the execution mode via EnvironmentSettings.inBatchMode() or StreamExecutionEnvironment.setRuntimeMode().
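
For readers following along, here is a minimal sketch of both switches (assuming a Flink 1.16-style setup; the class name is invented for illustration):

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class BatchModeSetup {
    public static void main(String[] args) {
        // DataStream API: select the batch runtime mode on the environment.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        // Table / SQL API: the equivalent switch via EnvironmentSettings.
        EnvironmentSettings settings = EnvironmentSettings.newInstance().inBatchMode().build();
        TableEnvironment tableEnv = TableEnvironment.create(settings);
    }
}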

As recommended in the Flink docs [1], I have enabled the batch mode as I thought it would be more efficient on my bounded pipeline, but as a matter of fact the streaming mode seems to be more efficient on my use case. I'll test with higher volumes to confirm.



Specifically for DataStream, as implemented in FLIP-140, all the existing DataStream operations currently support the batch execution mode in a unified way [1]: data is sorted on the keyBy() edges according to the key, so the following operations like reduce() receive all the data belonging to the same key consecutively and can reduce the records of the same key directly, without maintaining intermediate state. In this way users can write the same code for both streaming and batch processing.


Yes, I have no doubt that my resulting Query3ViaFlinkRowDatastream pipeline will work with no modification if I plug an unbounded source into it.



# Regarding the migration of Join / Reduce

First, I think Reduce is always supported and users can write dataStream.keyBy().reduce(xx) directly. If the batch execution mode is set, the reduce will not be executed incrementally; instead it acts much like a sort-based aggregation in a traditional batch processing framework.
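
To illustrate, here is a self-contained sketch of such a keyed reduce under the batch execution mode (assuming Flink 1.16; the word-count data and class name are made up for the example):

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchKeyedReduce {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements("a", "b", "a", "c", "b", "a")
                .map(word -> Tuple2.of(word, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                // In batch mode the data is sorted by key, so the reduce sees
                // each key's records consecutively instead of keeping
                // incremental per-key state.
                .keyBy(t -> t.f0)
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
                .print();

        env.execute("batch-keyed-reduce");
    }
}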

Regarding Join, although the issue of FLINK-22587 indeed exists (the current join has to be bound to a window and the GlobalWindow does not work properly), with a bit more effort users do not need to rewrite the whole join from scratch: they can write a dedicated window assigner that assigns all the records to the same window instance and returns EventTimeTrigger.create() as the default event-time trigger [2]. Then the following works:

source1.join(source2)
    .where(a -> a.f0)
    .equalTo(b -> b.f0)
    .window(new EndOfStreamWindows())
    .apply(xxxx);

It does not require records to have event time attached, since the window trigger relies only on the time range of the window and the assignment does not need event time either.
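
For completeness, here is a sketch of what such a window assigner can look like, modeled on the flink-ml EndOfStreamWindows linked as [2] below (a simplified rewrite against the Flink 1.16 WindowAssigner API, not a verbatim copy):

import java.util.Collection;
import java.util.Collections;
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.WindowAssigner;
import org.apache.flink.streaming.api.windowing.triggers.EventTimeTrigger;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

// Assigns every record to one window spanning all of event time, so the
// window only fires when the bounded input ends and the watermark jumps
// to Long.MAX_VALUE.
public class EndOfStreamWindows extends WindowAssigner<Object, TimeWindow> {

    private static final TimeWindow GLOBAL_WINDOW =
            new TimeWindow(Long.MIN_VALUE, Long.MAX_VALUE);

    @Override
    public Collection<TimeWindow> assignWindows(
            Object element, long timestamp, WindowAssignerContext context) {
        return Collections.singletonList(GLOBAL_WINDOW);
    }

    @Override
    public Trigger<Object, TimeWindow> getDefaultTrigger(StreamExecutionEnvironment env) {
        // Fires once the watermark passes the window end, i.e. at end of input.
        return EventTimeTrigger.create();
    }

    @Override
    public TypeSerializer<TimeWindow> getWindowSerializer(ExecutionConfig executionConfig) {
        return new TimeWindow.Serializer();
    }

    @Override
    public boolean isEventTime() {
        return true;
    }
}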

The behavior of the join is also similar to a sort-based join if batch mode is enabled.

Of course it is not ideal to ask users to apply this workaround, and we'll try to fix this issue in 1.17.


Yes, this is a better workaround than the manual state-based join that I proposed. I tried it and it works perfectly with similar performance. Thanks.


# Regarding support of Sort / Limit

Currently these two operators are indeed not supported in the DataStream API directly. One initial thought for these two operations is that users may convert the DataStream to a Table and use the Table API for them:

DataStream<Xx> dataStream = ... // keeps the customized logic in DataStream
Table tableXX = tableEnv.fromDataStream(dataStream);
Table sorted = tableXX.orderBy($("a").asc());
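
Expanded into a runnable sketch (assuming Flink 1.16 with the Java Table API bridge; the field names "name" and "score" and the sample rows are invented for illustration):

import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class SortLimitViaTableApi {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Keep the customized logic in the DataStream API...
        DataStream<Row> dataStream = env
                .fromElements(Row.of("a", 3), Row.of("b", 1), Row.of("c", 2))
                .returns(Types.ROW_NAMED(
                        new String[] {"name", "score"}, Types.STRING, Types.INT));

        // ...then switch to the Table API for the sort / limit part.
        Table table = tableEnv.fromDataStream(dataStream);
        Table top2 = table.orderBy($("score").asc()).fetch(2);

        tableEnv.toDataStream(top2).print();
        env.execute("sort-limit-via-table");
    }
}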


Yes, I knew that workaround but I decided not to use it: I have a dedicated SQL-based implementation (for comparison purposes), so I did not want to mix the SQL and DataStream APIs in the same pipeline.


What do you think about this option? We are also assessing whether the combination of the DataStream API and the Table API is sufficient for all the batch users. Any suggestions are warmly welcome.


I guess that outside of my use case of comparing the performance of the 3 Flink APIs (a broader subject than this article), users can easily mix the APIs in the same pipeline. If we really want to have these operations in the DataStream API, maybe wrapping state-based implementations could be a good option, provided their performance meets expectations.


Best,
Yun Gao

I'll update the article and the code with your suggestions. Thanks again.

[1] https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/execution_mode/#when-canshould-i-use-batch-execution-mode


Best

Etienne



[1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution_mode/
[2] https://github.com/apache/flink-ml/blob/master/flink-ml-core/src/main/java/org/apache/flink/ml/common/datastream/EndOfStreamWindows.java



    ------------------------------------------------------------------
    From: liu ron <ron9....@gmail.com>
    Send Time: 2022 Nov. 8 (Tue.) 10:21
    To: dev <d...@flink.apache.org>; Etienne Chauchot
    <echauc...@apache.org>; user <user@flink.apache.org>
    Subject: Re: [blog article] Howto migrate a real-life batch
    pipeline from the DataSet API to the DataStream API

    Thanks for your post. It looks very good to me, and it may also be
    useful for developers.

    Best,
    Liudalong

    yuxia <luoyu...@alumni.sjtu.edu.cn> wrote on Tuesday, November 8, 2022 at 09:11:
    Wow, cool!  Thanks for your work.
    It'll be definitely helpful for the users that want to migrate
    their batch job from DataSet API to DataStream API.

    Best regards,
    Yuxia

    ----- Original Message -----
    From: "Etienne Chauchot" <echauc...@apache.org>
    To: "dev" <d...@flink.apache.org>, "User" <user@flink.apache.org>
    Sent: Monday, November 7, 2022 at 10:29:54 PM
    Subject: [blog article] Howto migrate a real-life batch pipeline from
    the DataSet API to the DataStream API

    Hi everyone,

    In case some of you are interested, I just posted a blog article
    about
    migrating a real-life batch pipeline from the DataSet API to the
    DataStream API:

    
    https://echauchot.blogspot.com/2022/11/flink-howto-migrate-real-life-batch.html

    Best

    Etienne
