Hi Vignesh,
In addition to what Ron has said, there are a number of options to consider,
depending on the nature of your calculations.
Given that your main focus seems to be latency:
* As Ron has said, Flink manages parallelism in a coarse-grained way that
is optimized for spending as little time as possible on synchronization, and
takes away the need to synchronize manually (see the sketch after this list)
* If you spawn your own threads (it is possible), you need to manage the
synchronization yourself, and this can add considerably to latency and create
back-pressure
* You would combine two implementations of parallelism and probably end
up with threads stealing CPU resources from each other
* When planning for horizontal scalability in Flink, plan CPU resources so
they can manage the workload
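To make the first point concrete, here is a minimal sketch (the toy data, the lambda and the parallelism of 8 are made up for illustration) of what "let Flink do the threading" looks like: the MapFunction itself stays single-threaded, and you scale it by raising the parallelism of the operator.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelMapSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Toy input; in a real job this would be your source.
        DataStream<String> input = env.fromElements("a", "b", "c");

        // The map stays a plain, single-threaded function. Flink runs 8 parallel
        // instances of it, each in its own task, so there is nothing to synchronize.
        DataStream<String> results = input
                .map(value -> "processed:" + value)
                .setParallelism(8);

        results.print();
        env.execute("parallel-map-sketch");
    }
}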
The fanout-and-collect pattern is a good idea in general, given that in order
to allow for horizontal scaling you need at least one network shuffle anyway
(a minimal sketch follows right after the list below):
* Make sure you use the best serializer possible for your data.
* Out of the box, the POJO serializer is hard to top; a hand-coded serializer
might help (keep this for later in your dev process)
* If you can arrange your problem so that operators can be chained into a
single chain, you can avoid serialization within the chain (see the chaining
snippet below)
* In Flink, if you union() or connect() multiple streams, chaining is
interrupted, and this adds considerably to latency
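Since the fanout-and-collect pattern keeps coming up, here is a naive sketch of it in Java. Everything in it (the OpResult type, N = 3, the string keys, the toy "operation") is made up for illustration and is not the scheme from [1]; it simply shows one way to replicate an element once per operation, spread the copies over subtasks, and recombine the N partial results, paying exactly the serialization/shuffle cost discussed above.

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class FanoutCollectSketch {

    // Number of independent operations per element (assumption for this sketch).
    static final int N = 3;

    // Public fields + no-arg constructor, so Flink's POJO serializer handles the shuffle.
    public static class OpResult {
        public String elementId;
        public int opId;
        public String result;

        public OpResult() {}

        public OpResult(String elementId, int opId, String result) {
            this.elementId = elementId;
            this.opId = opId;
            this.result = result;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> input = env.fromElements("a", "b", "c");

        // 1) Fan out: one record per (element, operation).
        DataStream<OpResult> fannedOut = input
                .flatMap((String element, Collector<OpResult> out) -> {
                    for (int op = 0; op < N; op++) {
                        out.collect(new OpResult(element, op, null));
                    }
                })
                .returns(OpResult.class);

        // 2) Shuffle by (element, operation) so the N copies of one element can be
        //    processed on different subtasks, then run the (toy) operation.
        DataStream<OpResult> processed = fannedOut
                .keyBy(r -> r.elementId + "#" + r.opId)
                .map(r -> new OpResult(r.elementId, r.opId, "op" + r.opId + "(" + r.elementId + ")"))
                .returns(OpResult.class);

        // 3) Collect: key by element and emit once all N partial results have arrived.
        processed
                .keyBy(r -> r.elementId)
                .process(new KeyedProcessFunction<String, OpResult, String>() {

                    private transient MapState<Integer, String> partials;

                    @Override
                    public void open(Configuration parameters) {
                        partials = getRuntimeContext().getMapState(
                                new MapStateDescriptor<>("partials", Integer.class, String.class));
                    }

                    @Override
                    public void processElement(OpResult r, Context ctx, Collector<String> out) throws Exception {
                        partials.put(r.opId, r.result);
                        int collected = 0;
                        for (Integer ignored : partials.keys()) {
                            collected++;
                        }
                        if (collected == N) {
                            StringBuilder combined = new StringBuilder(r.elementId + " ->");
                            for (String partial : partials.values()) {
                                combined.append(' ').append(partial);
                            }
                            out.collect(combined.toString());
                            partials.clear();
                        }
                    }
                })
                .print();

        env.execute("fanout-collect-sketch");
    }
}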
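And a short snippet for the serializer and chaining bullets (again only a sketch with toy data): disableGenericTypes() makes the job fail fast if any record type silently falls back to the generic Kryo serializer instead of the POJO serializer, and consecutive one-to-one operators with the same parallelism are chained by default, so no serialization happens between them.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainingSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Fail fast if any record type falls back to the generic (Kryo) serializer
        // instead of Flink's POJO serializer.
        env.getConfig().disableGenericTypes();

        DataStream<String> input = env.fromElements("a", "b", "c");

        // Same parallelism, one-to-one connection: Flink chains these two maps into
        // a single task, so records are handed over as Java objects without
        // serialization in between. A union()/connect() at this point would break
        // the chain and reintroduce serialization, as discussed above.
        DataStream<String> out = input
                .map(value -> "step1:" + value)
                .map(value -> "step2:" + value);

        out.print();
        env.execute("chaining-sketch");
    }
}

Whether chaining actually happened is visible in the job graph shown in the Flink web UI: chained operators appear as a single box.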
There is a neat trick to combine the parallelism-per-key-group and the
parallelism-per-algorithm into a single implementation and end up with single
chains with little de-/serialization except for the fanout:
* One of my students devised this scheme in his master's thesis (see [1],
chapter 4.4.1, p. 69)
* With his implementation we reduced back-pressure and latency
significantly, by some orders of magnitude
I hope this helps, feel free to discuss details 😊
Thias
[1] Master Thesis, Dominik Bünzli, University of Zurich, 2021:
https://www.merlin.uzh.ch/contributionDocument/download/14168
From: liu ron <[email protected]>
Sent: Tuesday, 15 August 2023 03:54
To: Vignesh Kumar Kathiresan <[email protected]>
Cc: [email protected]
Subject: Re: Recommendations on using multithreading in flink map functions in
java
Hi, Vignesh
Flink is a distributed parallel computing framework; each parallel instance of
a MapFunction already runs in its own thread. If you want more threads to
process the data, you can increase the parallelism of the MapFunction without
having to use multiple threads in a single MapFunction, which in itself
violates the original design intent of Flink.
Best,
Ron
Vignesh Kumar Kathiresan via user
<[email protected]> wrote on Tuesday, 15 August 2023 at 03:59:
Hello All,
Problem statement
For a given element, I have to perform multiple (let's say N) operations on it.
All N operations are independent of each other, and to achieve the lowest
latency I want to do them concurrently. I want to understand the best way to
do this in Flink.
I understand Flink achieves huge parallelism across elements, but is it an
anti-pattern to do parallel processing in a map function at the single-element
level? I have not seen anything on the internet about using multithreading
inside a map function.
I can always fan out multiple copies of the same element and send them to
different operators, but that incurs at least a serialize/deserialize cost and
may also incur a network shuffle. I am trying to see whether a multithreaded
approach is better.
Thanks,
Vignesh