Hi Vignesh,

In addition to what Ron has said, there are a number of options to consider, 
depending on the nature of your calculations:

Given that your main focus seems to be latency:

  *   As Ron has said, Flink manages parallelism in a coarse-grained way that 
is optimized to spend as little time as possible on synchronization, and it 
removes the need to synchronize manually
  *   If you spawn your own threads (it is possible), you need to manage 
synchronization yourself; this can add considerably to latency and create 
back-pressure
     *   You would be combining two implementations of parallelism and would 
probably end up with threads stealing CPU resources from each other
  *   When planning for horizontal scalability in Flink, provision CPU 
resources so they can handle the workload
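To make the synchronization cost concrete, here is a minimal plain-Java sketch (no Flink API; the class and method names are illustrative only) of what running N independent operations with a manually managed pool inside a map function would look like. The blocking join at the end is exactly where the added latency and back-pressure come from:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntUnaryOperator;

// Illustrative sketch only (no Flink API): running N independent operations
// on one element with your own thread pool, as you would have to do inside
// a map function. The blocking join at the end is the synchronization cost
// that adds latency and can cause back-pressure.
public class PerElementThreads {
    static List<Integer> applyAll(int element, List<IntUnaryOperator> ops) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(ops.size());
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (IntUnaryOperator op : ops) {
                Callable<Integer> task = () -> op.applyAsInt(element);
                futures.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : futures) {
                results.add(f.get()); // blocks until each operation finishes
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that the map function cannot emit its result before the slowest of the N operations has finished, so the per-element latency is the maximum, not the average, of the operation latencies.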
The fan-out-and-collect pattern is in general a good idea, given that in order 
to allow for horizontal scaling you need at least one network shuffle anyway
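As a sketch of the fan-out step (plain Java, illustrative names only): each element is duplicated once per algorithm and tagged with an algorithm id, so that a downstream keyBy on (key, algoId) can spread the copies across parallel instances for independent processing:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the fan-out step duplicates each element once per
// algorithm and tags it with an algorithm id. A downstream keyBy on
// (key, algoId) then distributes the copies across parallel instances.
public class FanOut {
    // Hypothetical element shape; nested records are implicitly static.
    record Tagged(String key, int algoId, double payload) {}

    static List<Tagged> fanOut(String key, double payload, int numAlgos) {
        List<Tagged> out = new ArrayList<>();
        for (int algo = 0; algo < numAlgos; algo++) {
            out.add(new Tagged(key, algo, payload)); // one copy per algorithm
        }
        return out;
    }
}
```

In a Flink job this would typically live in a flatMap, with the collect step implemented as a keyed window or process function that joins the N partial results back together.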

  *   Make sure you use the best serializer possible for your data.
  *   Out of the box, the POJO serializer is hard to beat; a hand-coded 
serializer might help (keep this for later in your development process)
  *   If you can arrange your problem so that operators can be combined into a 
single chain, you can avoid serialization within the chain
  *   In Flink, if you union() or connect() multiple streams, chaining is 
interrupted, and this adds considerably to latency
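For reference, the POJO serializer applies only when a type follows Flink's POJO rules: the class is public, has a public no-argument constructor, and all fields are either public or reachable via getters and setters. A minimal plain-Java example (the class name is illustrative; no Flink dependency is needed to show the shape):

```java
// A class Flink can treat as a POJO: public, with a public no-argument
// constructor, and fields that are public (or have matching getters/setters).
// Such types are handled by Flink's efficient PojoSerializer instead of
// falling back to the slower generic (Kryo) serializer.
public class SensorReading {
    public String sensorId;   // public field: directly serializable
    public long timestamp;
    public double value;

    public SensorReading() {} // required public no-arg constructor

    public SensorReading(String sensorId, long timestamp, double value) {
        this.sensorId = sensorId;
        this.timestamp = timestamp;
        this.value = value;
    }
}
```

If any rule is violated, Flink silently falls back to Kryo, which is noticeably slower; it is worth checking the job's startup logs for such fallbacks.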
There is a neat trick to combine parallelism-per-key-group and 
parallelism-per-algorithm into a single implementation and end up with single 
chains and little de-/serialization except for the fan-out

  *   One of my students devised this scheme in his master's thesis (see [1], 
chapter 4.4.1, p. 69)
  *   With his implementation we reduced back-pressure and latency 
significantly, by some orders of magnitude

I hope this helps, feel free to discuss details 😊

Thias



[1] Master Thesis, Dominik Bünzli, University of Zurich, 2021: 
https://www.merlin.uzh.ch/contributionDocument/download/14168



From: liu ron <ron9....@gmail.com>
Sent: Tuesday, 15 August 2023 03:54
To: Vignesh Kumar Kathiresan <vkath...@yahooinc.com>
Cc: user@flink.apache.org
Subject: Re: Recommendations on using multithreading in flink map functions in 
java

Hi, Vignesh

Flink is a distributed parallel computing framework; each parallel instance of 
a MapFunction already runs in its own thread. If you want more threads to 
process the data, you can increase the parallelism of the MapFunction rather 
than using multiple threads inside a single MapFunction, which would go 
against Flink's original design intent.

Best,
Ron

Vignesh Kumar Kathiresan via user 
<user@flink.apache.org> wrote on Tuesday, 15 August 2023, 03:59:
Hello All,

Problem statement
For a given element, I have to perform multiple (let's say N) operations on 
it. All N operations are independent of each other, and to achieve the lowest 
latency I want to run them concurrently. What is the best way to do this in 
Flink?

I understand that Flink achieves huge parallelism across elements, but is it 
an anti-pattern to do parallel processing at the single-element level in a map 
function? I have not found anything online about using multithreading inside a 
map function.

I can always fan out multiple copies of the same element and send them to 
different operators, but that incurs at least a serialize/deserialize cost and 
may also incur a network shuffle. I am trying to see whether a multithreaded 
approach is better.

Thanks,
Vignesh

This message is intended only for the named recipient and may contain 
confidential or privileged information. As the confidentiality of email 
communication cannot be guaranteed, we do not accept any responsibility for the 
confidentiality and the intactness of this message. If you have received it in 
error, please advise the sender by return e-mail and delete this message and 
any attachments. Any unauthorised use or dissemination of this information is 
strictly prohibited.
