Hi Oscar,

The rebalance operation goes through the network stack, but it does not 
necessarily involve a remote data shuffle. When data is exchanged between tasks 
on the same node, a local channel is used; compared to chained operators it 
still adds extra serialization overhead. When data is exchanged between tasks 
on different nodes, a remote network shuffle is involved.
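
To make it concrete, here is a minimal DataStream sketch (placeholder logic, 
not your actual job) of what happens once two adjacent operators run at 
different parallelism:

  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class RebalanceSketch {
      public static void main(String[] args) throws Exception {
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

          env.fromSequence(0, 1_000).setParallelism(6)
              // same parallelism and a forward connection: chained with the source
              .filter(x -> x % 2 == 0).setParallelism(6)
              // parallelism changes from 6 to 12 here, so the chain is broken and
              // Flink inserts a rebalance (round-robin) exchange between the operators
              .filter(x -> x > 10).setParallelism(12)
              .print().setParallelism(12);

          env.execute("rebalance-sketch");
      }
  }

If you put an explicit rescale() between the 6-parallel and 12-parallel 
operators instead of the default rebalance, each upstream subtask only 
distributes to 2 downstream subtasks, which tends to keep more of the exchange 
within the same TaskManager (depending on slot placement); the chain is still 
broken and records are still serialized, though.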

Therefore, breaking the chain with an extra rebalance operation definitely adds 
some overhead, but it is usually negligible at a small parallelism setting like 
yours. Could you share the details of the exceptions thrown after the change?
________________________________
From: Oscar Perez via user <user@flink.apache.org>
Sent: Monday, April 15, 2024 15:57
To: Oscar Perez via user <user@flink.apache.org>; pi-team <pi-t...@n26.com>; 
Hermes Team <hermes-t...@n26.com>
Subject: Flink job performance

Hi community!

We have an interesting problem with Flink after increasing parallelism in a 
certain way. Here is the summary:

1) We identified that our job's bottleneck was some co-keyed process operators 
that were causing backpressure on the previous operators.
2) We increased the parallelism of all operators from 6 to 12, but kept the 
operators that read from Kafka at 6. The main reason is that all our topics 
have 6 partitions, so increasing the source parallelism would not yield better 
performance.
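
Roughly, the change looks like this (a simplified sketch with placeholder names 
and topics, not our real pipeline):

  import org.apache.flink.api.common.eventtime.WatermarkStrategy;
  import org.apache.flink.api.common.serialization.SimpleStringSchema;
  import org.apache.flink.connector.kafka.source.KafkaSource;
  import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class ParallelismChangeSketch {
      public static void main(String[] args) throws Exception {
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
          env.setParallelism(12);  // new default parallelism for the job

          KafkaSource<String> source = KafkaSource.<String>builder()
              .setBootstrapServers("broker:9092")                 // placeholder
              .setTopics("events")                                // placeholder topic with 6 partitions
              .setGroupId("example-group")                        // placeholder
              .setStartingOffsets(OffsetsInitializer.earliest())
              .setValueOnlyDeserializer(new SimpleStringSchema())
              .build();

          env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events")
              .setParallelism(6)                   // kept at 6: the topic only has 6 partitions
              .filter(e -> !e.isEmpty())           // this and the following run at the default of 12
              .map(String::trim)
              .returns(String.class)               // type hint for the method reference
              .filter(e -> e.length() > 1)
              .print();

          env.execute("parallelism-change-sketch");
      }
  }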

See the attached job layout prior to and after the changes.
What happened is that some operations that used to be chained in the same 
operator, like read - filter - map - filter, are now rebalanced, and the 
overall performance of the job is suffering (it keeps throwing exceptions now 
and then).

Does the rebalance operation go over the network, or does it happen within the 
same node? How can we effectively improve the performance of this job with the 
given resources?

Thanks for the input!
Regards,
Oscar

